Wikistats reports on article revert trends

At Wikimania Gdansk I talked about the newest addition to Wikistats: article revert trends.
This week the new set of tables and charts has been released for almost 800 Wikimedia wikis. For example here is the report for the English Wikipedia.

In recent years there has been much talk within the Wikimedia community about edit reverts. Is there a growing tendency to revert (undo) new contributions, especially edits by anonymous users? If so, does this discourage new users from contributing? How large is the share of reverts anyway? Is an initiative like Flagged Revisions (FlagRevs), which seeks to shield the general public from deliberate but mostly short-lived lapses in article quality, effective? Is it also non-disruptive? The latter questions also bear some relevance for the current Pending Changes initiative, which was inspired by FlagRevs.

Particularly on the German Wikipedia, where FlagRevs first took off, there is a set of relevant statistics already, notably these trend charts by user ParaDox. Also some months ago Felipe Ortega was asked by the German Chapter to look at the effect of FlagRevs. Felipe also presented at Wikimania Gdansk. Although there are differences in our approaches, and our research only partially overlapped, I am glad our conclusions are roughly the same (see our presentations).

analysiswpde

As always I looked for an approach that is language independent, and therefore applicable to all our wikis, yet yields solid results. In this case the crucial factor was how to recognize reverts in our archives (XML dumps).

One way is to look for keywords in the edit comments that correlate well with reverts. The advantage is that it can detect incomplete reverts, where a user manually undoes most of the last changes but does not completely restore the previous state of the article. The drawbacks are that many false positives are inevitable, and of course it does not scale well to 280 languages.
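The keyword approach can be sketched roughly as follows. This is only an illustration: the keyword list and the name `looks_like_revert` are assumptions made for this example (English only), not the list Wikistats uses.

```python
import re

# Hypothetical keyword list for English edit comments. A real list would
# need tuning per wiki and is not taken from Wikistats.
REVERT_PATTERN = re.compile(
    r"\b(revert(ed|ing)?|rv|rvv|undid|undo)\b", re.IGNORECASE
)


def looks_like_revert(comment):
    """Return True if the edit comment contains a revert-like keyword."""
    return bool(REVERT_PATTERN.search(comment or ""))


comments = [
    "Reverted edits by 192.0.2.7",
    "rv vandalism",
    "added references",
    "Fixed typo in the Revert section",  # not a revert, yet it matches
]
flags = [looks_like_revert(c) for c in comments]
```

The last comment shows why false positives are hard to avoid: the keyword appears although the edit itself is not a revert.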

There is a way to detect complete reverts efficiently: calculate a checksum for each revision, and compare this with checksums of earlier revisions. The MD5 checksum algorithm guarantees with near certainty that equal checksums imply equal input, byte for byte, in this case full revision contents. Another important advantage: it is completely irrelevant what language or encoding is used in any wiki. For a few large wikis I tried both approaches: a wiki-specific set of keywords in comments, and MD5 checksums. However, tables and charts in the new reports are almost solely based on the more reliable MD5 approach.
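A minimal sketch of the checksum idea, assuming the revisions of one article are available in chronological order as a list of strings. The function name `find_reverts` and the choice to ignore a revision identical to its direct predecessor are assumptions for this example, not necessarily how Wikistats handles these cases.

```python
import hashlib


def find_reverts(revisions):
    """Return (index, restored_index) pairs for revisions whose full text
    is byte-for-byte identical to an earlier revision of the same article."""
    last_seen = {}      # checksum -> index of the latest revision with that text
    previous = None     # checksum of the directly preceding revision
    reverts = []
    for index, text in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        # Identical to the direct predecessor: nothing was undone, so skip.
        if digest in last_seen and digest != previous:
            reverts.append((index, last_seen[digest]))
        last_seen[digest] = index
        previous = digest
    return reverts


history = ["A", "A vandalised", "A", "A plus B", "A plus B"]
reverts = find_reverts(history)
```

Here revision 2 restores the content of revision 0, so it counts as a revert of revision 1. Only the checksums need to be kept in memory, not the revision texts.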

articlebush1

Here is a summary of the tables and charts that together comprise the new report. Tables: ranking of most reverted users, most reverting users, most reverted articles. Charts: edit trends (some published earlier) and revert trends. Revert trends come in two variations: revert ratio per class of editor (registered editor, anonymous editor, or bot) and breakdown of anonymous edits by class of reverting editor. All charts come in two variations: one shows the raw numbers, the other shows trends (raw data minus seasonal patterns and random variations).
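To give an idea of what separating a trend from the raw numbers can look like, here is a small sketch using a trailing 12-month moving average. This is a stand-in technique chosen for illustration. The smoothing method actually used for the charts is not described here, and `moving_average` is a hypothetical helper.

```python
def moving_average(values, window=12):
    """Trailing moving average over `window` months.

    Returns None for the first window - 1 months, where no full window
    is available yet.
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the month leaving the window
        smoothed.append(running / window if i >= window - 1 else None)
    return smoothed


# Two years of made-up monthly counts with a pure yearly pattern.
raw = [10 + (month % 12) for month in range(24)]
trend = moving_average(raw)
```

Because every window spans exactly one year, the yearly pattern averages out and the trend stays flat for this made-up series.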

revertratios

Surely there is ample risk of information overload. But I am convinced that for many of our wikis some people will be motivated to dig into these stats and come up with a context and some explanations, a story which will have wider implications beyond that particular wiki.

Hopefully the new stats can bring some solid numbers to discussions about our revert policies. Of course these numbers are influenced by methodology, are selective (meaning they don’t address all issues), and probably raise some new questions. The data files generated in this project can be reused in further research, e.g. to determine which share of bad edits is detected and undone before an updated article revision is released to the public.

Disclaimer: complex social phenomena will never be explained by quantitative data alone, and seldom yield to what-if questions; nor do they lend themselves easily to double-blind experiments.

See also the slides for my Wikimania Gdansk presentation.

Note: in the new reports some language links are missing. Until this is fixed, please use this table to access all reports.

Update: anyone interested in studying revert patterns is advised to look into the work of Palo Alto Research Center scientist Ed Chi and his colleagues. Ed has been studying Wikipedia and its growth patterns for many years now. Vandalism and reverts were part of that research. Among his many publications several focus on Wikipedia; two recent ones relevant to this topic are The Singularity is Not Near: Slowing Growth of Wikipedia (2009) and What’s in Wikipedia? Mapping Topics and Conflict Using Socially Annotated Category Structure (2009).

Wikimedia page views, some good and bad news

First the good news: there is a new summary report for Wikimedia page views that presents trends for nearly all projects on a single page.

Now the bad news: a few days ago it was established that the server that collects and aggregates log data for all squids could not keep up with all incoming messages, and hence underreported page views. When I suggested that recent page view trends looked very suspicious, Tim Starling and Mark Bergsma quickly analyzed the cause and fixed the server overload. Kudos to them. For April - July 2010 I could still infer the amount of underreporting from available log files. Counts for these months have been corrected. For earlier months, possibly from Nov 2009 till March 2010, counts are still too low. For details on the error correction see this PDF.

Reports affected: all wikistats reports that are based on dammit.lt hourly log files, notably page view reports and server request reports. The same goes for the monthly Report Card. Earlier editions of the monthly server request reports have not yet been corrected the way the page view trend reports have (maybe just a notice will be added). Of course, even though absolute numbers are too low, comparisons are not affected (e.g. market share per browser or OS). Other sites that build on these log data will also be affected, notably stats.grok.se, trendingtopics, amaglamate.

Edit Trends per Wikimedia Project

The two charts below show the trend in number of edits per month, first for the German Wikipedia, then for all Wikipedias combined.

ploteditsde1

ploteditszz

The blue lines show us that since early 2007 the monthly volume of manual edits by registered (and logged-in) users has been fairly consistent, both for the German Wikipedia and overall.

The red lines show us that since mid 2007 the number of manual edits by so-called anonymous users has dropped considerably on the German Wikipedia, unlike the overall trend.

This has relevance for the assessment of flagged revisions, which has been a much discussed feature since its inception on the German Wikipedia. According to Wikipedia “On May 6, 2008, Flagged Revisions were enabled, first for a test, on the German Wikipedia. After an extended test period with lots of bug fixing and massive controversial discussions, a straw poll was organized in August. Over 1,200 users contributed, with the majority voting in favor of the new system.”

One of the expected effects was that some anonymous editors would become less motivated to contribute now that immediate gratification was no longer available: they no longer enjoyed instantaneous worldwide visibility of their actions. Hopefully this would discourage vandals from even attempting vandalism, but it might also discourage well-intentioned unregistered contributors (for many registered users, so-called ‘auto sighting’ meant there was no difference).

It is true that the German Wikipedia witnessed a much steeper decline in anonymous edits than the average Wikipedia, but then again it is certainly not unique in this respect: the Japanese and Polish Wikipedias are also examples where anonymous edits are declining steeply. Perhaps more significant is that the moment of introduction of flagged revisions on the German Wikipedia is really not visible in the chart: there is no strong shift in the red line in mid 2008, rather a gradual downward trend.

Edit trend charts are available for all Wikimedia projects

Charts like the ones shown here have been generated for all 800 or so Wikimedia projects and will be refreshed regularly. A couple of Wikimedia projects are now using the flagged revisions feature; implementations differ. Please share your thoughts on possible interpretations for those wikis.

For each project there is a page showing counts plus links to individual charts.
For each project there is also a page where all charts are displayed together.
For the largest projects there is also a page with most edited wikis only.

editcountstable


Wikipedia Table & Charts / Wikibooks Table & Charts / Wikinews Table & Charts
Wikiquote Table & Charts / Wikisource Table & Charts / Wikiversity Table & Charts
Wiktionary Table & Charts / Other projects Table & Charts


Clearly edit and page view counts are mutually dependent via the virtuous circle.

Therefore the Edit Trend charts can also be viewed from the Page Views reports.

pageviews1

Full history dump for English Wikipedia is back

Normally I do not blog about the arrival of a new Wikipedia dump, but this occasion is truly something to be celebrated!

The dump at hand contains all revisions of all articles in the English Wikipedia (except articles and single revisions that were deleted from the database). It is hugely important for researchers: many cherish the English Wikipedia as their favorite object of study, it being the oldest, largest, most viewed and edited Wikipedia. This particular dump has had a long history of ill fortune: for 4.5 years no new full archive dump for the English Wikipedia has been consistently available 1.

Over the years there has not been a shortage of outcry over its absence, despite serious attempts to remedy the situation. Mind you, it was not for a lack of trying that no new dump made it to completion. Wikimedia’s lead developer and later CTO Brion Vibber, amid a thousand other responsibilities, did improve the robustness of the dump job over time (many mishaps can occur over a period of months of running time, e.g. network outages). But fixing the dump proved a moving target. On every attempt the sheer amount of work to be done had increased enormously. Once the script had been made totally robust, processing time became the next hurdle. At some stage a new phenomenon occurred: during a run, month after month the expected time of arrival (ETA) kept going up instead of down. Memorably, when on one run the job signaled there was still a year to go before completion, it was manually aborted.

All is well that ends well. In 2009 lead developer Tim Starling managed to find the time to revise the internal compression scheme used to store articles efficiently in the database, thus significantly shortening dump time. Meanwhile Tomasz Finc worked hard to upgrade and overhaul the entire dump infrastructure, at both the hardware and the software level. On March 11, 2010, after a measly 40 days of processing time, a full archive English Wikipedia dump was finally produced, which in another 15 days was recompressed into a much more compact format. Meanwhile wikistats started to do its part, and after 21 days of data crunching complete wikistats are again available for the English Wikipedia!

editsenglishwikipedia1

1: I said ‘consistently available’ because in 2007 at least one valid dump was indeed produced, but it got lost within weeks due to an overzealous space cleanup script, alas before it could be processed by wikistats. Several incomplete dumps followed (some believe there was a complete and valid dump in 2008, but if so it was also short-lived). Until now the latest full dump that wikistats had been able to process dated from Sep 2006.

Some stats on the new dump

Dump size

  • 5.6 TB uncompressed
  • 280 GB in bz2 compression format
  • 32 GB in 7z compression format

Edits all time

Edits Jan 2010

  • There are 4 million article edits per month
  • 40,000+ registered people made 5+ article edits, 4,000+ made 100+ edits
  • 10,000+ users made 5+ discussion page edits, 1,000+ made 100+ edits
  • 10,000+ users made 5+ other page edits, 1,000+ made 100+ edits
  • 100+ bots edited 5+ times, 7 bots even 10,000+ times

Content (article namespace only)

  • 3.2 MM article pages
  • 4 MM redirect pages
  • 1.9 MM links to images (800 k unique images)
  • 7 MM links to other Wikimedia projects
  • 78 MM links to other pages in this Wikipedia
  • 11 MM links to other web sites
  • 1.8 billion words

Wikipedia page views, a global perspective (2)

Today I released four new wikistats reports, which all focus on our global readership.
For years a report has provided insight into how many pages each Wikipedia serves monthly.
More recent reports provide detailed cross sections of traffic.

But our understanding of where in the world those pages are read could be improved.
The following reports, with varying levels of detail, aim to fill this gap.

Like the other reports mentioned, these are also based on the sampled server logs. One in a thousand server accesses is logged and stored for a limited time for analysis purposes. A company called MaxMind publishes a database for translating IP addresses into geographic information (which is even free at country level). Thanks to Mark Bergsma for building the MaxMind lookup filter. See below the screenshot for dirty details.

overview

Dirty details

The 1:1000 sampled squid logs are scanned one day at a time. Counts per day are added to monthly totals. Since each record stands for a thousand original server accesses, totals are multiplied by 1000.
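The aggregation step might look roughly like this. It is a sketch under assumptions: the input layout (date, country code, number of sampled records) and the name `monthly_estimates` are invented for this example, and the numbers are made up.

```python
from collections import Counter

SAMPLING_FACTOR = 1000  # one logged record per thousand server accesses


def monthly_estimates(daily_counts):
    """Add per-day sample counts to monthly totals, then scale them up.

    daily_counts: iterable of (date as 'YYYY-MM-DD', country, sampled_records).
    Returns a dict mapping (month as 'YYYY-MM', country) to estimated views.
    """
    totals = Counter()
    for day, country, sampled_records in daily_counts:
        totals[(day[:7], country)] += sampled_records
    return {key: count * SAMPLING_FACTOR for key, count in totals.items()}


# Made-up sample counts for illustration only.
sample = [
    ("2010-06-01", "NL", 12),
    ("2010-06-02", "NL", 9),
    ("2010-06-02", "KE", 1),
    ("2010-07-01", "NL", 11),
]
estimates = monthly_estimates(sample)
```

Scaling after summing keeps the intermediate totals as plain record counts, which are easier to check against the log files.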

The counts do not include page requests by bots (crawlers, spiders, etc). Many bots make themselves known by sending identifying information with the page requests. Filtering those is easy. Unfortunately quite a few bots choose to stay anonymous.

On average, for every thousand pages viewed on one day by one person, just one server request ends up in the log (on average, meaning it can be zero on one day and two on another). But a thousand page views is a huge amount for an ordinary person. By far most IP addresses that occur more than once on a single day are bots, which can easily issue many thousands of page requests in a day. Therefore all IP addresses that occur more than once are skipped. A few very avid readers may thus be missing (false negatives). A huge school that sends all outgoing traffic over one IP address may also result in a handful of log records that are discarded (again false negatives). On the other hand, anonymous bots with modest activity may appear only once in the log and be counted (false positives). I am confident that these minor unavoidable imprecisions have little influence on final results.
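The rule of skipping repeated IP addresses can be sketched as below, assuming one day of sampled records is held in memory as (ip, url) pairs. The record layout and the name `drop_repeated_ips` are assumptions for this example, and the addresses come from a range reserved for documentation.

```python
from collections import Counter


def drop_repeated_ips(day_records):
    """Keep only records whose IP address occurs exactly once that day.

    day_records: list of (ip, url) tuples from one day of the sampled log.
    """
    occurrences = Counter(ip for ip, _url in day_records)
    return [record for record in day_records if occurrences[record[0]] == 1]


day = [
    ("192.0.2.1", "/wiki/Gdansk"),
    ("192.0.2.2", "/wiki/Squid"),
    ("192.0.2.2", "/wiki/Bot"),      # same address twice: assumed to be a bot
    ("192.0.2.3", "/wiki/Swahili"),
]
kept = drop_repeated_ips(day)
```

Both records from the repeated address are dropped, not just the second one, which matches the description that such addresses are skipped altogether.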

Note: do not mistake the page views presented in the reports for visits or even unique visitors. Those are entirely different metrics that cannot be deduced from the sampled log.

Anomalies

There is at least one clear anomaly in the presented numbers: 19% of traffic originating in Australia seems to go to the Japanese Wikipedia. This is almost certainly incorrect. Five large organizations (RIRs), organized roughly by continent, are responsible for handing out large ranges of IP addresses to internet service providers. Ronald Beelaard (who has a lot of experience in hunting down mischievous bots) was able to explain this: a large chunk of IP addresses had recently been reassigned by APNIC, and in the absence of proper administration that network defaults to country code AU. Time will cure this.

Also rather unexpected is that 85% of the page views to the Swahili Wikipedia originated from the USA. This may well be caused by the Google contest in recent months, where nice prizes are offered for new content on the Swahili Wikipedia. Also keep in mind that access to the Swahili Wikipedia from Kenya and Tanzania has so far been very limited, with just tens of thousands of monthly views.