Wikistats is back again

This post is a repost. A few weeks ago I blogged about the fact that new wikistats are available for the first time since May 2008, and about a number of functional updates in the scripts. Unfortunately part of the wikistats output was wrong due to a localization bug. See ‘Wikistats is back’ for a brief explanation of what went wrong. So here is almost the same story with new data and new screen prints.

The reports have been generated on the new wikistats server ‘Bayes’, which has been operational for about two months. New reports are available for Wikibooks, Wiktionary, Wikinews, Wikiquote, Wikisource, Wikiversity and Wikispecial.

The dump process itself had been restarted some weeks earlier; new dumps are now available for all 700+ wiki projects (with the English Wikipedia as the usual exception*).

Revised schedule

From now on the wikistats reports will be updated much more frequently. The actual processing of any new dump starts soon after the dump becomes available, and the results are stored in intermediate files. Updated reports will be published once every week in English, and once every month in all 27 languages.

In the old days, when wikistats ran as a guest job on a web server, the job was manually started only after new database dumps had been generated for all wikis. Thus several weeks of further delay were added to an already very long dump process. Another effect that made the reports always seem out of date is the fact that data for the month(s) in which the dump is generated are ignored: for the largest dumps, articles that have been dumped early in the process are inevitably days or even weeks less up to date than articles dumped near the end, which makes counts for incomplete months totally useless.**
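The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the actual wikistats code: the function name `last_reportable_month` and the example date are assumptions made for this sketch.

```python
from datetime import date


def last_reportable_month(dump_start: date) -> tuple[int, int]:
    """Return (year, month) of the last month that is complete in a dump.

    Assumption: every month from the one in which the dump started onward
    is treated as incomplete and therefore ignored.
    """
    if dump_start.month == 1:
        # A dump started in January can only report up to December
        # of the previous year.
        return dump_start.year - 1, 12
    return dump_start.year, dump_start.month - 1


# Hypothetical dump that started in mid October 2008.
print(last_reportable_month(date(2008, 10, 15)))
```

However long the dump takes to finish, the last reportable month stays fixed at the month before it started, which is why a slow dump makes the reports look so dated.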

For now the wikistats job can keep up well with the speed at which dumps are generated. The data gathering step in wikistats (counts section) has been optimized over and over in the past 6 years, to accommodate larger and larger dumps and cope with steadily more limiting hardware resources, but recent performance profiling and analysis showed there is still room for considerable improvement. The reporting section really needed some attention first. Running it four times a month, at least for English, is now doable. There was a time when the production of hundreds of reports in 27 languages took a mere 10 minutes. This had grown to almost a day, but it's now back to a couple of hours. The technical term for databases that grow in size continuously is autonomous growth. Wikistats contains abundant evidence that this term is made for what we do.

Functional updates

Sort sequence
In the sitemap and comparison reports languages are now sorted by page views in recent days. This is in my opinion a marked improvement over sorting by any other metric (and the reports have already changed a couple of times in this respect). The resulting changes in sort order are food for thought: in some cases the number of articles and the total article views per time unit rank a wiki quite differently. Of course the number of speakers of a language is a major factor in article views, but so it is in the number of articles.

Old sort sequence: article counts
New sort sequence: article views

More rankings
Any sort sequence is inevitably seen as some sort of a ranking by itself, almost as a measure of accomplishment. Over the years competition between Wikimedia language projects grew, for some, to a point where it arguably became detrimental to our endeavour. Some bots were welcomed at least partly because of their boosting effects on article count. This competition will never die out completely. Hopefully some day impartially produced automated rankings for average article quality will enter into the equation. Until then it may help to diffuse the attention by offering rankings on more metrics. Note that a wiki can’t score high on every metric: a high ranking for ‘new articles per day’ in a given month will inversely influence the ranking for ‘average edit count per article’ in that month, as can be clearly seen from this example. The fact that the Dutch Wikipedia ranked first in new articles per day in the third quarter of 2008 suggests major bot activity. But that is for now an educated guess at best.
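To make the point about multiple rankings concrete, here is a minimal Python sketch that ranks the same wikis on two metrics. The wiki names, the metric keys and all numbers are invented for illustration; they are not taken from wikistats.

```python
def rank_by_metric(stats, metric):
    """Return wiki names ordered from highest to lowest value of `metric`."""
    return sorted(stats, key=lambda wiki: stats[wiki][metric], reverse=True)


# Invented numbers, for illustration only.
stats = {
    "wiki_a": {"new_articles_per_day": 400, "edits_per_article": 9.5},
    "wiki_b": {"new_articles_per_day": 120, "edits_per_article": 21.0},
    "wiki_c": {"new_articles_per_day": 250, "edits_per_article": 14.2},
}

for metric in ("new_articles_per_day", "edits_per_article"):
    print(metric, rank_by_metric(stats, metric))
```

In this made-up data the wiki that leads on new articles per day comes last on edits per article, which is the kind of trade-off described above.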

Wikistats progress tracker
Complementary to the concise dump progress report there is now a wikistats data gathering progress report.

Wikipedia project wide totals without English

For a long time the table and bar charts that showed project wide totals for all Wikipedias have been frozen in time at Sep 2006. In order to get at least a better understanding of how other projects fared there is now also a version with the English Wikipedia excluded: see this table and these bar charts.
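Why excluding one language unfreezes the totals can be shown with a small Python sketch. It assumes that a project wide total for a month is only meaningful when every included language has data for that month; the function name `project_total` and the counts are hypothetical.

```python
def project_total(counts, exclude=()):
    """Sum a per-language count for one month.

    Returns None when any included language has no data (None) for that
    month, since a partial sum would understate the project wide total.
    """
    included = [value for lang, value in counts.items() if lang not in exclude]
    if any(value is None for value in included):
        return None
    return sum(included)


# Invented counts for one month; English has no data for this month.
new_contributors = {"en": None, "de": 1500, "fr": 1200}

print(project_total(new_contributors))
print(project_total(new_contributors, exclude={"en"}))
```

With English included the total is unavailable for any month the English dump does not cover; with English excluded the remaining languages can still be summed.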

New contributors to Wikipedia project for all languages

New contributors to Wikipedia project for all languages, except English

Note that the chosen example shows about the only metric that is in decline on non-English Wikipedias: ‘New contributors per month’. A lot has been speculated on mailing lists about Wikipedia projects being in decline. In a general sense this is a severe overstatement. Some large Wikipedias do indeed see a decline in (very) active contributors, others an increase. For many metrics the project wide totals are steady.

People often cite the overcomplicated user interface as the culprit for the waning influx of new editors: the learning curve has become too steep ***. This may be a factor, and the upcoming project funded by the Stanton Foundation to improve this situation is indeed most welcome. But another factor seems to me just as important: for many of the largest Wikipedias the awareness of their existence among potential editors (people with web access who speak the language) must be very near 100% now. In recent years, while Wikipedia gained notoriety, many more potential editors were thrilled to learn of the existence of the project. Now almost everyone knows. To a lesser extent the same effect might still explain part of the influx to our other projects for some time.

The following chart, with ‘total edits to Wikipedia projects for all languages except English’, suggests that Wikipedia has moved from rapid growth to sustainability, as the numbers have been fairly steady for a long time already. It may be, however, that a growing number of bot edits masks a diminishing activity of human editors. This I don’t know yet. Some day I hope to present human and bot edits separately.

Total edits to Wikipedia projects for all languages except English

According to a database scan by Wikipedian Dragons Flight the number of very active bots on Wikipedia is clearly rising, but the overall monthly edit totals are not presented there (rather editor totals).

Smaller tweaks and fixes
Most noteworthy: SVG images are now shown properly in Firefox.

*The absence of a full archive dump for the English Wikipedia dates back to October 2006. Now that WMF has more resources at its disposal, some work is finally in progress to address this situation, which is unsettling for many more reasons than just missing wikistats. Possibly the dump code will be reworked to support parallelization.

**The newest attempt to dump the English Wikipedia has been running for 103 days and has 284 days to go (four weeks ago this was 77 days, with 219 days to go).

In the event that this job really runs successfully to completion at the end of October 2009, wikistats can only report new figures up to September 2008: any month after that is only partially represented in the dump. See also previous note.

***In my opinion for many articles the content of the edit box has become almost undecipherable, which poses much more of a problem for newcomers (and to a lesser extent experienced editors) than the overall site usability.


5 Responses to Wikistats is back again

  1. Pingback: Infodisiac » Wikistats is back

  2. bawolff says:

    Thanks for the stats – they provide interesting insights into what's happening.

    I was thinking, it might be useful to have stats that compare Wikinews month by month rather than overall, since unlike most projects, articles have a limited time frame, so stuff like the number of articles per day is much more meaningful than total number of articles. Just a thought.


  3. bawolff says:

    The SVG doesn’t seem to work for me. Firefox keeps trying to download them (and leaving a blank image spot on the page). Using: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv: Gecko/20081114 BonEcho/

  4. Erik says:

    Bawolff, I see what you mean. That is possible. It takes a bit more than flipping one switch, so I’ll have to put it on the todo list.
    Cheers, Erik

  5. GreenReaper says:

    It will be interesting to see whether the edit count actually drops as a result of AbuseFilter.
