New statistics for the English Wikipedia (2)

This post gives in-depth technical coverage of the new statistics that were introduced in the previous post.

Dump process

In recent months dumps for all but one of the Wikimedia wikis have been produced with delightful speed and regularity. Tomasz Finc, thank you for this! This leaves the English full archive dump as the pièce de résistance. After three years of good intentions and reassurances (which in hindsight probably delayed a new approach), and commitments that time and again had to yield to even higher priorities, it became clear that workaround and compromise were the only way to make progress really soon. [1] Again Tomasz’ resourcefulness and enthusiasm were very helpful. [2] A custom-made enriched version of an existing partial ‘stub’ dump now allows the wikistats scripts to collect article counts, edit counts and editor counts for the colossal English Wikipedia as well. [3]

So now, for the first time since October 2006, there are new wikistats data for the English Wikipedia. Without those English counts, project-wide totals (= for all wikipedias combined) made no sense, so for that reason too the new data are very welcome. [4]

Refined editor and bot stats, split by page group

For many years only two levels of activity were registered: a user was considered active in any given month with 5 or more edits, and very active with 100 or more edits. Now, as you can see from the charts in the previous post, more activity levels are registered (and then some, which are not in those plots [5]).
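As a rough illustration of the threshold idea (not the actual wikistats code, which is not shown in this post), here is a minimal Python sketch. The function name `editors_per_level`, the `(user, month)` input format and the default thresholds are assumptions for the example; only the values 5 and 100 come from the text above.

```python
from collections import Counter

def editors_per_level(edits, thresholds=(5, 100)):
    """Count, per month, how many users reached each activity threshold.

    edits: iterable of (user, month) pairs, one pair per edit
           (hypothetical input format, e.g. ("Alice", "2009-05")).
    Returns {month: {threshold: number of users with >= threshold edits}}.
    """
    # Number of edits per (user, month) combination.
    edits_per_user_month = Counter(edits)

    result = {}
    for (user, month), n_edits in edits_per_user_month.items():
        levels = result.setdefault(month, {t: 0 for t in thresholds})
        for t in thresholds:
            if n_edits >= t:
                levels[t] += 1
    return result
```

Adding more activity levels is then just a matter of passing a longer `thresholds` tuple.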

Editor activity levels are now also available separately for article edits, talk page edits and other page edits, and separately for users and bots as well. For an example, see this new table.

As always, users who edit on more than one wikipedia are counted double. Since Single User Login (SUL) was implemented, most user names are unique across all projects, so in theory double counts can now be eliminated. In practice this depends on the availability of the relevant SUL table dumps, which is being worked on.

New wikipedians counted differently

I want to state clearly here that my method of counting new wikipedians may have had something to do with an overestimation of the decline in newcomers to the projects. So beginning with today’s report for the English Wikipedia I changed this method.

From the beginning a user has been considered a wikipedian after 10 edits. However, up till now a user who met this criterion was counted from the moment of registration. From now on the user is counted in the month in which the 10th edit was made. This seemed a minor difference in the early days: most users either left after one or two edits, satisfied that yes, they could edit if only they wanted to, or, if they liked the wiki concept, often exceeded those 10 edits in the first hour or so.

But then the law of large numbers (or call it the long tail) kicked in. It turned out that many of those who left early came back later when Wikipedia gained notoriety, and so were counted not a month or so early but one or two years early, which skewed the chart. Call it a design bug, incremental insight, a different perspective, you choose.

It was done deliberately and openly, but over time I learned that the former approach was counterintuitive for many and, as I said, no longer a simplification with negligible effect. There was a good technical reason to do it that way though: today’s new approach implies comparison of up to 10 timestamps on each of 200 million revisions on the English Wikipedia alone. [6]
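The difference between the two counting methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the wikistats implementation: the function name and the `(user, timestamp)` input format are made up for the example, and the month of a user's first edit is used as a stand-in for the moment of registration.

```python
from collections import defaultdict

def new_wikipedians_per_month(revisions, min_edits=10):
    """Compare the old and the new way of dating new wikipedians.

    revisions: iterable of (user, timestamp) pairs, with ISO-style
               timestamps such as "2007-03-14T12:00:00Z" (assumed format).
    Returns (old_counts, new_counts), each a dict month -> number of users.
      old_counts: user counted in the month of the first edit
                  (stand-in for registration, an assumption of this sketch)
      new_counts: user counted in the month of the min_edits-th edit
    """
    timestamps = defaultdict(list)
    for user, ts in revisions:
        timestamps[user].append(ts)

    old_counts = defaultdict(int)
    new_counts = defaultdict(int)
    for user, stamps in timestamps.items():
        if len(stamps) < min_edits:
            continue  # never became a wikipedian under either method
        stamps.sort()  # ISO timestamps sort chronologically as strings
        old_counts[stamps[0][:7]] += 1              # "YYYY-MM" of first edit
        new_counts[stamps[min_edits - 1][:7]] += 1  # "YYYY-MM" of 10th edit
    return dict(old_counts), dict(new_counts)
```

A user with two edits in January 2005 and eight more in March 2007 lands in 2005-01 under the old method and in 2007-03 under the new one. For simplicity this sketch keeps every timestamp in memory; only the 10 earliest per user are actually needed.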

Added July 20: Do it yourself spreadsheet

The following zip file contains user activity data for all languages in csv format, plus an Excel spreadsheet with English data. You can replace the English charts by copying and pasting data from any language-specific csv file. Take care to copy the proper columns, and copy rows starting from Jan 2003.

Notes


1. This does not take away the need for the full archive dumps, which are of course needed for many more purposes than wikistats (research, right to fork). Tomasz is making progress in that direction as well, and will tell us more in a presentation at Wikimania 2009.

2. After some deliberation and trials there is now a special enriched version of the so-called meta history dump, enriched in such a way that a new tag <redirect /> is added when the current version of the article merely contains a pointer to another page, called a redirect. These redirects (think of aliases) are a very important feature of the MediaWiki software. They make it possible to lead someone to the proper article even when they make a common spelling error, or know only part of the name of a famous person. In fact the English Wikipedia currently has more redirect pages (May 09: 3.6 million or 21% of all pages) than proper articles (May 09: 2.9 million or 17% of all pages).
Note that this implies the following approximation (the only real compromise that I am aware of): now a redirect, always a redirect. A page whose current version is a redirect is treated as a redirect for its entire history.

3. The dump with only meta data for each revision, no article contents, measures an unbelievable 70 GB (uncompressed) for the English Wikipedia. I do not even want to guess how large the full dump will be.

4. For the past year there have been special reports that excluded the English Wikipedia entirely, in order to allow trend reports for the remaining wikipedias. These are now mostly obsolete, but still useful for those data that are still missing for the English Wikipedia, like word and link counts, and the proportion of articles above a minimum size. See these tables and charts.

5. Also the geometric means of two consecutive powers of ten: 10√10, 100√10, 1000√10 …, which are ideal intermediate levels for advanced plotting (the ratios 10 : 10√10 and 10√10 : 100 are equal).

6. All wikistats data are completely regenerated on every run. This may seem inefficient, but actually allows for regular introduction of new functionality with complete coverage back to early days. A blessing and burden at the same time, made possible by complete retention of all article history.
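To make the intermediate levels from note 5 concrete, here is a small Python sketch that generates powers of ten interleaved with their geometric means. The function name and its parameters are hypothetical, chosen only for this example.

```python
import math

def plot_levels(first_exp, last_exp):
    """Return powers of ten from 10**first_exp to 10**last_exp,
    with the geometric mean of each consecutive pair in between."""
    levels = []
    for k in range(first_exp, last_exp):
        low, high = 10 ** k, 10 ** (k + 1)
        levels.append(low)
        # Geometric mean of low and high, i.e. low * sqrt(10).
        levels.append(math.sqrt(low * high))
    levels.append(10 ** last_exp)
    return levels
```

Each level is a factor of the square root of 10 (roughly 3.16) above the previous one, so the levels are evenly spaced on a logarithmic axis.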

This entry was posted in Wikistats Production.
