After some deliberation I decided to publish this story as one article despite its length, for clarity's sake.
*Dec 7th: I changed the exclamation mark into a question mark. See also the remark at the end of this blog on how to interpret the blog title.
“Volunteers Log Off as Wikipedia Ages”, “Wikipedia editors are leaving in droves”. The Wall Street Journal (WSJ) knows how to capture an audience. It picked up a study by Spanish researcher Dr. Felipe Ortega, made the most of one finding among many, asked WMF to validate these numbers on very short notice, which was not possible, and put it on their front page. Thousands of newspapers and web sites dutifully copied the news.
WMF deputy CEO Erik Moeller and I published an official response, in which the validity of some of the premises used for the study and the article was questioned.
One observation we made: counting as an editor every person who made even a single edit over the years requires a very extreme definition of that word. Wikipedia's internal statistics only count a person as an editor when they made 5 or more edits in one month, and even that threshold we consider an extremely all-embracing lower limit.
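To make the difference between the two definitions concrete, here is a minimal sketch of the wikistats-style threshold count. The data and names are illustrative, not the actual wikistats code:

```python
from collections import Counter

# Hypothetical month of edits as (user, edit_id) pairs; names are illustrative.
edits_in_month = [("alice", i) for i in range(7)] + [("bob", 0)]

def active_editors(edits_in_month, threshold=5):
    """Wikistats-style count: only users with `threshold` or more
    edits in the month qualify as editors."""
    per_user = Counter(user for user, _ in edits_in_month)
    return sorted(u for u, n in per_user.items() if n >= threshold)

print(active_editors(edits_in_month))  # -> ['alice']
```

Under the study's one-edit definition both users would count; under the 5+ threshold only one does.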
A second issue that raised our eyebrows were the criteria by which people were counted in or out. I quote: “Dr. Ortega qualifies as the editor’s “log-off date” the last time they contributed. This is a snapshot in time and doesn’t predict whether the same person will make an edit in the future, nor does it reflect the actual number of active editors in that month.”
I will follow the customary terms of survival analysis, which are also used in Dr. Ortega’s study: birth for a person’s first edit, death for their last edit, provided it lies sufficiently far back in time. Pardon the morbid vocabulary.
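In these terms, births and deaths can be derived directly from an edit log. A minimal sketch, assuming edits are available as (user, date) pairs; the data, field names and cutoff are illustrative, not the study's actual pipeline:

```python
from datetime import date

# Hypothetical edit log as (user, edit date) pairs; names are illustrative.
edits = [
    ("alice", date(2007, 3, 1)), ("alice", date(2009, 10, 5)),
    ("bob",   date(2006, 1, 15)), ("bob",   date(2007, 2, 2)),
]

def births_and_deaths(edits, cutoff):
    """Birth = first edit; death = last edit, but only when it lies
    before `cutoff` (i.e. sufficiently far back in time)."""
    first, last = {}, {}
    for user, day in edits:
        first[user] = min(first.get(user, day), day)
        last[user] = max(last.get(user, day), day)
    deaths = {u: d for u, d in last.items() if d < cutoff}
    return first, deaths

# With a Nov 2009 dump and a two-month grace period, the cutoff is Sep 1, 2009.
births, deaths = births_and_deaths(edits, cutoff=date(2009, 9, 1))
```

Anyone with an edit on or after the cutoff is considered alive; everyone else is assigned a death in the month of their last edit.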
This post describes an attempt to come to grips with this inherent weakness of survival studies. Or should I say: with the interpretation of findings beyond acceptable safety limits? First I’ll demonstrate the inherent problem with a set of line plots, then I’ll estimate the size of the introduced error, which leads to a conclusion that the WSJ might want to publish on their front page.
Graphical demonstration of systemic error
The first line plot is from Dr. Ortega’s study, and describes the net loss in the number of editors: births – deaths.
Below follows my version of this plot, built from a full history ‘stub’ dump from Nov 3, 2009. It is essentially the same line plot as Dr. Ortega’s, but with a few months of extra data, and a breakdown into several levels of activity. The 1+ blue line matches Dr. Ortega’s line.
Now how much of the decline in the last months is an artefact? In other words, how many editors were in fact ‘sleeping’ rather than ‘dead’? In yet other words, how much will the lines have shifted for these same months when we look back one or two years from now?
I can’t predict the future. But I can remake the same plot as though it had been made a year ago, by skipping the last 12 months of edits.
Look carefully: the great walk-away happens much, much earlier in time when the plot covers fewer months. The following three plots demonstrate that, by filtering out the last 12 months of data again and again. All plots have the same scale.
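The truncation trick itself is simple: drop every edit after a cutoff and recompute. The sketch below (with a made-up editor, not real data) shows how a long-pausing editor who counts as alive in the full data appears dead once the last 12 months are filtered out:

```python
from datetime import date

# Hypothetical editor with a long pause: edits in 2007, then again in mid-2009.
edit_dates = [date(2007, 5, 1), date(2009, 6, 20)]

def last_edit_before(edit_dates, cutoff):
    """Simulate an older dump: ignore all edits on or after the cutoff."""
    kept = [d for d in edit_dates if d < cutoff]
    return max(kept) if kept else None

# Full dump (Nov 2009): last edit is Jun 2009 -> recent, counted as alive.
full_dump_last = last_edit_before(edit_dates, date(2009, 11, 1))

# Truncated dump (Nov 2008): last visible edit is May 2007 -> counted as dead.
truncated_last = last_edit_before(edit_dates, date(2008, 11, 1))
```

The same editor thus contributes a spurious ‘death’ in May 2007 to the truncated plot, which the fuller plot later takes back.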
Note: unlike wikistats, all namespaces are counted here. Like wikistats, only registered users are counted.
The following plot shows the difference between the first and second versions of the survival plots: the one that runs to 2009 and the one that runs to 2008. In other words, these values show the size of the error in the 2008 plot that surfaced in the 2009 plot. Likewise, in 2010 there will again be a (smaller) correction for the 2008 data and a large one for the 2009 data.
From this plot it becomes very clear that the number of deaths in the latest months (even years) of a survival plot is exaggerated considerably.
The last plot shows the editors that arrive (births) and leave (deaths) separately. The birth count can be trusted; no artefacts there. The death count, as shown above, is increasingly overestimated in more recent months.
Prediction of survival rate after error correction
The Wall Street Journal totaled the perceived net deficits (births – deaths) for Jan/Feb/Mar 2009 and concluded that more than 49,000 editors had left the English Wikipedia. We now know that this deficit needs a serious correction, but by how much?
A first thought might be to wait two years, rerun the script on the then-newest data dump, look again at the births and deaths for Jan/Feb/Mar 2009, and conclude how much of the perceived deficit was a methodological error. However, that new data dump will not simply be the current stub dump with two years of revisions added. In the meantime many revisions that exist in the current database will have been deleted. As a result, births and deaths for people who contribute very infrequently will shift considerably in time (think of newbies and incidental vandals), or these users will disappear altogether from the database. All of this masks part of the error in the survival rate, and is an undesirable artefact.
Instead let us use a cleaner approach, without artefacts, that yields answers now. Let us look back in time again, create virtual historic dumps by filtering out a varying number of recent months from the last dump, and count births and deaths, up to but excluding the last two months (as Dr. Ortega did). People who edited at least once in those last two months are declared alive; the others are declared dead. Unlike the line plots above, we now go back in increments of 5 months (the newest dump contains 5 more months than the one used for the WSJ article).
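A sketch of that procedure, with hypothetical users and helper names of my own choosing (not the actual analysis script):

```python
from datetime import date

def shift_months(d, delta):
    """Move a date back by `delta` months, snapped to the first of the month."""
    m = d.month - delta - 1
    return date(d.year + m // 12, m % 12 + 1, 1)

def deaths_per_month(last_edit_by_user, dump_month, alive_window=2):
    """Declare users alive if they edited within the last `alive_window`
    months before the dump; otherwise count a death in the month
    of their last edit."""
    cutoff = shift_months(dump_month, alive_window)
    deaths = {}
    for user, last in last_edit_by_user.items():
        if last < cutoff:
            key = (last.year, last.month)
            deaths[key] = deaths.get(key, 0) + 1
    return deaths

# Virtual historic dumps: filter out 20, 15, 10, 5 and 0 months and recount.
dump = date(2009, 11, 1)
last_edits = {"u1": date(2008, 1, 10), "u2": date(2009, 9, 5)}  # hypothetical
for removed in (20, 15, 10, 5, 0):
    virtual_dump = shift_months(dump, removed)
    visible = {u: d for u, d in last_edits.items() if d < virtual_dump}
    counts = deaths_per_month(visible, virtual_dump)
```

Each pass through the loop plays the role of one historic dump; comparing the death counts across passes exposes how much each earlier assessment was off.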
This part of the analysis was done in cooperation with Dr. Ortega. The following table is based on the method he proposed to quantify the error. It shows the number of deaths for certain months (the most recent three months for which survival rates are calculated) on consecutive runs with 20, 15, 10, 5 and 0 months filtered out, and next to that the overall error in the oldest assessment that surfaced on subsequent runs. I adopted Dr. Ortega’s definition of the error percentage: the change in death counts as a percentage of the corrected death counts. Unlike Dr. Ortega, I used the last dump for all runs, as described.
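My reading of that error definition, as a one-liner (the example numbers are made up, not figures from the table):

```python
def error_pct(old_deaths, corrected_deaths):
    """Change in death counts as a percentage of the corrected counts."""
    return 100.0 * (old_deaths - corrected_deaths) / corrected_deaths

# Hypothetical: a month first assessed at 12,000 deaths, later settling at 10,000.
print(error_pct(12000, 10000))  # -> 20.0
```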
The trend is clear and consistent: deaths for the last months in a dump are hugely overestimated. As more and more months of data become available, the earlier calculated death counts need to be corrected again and again.
Another metric we need is the total number of unique editors over the three months that contributed to the WSJ’s number of 49,000 quitters, again using the all-inclusive definition of editor in which one edit suffices.
Here are the numbers:
Jan 2009: 139,793
Feb 2009: 135,505
Mar 2009: 142,112
Jan–Mar 2009: 313,266
Conservative and realistic new estimates for survival rate
Let us be quite conservative and postulate a final error, after the numbers have settled down (looking back from a sufficiently distant future), of 25% for Jan 2009, 30% for Feb 2009, and 35% for Mar 2009. Again, the error percentage is defined as the change in death counts as a portion of the new death counts.
Let us use the death counts from Dr. Ortega’s study for Jan/Feb/Mar 2009 (derived from the dump of May 2009) and apply these corrections.
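Inverting the error definition gives the correction: if error e = (observed − corrected) / corrected, then corrected = observed / (1 + e). A sketch with placeholder observed death counts (deliberately not Dr. Ortega’s figures, which are in the table above):

```python
# Postulated final errors from the conservative scenario above.
errors = {"Jan 2009": 0.25, "Feb 2009": 0.30, "Mar 2009": 0.35}

# Placeholder observed death counts, for illustration only.
observed = {"Jan 2009": 60000, "Feb 2009": 55000, "Mar 2009": 50000}

def corrected_deaths(observed, errors):
    """corrected = observed / (1 + error), since
    error = (observed - corrected) / corrected."""
    return {m: observed[m] / (1 + errors[m]) for m in observed}

for month, c in corrected_deaths(observed, errors).items():
    print(f"{month}: {c:,.0f} deaths after correction")
```

With a 25% error, for instance, four fifths of the observed deaths survive the correction.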
In a less conservative scenario, assuming the error trend is consistent over time:
Both the all-inclusive definition of the term editor and the systemic error make the survival analysis, as used, a highly questionable method for analyzing the size of a community in recent years. By not recognizing these inherent weaknesses, the Wall Street Journal drew and published conclusions that are completely out of line with reality.
In wikistats we will continue to use editors with x+ edits per month as a measure of ongoing activity. This measurement tells us that the number of editors who contribute monthly above a certain threshold has been relatively stable over the last 12 months, even more so for our core editor base: people who made 100 or more edits in a month.
I want to thank Dr. Ortega for his feedback and support, which helped make the above analysis possible.
Before I give the wrong impression, the blog title needs some clarification. The title “New editors are joining English Wikipedia in droves!” is what the title of the WSJ article should have been at the time, based on an error-corrected survival analysis. It does not take into account the deletions that have happened since Dr. Ortega’s study. Nevertheless, deletions do happen and should be accounted for when looking back at the period from today’s perspective.
In wikistats the general approach is to rebuild all data from scratch on every run. Deleted articles are treated as though they never existed. If we used the same approach here, there might still be a small net loss of editors, in hindsight, but for a reason outside the scope of the study used for the WSJ article.
More on this in a new blog post.