Wikipedia page views, a global perspective (2)

Today I released four new wikistats reports, which all focus on our global readership.
For years a report has provided insight into how many pages each Wikipedia serves monthly.
More recent reports provide detailed cross sections through traffic.

But our understanding of where in the world those pages are read could be improved.
The following reports, with varying levels of detail, aim to fill this gap.

Like the other reports mentioned, these are based on the sampled server logs. One in a thousand server accesses is logged and stored for a limited time for analysis purposes. A company called MaxMind publishes a database for translating IP addresses into geographic information (which is even free at country level). Thanks to Mark Bergsma for building the MaxMind lookup filter. See below the screenshot for the dirty details.

overview

Dirty details

The 1:1000 sampled squid logs are scanned one day at a time. Counts per day are added to monthly totals. Since each record stands for a thousand original server accesses, totals are multiplied by 1000.
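
The scaling step can be sketched in a few lines of Python. This is only an illustration, not the actual wikistats script: the record layout (one date and one country code per sampled log line) and the names SAMPLE_RATE and monthly_totals are assumptions made for this example.

```python
from collections import Counter
from datetime import date

SAMPLE_RATE = 1000  # one in a thousand server accesses is logged


def monthly_totals(records):
    """Sum sampled records per (year, month, country) and scale them up.

    `records` is an iterable of (day, country_code) tuples, one per
    sampled log line. This layout is hypothetical.
    """
    totals = Counter()
    for day, country in records:
        totals[(day.year, day.month, country)] += 1
    # Each sampled record stands for SAMPLE_RATE original accesses.
    return {key: count * SAMPLE_RATE for key, count in totals.items()}


# Three made-up sampled records spread over two days.
sample = [
    (date(2009, 11, 1), "NL"),
    (date(2009, 11, 2), "NL"),
    (date(2009, 11, 2), "US"),
]
estimates = monthly_totals(sample)
```

With these made-up records the two NL lines become an estimate of 2000 page views for November 2009.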

The counts do not include page requests by bots (crawlers, spiders, etc). Many bots make themselves known by sending identifying information with the page requests. Filtering those is easy. Unfortunately quite a few bots choose to stay anonymous.

On average, for every thousand pages viewed on one day by one person, just one server request ends up in the log (on average, meaning it can be zero on one day and two on another). But a thousand page views is a huge amount for an ordinary person. By far most IP addresses that occur more than once on a single day are bots, which can easily issue many thousands of page requests in a day. Therefore all IP addresses that occur more than once are skipped. A few very avid readers may thus be missing (false negatives). A huge school that sends all outgoing traffic over one IP address may also result in a handful of log records that are discarded (again false negatives). On the other hand, anonymous bots with modest activity may appear only once in the log and be counted (false positives). I am confident that these minor unavoidable imprecisions have little influence on the final results.
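
A minimal sketch of this filter in Python follows. It assumes that "more than once" is judged per day, and it uses a hypothetical record layout of (day, ip, country_code); neither the layout nor the function name comes from the real scripts.

```python
from collections import Counter


def filter_single_occurrence(records):
    """Keep only records whose IP address occurs exactly once on that day.

    `records` is a list of (day, ip, country_code) tuples (hypothetical).
    """
    per_day = Counter((day, ip) for day, ip, _ in records)
    return [r for r in records if per_day[(r[0], r[1])] == 1]


# Made-up log lines with private-range addresses.
log = [
    ("2009-11-01", "10.0.0.1", "NL"),
    ("2009-11-01", "10.0.0.2", "US"),  # twice on this day: treated as a bot
    ("2009-11-01", "10.0.0.2", "US"),
    ("2009-11-02", "10.0.0.2", "US"),  # only once on the next day: kept
]
kept = filter_single_occurrence(log)
```

The last record shows the false positive case: an address that was skipped on one day is counted on another day when it happens to appear only once.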

Note: do not mistake the page views presented in the reports for visits or even unique visitors. Those are entirely different metrics that cannot be deduced from the sampled log.

Anomalies

There is at least one clear anomaly in the presented numbers: 19% of traffic originating in Australia seems to go to the Japanese Wikipedia. This is almost certainly incorrect. Five large organizations (RIRs), organized roughly by continent, are responsible for handing out large ranges of IP addresses to internet service providers. Ronald Beelaard (who has a lot of experience in hunting down mischievous bots) was able to explain this: a large chunk of IP addresses had recently been reassigned by APNIC, and in the absence of proper administration that network defaults to country code AU. Time will cure this.

Also rather unexpected is that 85% of the page views to the Swahili Wikipedia originated from the USA. This may well be caused by the Google contest in recent months, where nice prizes are offered for new content on the Swahili Wikipedia. Also keep in mind that access to the Swahili Wikipedia from Kenya and Tanzania has up till now been very limited, with just tens of thousands of monthly views.

Wikipedia page views, a global perspective

Today I want to announce new wikistats reports. While browsing for a suitable illustration
on Flickr, I came across this picture by Alexander Duret-Lutz. A bookshelf
with neatly organized shelves often serves as a metaphor for Wikipedia.
This picture brings fond memories of our more disorganized past.
I think this image deserves better than a mere supportive role.
So follow this link to my next blog post for the announcement.

bookshop

La caverne aux livres is a secondhand bookshop in Auvers-sur-Oise (Mercator projection).
Alexander Duret-Lutz from Flickr, licensed CC-BY-SA.


Why I changed the title of my previous blog

Today I nuanced my previous blog post (actually mostly the title). I realize the title will be seen as the bottom line of how well we are doing today. I need to nuance that.

I repeat the addendum from that post:

Before I give the wrong impression: the blog title needs some clarification. The title “New editors are joining English Wikipedia in droves!” is what should have been the title of the WSJ article at the time, based on an error-corrected survival analysis. It does not take into account the deletions that have happened since Dr. Ortega’s study. Nevertheless deletions do happen and should be accounted for when looking back at the period from today’s perspective.

In wikistats the general approach is to rebuild all data from scratch on every run. Deleted articles are treated as though they never existed. If we use the same approach here there might still be a small net loss in editors, in hindsight. But for a reason outside the scope of the study used for the WSJ article.

trends7

I chose to write a new post to explain the above figure.

The figure shows in a series of solid lines how a huge net deficit in editors, births minus deaths (see previous blog post), for the last month that features in a line plot based on Dr. Ortega’s parameters, gets smaller and smaller in retrospect when looking back from a more distant future. This is because more and more deaths turned out not to be deaths after all (editors were merely ‘sleeping’). All of this is explained in the previous post, for which the logic has not changed; it is just not the complete story.

The light brown dotted line shows the first correction on the last death count provided to the Wall Street Journal by Dr. Ortega. After more successive corrections this line will go from negative to positive numbers, and markedly so (compare solid lines). This is the data from the May 09 dump, with the first correction using percentages from my study.

However, the dark brown dotted line (based on the numbers as provided last week by Dr. Ortega) shows the de facto correction so far, when article deletions are also taken into account. It is hard to tell whether this line will cross from negative to positive after successive corrections as per my analysis. The downward shift due to deletions also gets weaker and weaker on successive future runs.

New editors are joining English Wikipedia in droves?

After deliberation I decided to publish this story as one article despite its length, for clarity’s sake.

*Dec 7th: I changed the exclamation mark into a question mark. See also the remark at the end of this post on how to interpret the blog title.

Introduction

Volunteers Log Off as Wikipedia Ages, “Wikipedia editors are leaving in droves”. The Wall Street Journal (WSJ) knows how to capture an audience. It had picked up a study by Spanish researcher Dr. Felipe Ortega, made the most of one finding among many, asked WMF to validate these numbers on really short notice, which was not possible, and put it on their front page. Thousands of newspapers and web sites dutifully copied the news.

WMF deputy CEO Erik Moeller and I published an official response, in which the validity of some premises used for the study and article were questioned.

One observation we made: counting every person as editor who made one update over the years requires a very extreme definition of that word. Wikipedia’s internal statistics only count a person as editor who has 5 or more edits in one month, and even that threshold we consider an extreme all embracing lower limit.

A second issue that raised our eyebrows were the criteria by which people were counted in or out. I quote: “Dr. Ortega qualifies as the editor’s “log-off date” the last time they contributed. This is a snapshot in time and doesn’t predict whether the same person will make an edit in the future, nor does it reflect the actual number of active editors in that month.”

I will follow the customary terms for survival analysis, which are also used in Dr. Ortega’s study: birth for the first edit by a person, death for the last edit if sufficiently back in time. Excusez les mots.

This post describes an attempt to get to grips with this inherent weakness of survival studies. Or should I say the interpretation of findings beyond acceptable safety limits? First I’ll demonstrate the inherent problem with a set of line plots, then I’ll estimate the size of the introduced error, which leads to a conclusion that WSJ might want to publish on their front page.

Graphical demonstration of systemic error

The first line plot is from Dr. Ortega’s study, and describes the net loss in number of editors: births - deaths.

ortegachart

Below follows my version of this plot, built from a full history ‘stub’ dump from Nov 3, 2009. Essentially the same line plot as Dr. Ortega’s, but with a few extra months of data, and a breakdown into several levels of activity. The 1+ blue line matches Dr. Ortega’s line.

Now how much of the decline in the last months is an artefact? In other words, how many editors were in fact ‘sleeping’ and not ‘dead’? In yet other words, how much will the lines have shifted for these same months when looking back one or two years from now?

I can’t predict the future. But I can remake the same plot as though it were made a year ago by skipping the last 12 months of edits.

Look carefully: the great walk-away happens much, much earlier in time when the plot covers fewer months. The following three plots demonstrate that, by filtering out the last 12 months of data and doing that again and again. All plots have the same scale.

Note: unlike wikistats, all namespaces are counted. Like wikistats, only registered users are counted.

skewededitorsurvivalrate14

skewededitorsurvivalrate22

skewededitorsurvivalrate31

skewededitorsurvivalrate41

The following plot shows the difference between the first and second version of the survival plots, between the one that runs to 2009 and the one that runs to 2008. In other words, these values show the size of the error in the 2008 plot that surfaced in the 2009 plot. Likewise in 2010 there will be yet again a (smaller) correction for 2008 data and a large one for 2009 data.

From this plot it becomes very clear that the number of deaths in the latest months (even years) of a survival plot is exaggerated considerably.

skewededitorsurvivalrate52

The last plot shows the editors that arrive (births) and leave (deaths) separately. The birth count can be trusted, no artefacts there. The death count is, as shown above, more and more overestimated in more recent months.

skewededitorsurvivalrate64

skewededitorsurvivalrate6b

Prediction of survival rate after error correction

Wall Street Journal totaled the perceived net deficits (births-deaths) for Jan/Feb/Mar 2009 and concluded that more than 49,000 editors had left the English Wikipedia. We know now that this deficit needs a serious correction, but how much?

A first thought might be to wait two years, rerun the script on the then newest data dump, look again at births and deaths for Jan/Feb/Mar 2009, and conclude how much of the perceived deficit was a methodological error. However, that new data dump is not simply the current stub dump with two years of revisions added. In the meantime many revisions that exist in the current database will have been deleted. As a result births and deaths for people who contribute very infrequently will shift considerably in time (think of newbies and incidental vandals), or these users will disappear altogether from the database. All of this masks part of the error in the survival rate, and is an undesirable artefact.

Instead let us use a cleaner approach, without artefacts, that yields answers now. Let us look back in time again, create virtual historic dumps by filtering out a varying number of recent months from the last dump, and count births and deaths, up to but excluding the last two months (like Dr. Ortega did). People who edited at least once in those last two months are declared alive, others are declared dead. Unlike the line plots above, we now go back in increments of 5 months (the newest dump contains 5 more months than the one used for the WSJ article).

This part of the analysis was done in cooperation with Dr. Ortega. The following table is based on the method he proposed to quantify the error. It shows the number of deaths for certain months (the most recent three months for which survival rates are calculated) on consecutive runs with 20, 15, 10, 5 and 0 months filtered out. Next to that is the overall error in the oldest assessment that surfaced on subsequent runs. I adopted Dr. Ortega’s definition for error percentage: change in death counts as percentage of corrected death counts. Unlike Dr. Ortega I used the last dump for all runs, as described.
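
The virtual dump approach can be sketched as follows. This is an illustration under assumptions, not the script used for the analysis: months are plain integers, the `edits` mapping and the function name are hypothetical, and the three editors are made up.

```python
from collections import Counter


def births_and_deaths(edits, last_month, alive_window=2):
    """Count births and deaths per month in a virtual historic dump.

    `edits` maps an editor to the months (integers) in which that editor
    made at least one edit. Edits after `last_month` are filtered out,
    which simulates an older dump. Editors with an edit in the final
    `alive_window` months are declared alive; all others are declared
    dead in the month of their last remaining edit.
    """
    births, deaths = Counter(), Counter()
    cutoff = last_month - alive_window  # last month that is reported
    for months in edits.values():
        seen = [m for m in months if m <= last_month]
        if not seen:
            continue  # editor does not exist yet in this virtual dump
        if min(seen) <= cutoff:
            births[min(seen)] += 1
        if max(seen) <= cutoff:
            deaths[max(seen)] += 1
    return births, deaths


edits = {
    "A": [0, 1, 9],      # 'sleeping' between month 1 and month 9
    "B": [2],            # really gone after month 2
    "C": [3, 4, 5, 10],  # 'sleeping' between month 5 and month 10
}
births_old, deaths_old = births_and_deaths(edits, last_month=7)
births_new, deaths_new = births_and_deaths(edits, last_month=10)
```

In the older virtual dump editors A and C are counted as dead; in the newer one both turn out to have been sleeping, while the birth counts stay the same.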
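
As a small sketch, the error percentage defined above can be written as a one-line Python function. The function name and the example numbers are made up for illustration and are not taken from the table.

```python
def error_percentage(initial_deaths, corrected_deaths):
    """Change in death counts as a percentage of the corrected death count."""
    return 100.0 * (initial_deaths - corrected_deaths) / corrected_deaths


# Made-up example: 1300 deaths counted at first, 1000 left after correction.
example_error = error_percentage(1300, 1000)
```

Note that the corrected count is the denominator, so an initial count of 1300 that later shrinks to 1000 is an error of 30%, not 23%.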

errortrend1

The trend is clear and consistent: deaths for the last months in a dump are overestimated hugely. When more and more months of data become available the earlier calculated deaths need to be corrected again and again.

Another metric we need is the total number of unique editors over the three months that contributed to WSJ’s number of 49,000 quitters, again using the all inclusive definition of editor where one edit suffices.

Here are the numbers:
Jan 2009: 139793
Feb 2009: 135505
Mar 2009: 142112
Jan-Mar 2009: 313266

Conservative and realistic new estimates for survival rate

Let us be quite conservative and postulate a final error, after the numbers have settled down (looking back from a sufficiently distant future), of 25% for Jan-09, 30% for Feb-09 and 35% for Mar-09. Again, the error percentage is defined as the change in death counts as a portion of the new death counts.

Let us use the death counts from Dr. Ortega’s study for Jan/Feb/Mar 2009 (derived from the dump of May 2009) and apply these corrections.
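
Applying a correction means inverting the error definition: if error = (reported - corrected) / corrected, then corrected = reported / (1 + error). A minimal Python sketch, in which the function name is an assumption and the death count in the example is made up (it is not a figure from Dr. Ortega’s study):

```python
# Error percentages postulated in the text for the conservative scenario.
POSTULATED_ERROR = {"Jan 2009": 25, "Feb 2009": 30, "Mar 2009": 35}


def corrected_deaths(reported_deaths, error_pct):
    """Invert error = (reported - corrected) / corrected."""
    return reported_deaths / (1 + error_pct / 100)


# Made-up reported death count, for illustration only.
example = corrected_deaths(12500, POSTULATED_ERROR["Jan 2009"])
```

With a 25% error a reported count of 12,500 deaths would shrink to 10,000 after correction.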

calculation1

In a less conservative scenario, assuming the error trend is consistent over time:

calculation24

Conclusion

Both the all-inclusive definition of the term editor, and the systemic error, make the survival analysis as used a highly questionable method for analyzing the size of a community for recent years. By not realizing these inherent weaknesses the Wall Street Journal drew and published conclusions which are completely out of line with reality.

In wikistats we will continue to use editors with x+ edits per month as measure for ongoing activity. This measurement tells us that the number of editors that contribute monthly above a certain threshold has been relatively stable in the last 12 months, even more so for our core editor base, people that made 100 or more edits in a month.
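
The "x+ edits per month" measure can be sketched like this. It is not the wikistats code: the edit log layout (one (editor, month) tuple per edit) and the function name are assumptions for this example.

```python
from collections import Counter


def active_editors(edit_log, threshold=5):
    """Count editors per month who made at least `threshold` edits.

    `edit_log` is an iterable of (editor, month) tuples, one per edit
    (hypothetical layout).
    """
    edits_per_editor_month = Counter(edit_log)
    active = Counter()
    for (_editor, month), n in edits_per_editor_month.items():
        if n >= threshold:
            active[month] += 1
    return dict(active)


# Made-up edit log: A is active in both months, B stays below 5 edits.
log = (
    [("A", "2009-10")] * 5
    + [("B", "2009-10")] * 4
    + [("A", "2009-11")] * 100
)
editors_5plus = active_editors(log, threshold=5)
editors_100plus = active_editors(log, threshold=100)
```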

editspermonth

I want to thank Dr. Ortega for his feedback and support, which helped to make the above analysis possible.


*Addendum

Before I give the wrong impression: the blog title needs some clarification. The title “New editors are joining English Wikipedia in droves!” is what should have been the title of the WSJ article at the time, based on an error-corrected survival analysis. It does not take into account the deletions that have happened since Dr. Ortega’s study. Nevertheless deletions do happen and should be accounted for when looking back at the period from today’s perspective.

In wikistats the general approach is to rebuild all data from scratch on every run. Deleted articles are treated as though they never existed. If we use the same approach here, there might still be a small net loss in editors, in hindsight. But for a reason outside the scope of the study used for the WSJ article.

More on this in a new blog post.

The Wikimedia Report Card

After several months of incremental development and internal circulation within WMF, the Wikimedia Report Card goes public. It was commissioned to help board, staff, community and outsiders get a quick overview of trends.

The nature of the report dictates conciseness, especially in the summarized versions. For further analysis the report contains links that lead to other, more detailed wikistats reports. All charts cover exactly one year.

The Report Card comes in three layouts:

  1. A detailed report with tabs for different views on the data. Most charts come in three variations:
    • Absolute numbers on a linear scale
    • Absolute numbers on a logarithmic scale
      This allows widely diverging values to appear in one chart, but makes growth curves less intuitive to interpret
    • An indexed version, which allows comparison of relative growth trends: each line starts at 100% for the first month
  2. A summary in one column for online browsing
  3. A summary in two columns for printing on one large sheet
    (follow link on the page for print instructions)
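
The indexed variation mentioned in the list above can be sketched in Python as follows. The function name and the sample values are made up; the real charts are of course built from wikistats data.

```python
def index_series(values):
    """Rescale a series so that its first month equals 100 (percent)."""
    base = values[0]
    return [100.0 * v / base for v in values]


# Two made-up series of very different size become directly comparable.
large_wiki = index_series([200, 250, 300])
small_wiki = index_series([2, 3, 4])
```

After indexing, both series start at 100, so the faster relative growth of the smaller series is visible at a glance.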

reportcardfullpage