Full history dump for English Wikipedia is back

Normally I do not blog about the arrival of a new Wikipedia dump, but this occasion is truly something to be celebrated!

The dump at hand contains all revisions of all articles in the English Wikipedia (except articles and individual revisions that were deleted from the database). It is hugely important for researchers: many cherish the English Wikipedia as their favorite object of study, it being the oldest, largest, most viewed and most edited Wikipedia. This particular dump has had a long history of ill fortune: for 4.5 years no new full archive dump of the English Wikipedia had been consistently available [1].

Over the years there has been no shortage of outcry over its absence, despite serious attempts to remedy the situation. Mind you, it was not for lack of trying that no new dump made it to completion. Wikimedia’s lead developer and later CTO Brion Vibber, amid a thousand other responsibilities, did improve the robustness of the dump job over time (many mishaps can occur over months of running time, e.g. network outages). But fixing the dump proved a moving target: on every attempt the sheer amount of work to be done had increased enormously. Once the script had been made thoroughly robust, processing time became the next hurdle. At some stage a new phenomenon occurred: during a run, month after month, the estimated time of arrival (ETA) kept going up instead of down. Memorably, when on one run the job signaled there was still a year to go before completion, it was manually aborted.

All’s well that ends well. In 2009 lead developer Tim Starling managed to find the time to revise the internal compression scheme used to store articles efficiently in the database, thus significantly shortening dump time. Meanwhile Tomasz Finc worked hard to upgrade and overhaul the entire dump infrastructure, on both the hardware and software level. On March 11, 2010, after a measly 40 days of processing time, a full archive English Wikipedia dump was finally produced, which in another 15 days was recompressed into a much more compact format. Meanwhile wikistats started to do its part, and after 21 days of data crunching, complete wikistats are again available for the English Wikipedia!

[Chart: edits, English Wikipedia]

[1] I said ‘consistently available’ because in 2007 at least one valid dump was indeed produced, but it got lost within weeks due to an overzealous space cleanup script, alas before it could be processed by wikistats. Several incomplete dumps followed (some believe there was a complete and valid dump in 2008, but if so it was also short-lived). Until now the latest full dump that wikistats had been able to process dated from Sep 2006.

Some stats on the new dump

Dump size

  • 5.6 TB uncompressed (see the streaming sketch below)
  • 280 GB in bz2 compression format
  • 32 GB in 7z compression format
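
The 5.6 TB figure need not scare anyone off: the uncompressed XML never has to be materialized on disk. As a minimal sketch (the file name is assumed; substitute the actual pages-meta-history file you downloaded), the bz2 archive can be streamed line by line in Python, here simply counting pages and revisions:

    # Minimal sketch: stream the compressed history dump without ever
    # writing 5.6 TB of uncompressed XML to disk.
    import bz2

    DUMP = "enwiki-pages-meta-history.xml.bz2"  # assumed local file name

    pages = revisions = 0
    with bz2.open(DUMP, "rt", encoding="utf-8") as xml:
        for line in xml:
            tag = line.strip()
            if tag == "<page>":
                pages += 1
            elif tag == "<revision>":
                revisions += 1

    print(pages, "pages,", revisions, "revisions")

The 7z file can be handled the same way by piping the output of a command-line decompressor (for example 7z e -so) into a similar script reading from standard input.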

Edits all time

Edits Jan 2010

  • There are 4 million article edits per month
  • 40,000+ registered users made 5+ article edits, 4,000+ made 100+ edits
  • 10,000+ users made 5+ discussion page edits, 1,000+ made 100+ edits
  • 10,000+ users made 5+ other page edits, 1,000+ made 100+ edits
  • 100+ bots edited 5+ times, 7 bots even 10,000+ times

Content (article namespace only)

  • 3.2 million article pages
  • 4 million redirect pages
  • 1.9 million links to images (800,000 unique images)
  • 7 million links to other Wikimedia projects
  • 78 million links to other pages in this Wikipedia
  • 11 million links to other web sites
  • 1.8 billion words

10 Responses to Full history dump for English Wikipedia is back

  1. Stephen B. says:

    This is fascinating. I’ve been here since 2004 on a near daily basis and can really associate with these numbers; I personally felt the culture and mood change as it doubled and doubled again.

    Obviously a peak was reached in the summer of 2007. Has anyone asked what caused it? I suspect the obvious answers are probably wrong. Given the steady flat line since, it suggests a latent potential for growth that is being held back for some systemic reason. It may be that there are not enough admins to keep the peace, which causes discontent among editors who leave the project in frustration. It could be that anything over that peak just somehow breaks the magic that makes Wikipedia work, that it doesn’t scale beyond it. It would be interesting to find out.

  2. Erik says:

    My favourite explanation for the peak in summer 2007 is that around that time Wikipedia reached almost 100% name recognition among internet users in the western world. When the novelty wore off, fewer people wanted to give it a brief trial run just to see what editing is like. Some vandals moved on elsewhere. Most ‘naturally born wikipedians’ now knew of the project and had already joined. All of this is just a hunch and at best a partial explanation.

  3. Great news! This is really a valuable resource for research.

    The size of the uncompressed dump might be a problem for small research teams. Are smaller full history dumps available? For instance, including only ‘Featured Articles’.

    I have been doing some research work based on Wikipedia’s data, and would like to know where the best place is to discuss issues related to data access. Specifically, I would like to know if HTTP access logs are being kept and if they can be made available. I think that HTTP referrer info could be of great value.

    Thanks in advance for your feedback.

  4. Erik says:

    > Are smaller full history dumps available? For instance, including only ‘Featured Articles’.

    No, in that case I can advise using the API to extract relevant info.

    http://www.mediawiki.org/wiki/API
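
    For example, something along these lines (just a sketch; the article title and the fields requested are arbitrary examples) pulls the full revision history of one article through the API:

    # Sketch: fetch all revisions of one article via the MediaWiki API.
    # The article title and the requested fields are arbitrary examples.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def revisions(title):
        params = {
            "action": "query",
            "prop": "revisions",
            "titles": title,
            "rvprop": "ids|timestamp|user|comment",
            "rvlimit": "max",
            "format": "json",
        }
        while True:
            data = requests.get(API, params=params).json()
            page = next(iter(data["query"]["pages"].values()))
            for rev in page.get("revisions", []):
                yield rev
            if "continue" not in data:
                break
            params.update(data["continue"])

    for rev in revisions("Wikipedia"):
        print(rev["timestamp"], rev["user"])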

    1:1000 sampled squid access logs are kept for 6 months, but those are not open to the public. Wikimedia has a strict privacy policy.

    http://wikimediafoundation.org/wiki/Privacy_policy

  5. Alex says:

    40 days to compress and 15 to recompress? How long would it take to decompress this thing on my laptop?

  6. Erik says:

    According to http://tinyurl.com/y4f9dkh

    “Uncompressing and md5ing the bz2 took well over a week. Uncompressing and md5ing the 7z took less than a day.”

    I have no idea how much of that was for decompressing and how much for checksumming. Please let me know if you have your own results.

  7. emijrp says:

    Hi Erik,

    Thanks for the analysis and the dump (Tim and Tomasz). It is curious how the edit rate is declining, but we have known that for months. Now I hope that the new MediaWiki interface (Vector) can boost the edits from newbies. We will see the effects in the coming months.

    Regards

  8. Mark says:

    Erik: it looks like the squid stats are public at http://dammit.lt/wikistats/

  9. Erik says:

    Mark, those are aggregated squid logs, produced by Domas Mituzas: page view counts per article per hour.

    No sensitive raw data, like IP addresses, etc.

  10. Pingback: links for 2010-04-14 « Mandarine
