Normally I do not blog about the arrival of a new Wikipedia dump, but this occasion is truly something to be celebrated!
The dump at hand contains all revisions of all articles in the English Wikipedia (except articles and single revisions that were deleted from the database). It is hugely important for researchers: many cherish the English Wikipedia as their favorite object of study, it being the oldest, largest, most viewed and edited Wikipedia. This particular dump has had a long history of ill fortune: for 4.5 years no new full archive dump for the English Wikipedia has been consistently available 1.
Over the years there has not been a shortage of outcry over its absence, despite serious attempts to remedy the situation. Mind you, it was not for a lack of trying that no new dump made it to completion. Wikimedia’s lead developer and later CTO Brion Vibber, amid a thousand other responsibilities, did improve the robustness of the dump job over time (many mishaps can occur over a period of months of running time, e.g. network outages). But fixing the dump proved a moving target. On every attempt the sheer amount of work to be done had increased enormously. Once the script had been made totally robust, processing time became the next hurdle. At some stage a new phenomenon occurred: during a run, month after month the expected time of arrival (ETA) kept going up instead of down. Memorably, when on one run the job signaled there was still a year to go before completion, it was manually aborted.
All is well that ends well. In 2009 lead developer Tim Starling managed to find the time to revise the internal compression scheme used to store articles efficiently in the database, thus significantly shortening dump time. Meanwhile Tomasz Finc worked hard to upgrade and overhaul the entire dump infrastructure, at both the hardware and software level. On March 11, 2010, after a measly 40 days of processing time, a full archive English Wikipedia dump was finally produced, which in another 15 days was recompressed into a much more compact format. Meanwhile wikistats started to do its part, and after 21 days of data crunching complete wikistats are again available for the English Wikipedia!
1: I said ‘consistently available’ because in 2007 at least one valid dump was indeed produced, but it got lost within weeks due to an overzealous space cleanup script, alas before it could be processed by wikistats. Several incomplete dumps followed (some believe there was a complete and valid dump in 2008, but if so it was also short-lived). Until now the latest full dump that wikistats had been able to process dated from Sep 2006.
Some stats on the new dump
- 5.6 TB uncompressed
- 280 GB in bz2 compression format
- 32 GB in 7z compression format
- 220 million edits on articles, 70 million (32%) of those by anonymous users.
- 550 thousand unique registered users contributed 10 or more edits each.
- On average 70 edits per article.
- Over 90% of all edits by less than 8% of registered users (still a formidable 200 thousand users).
- Most edited article: ‘Wikipedia:Administrator_intervention_against_vandalism’ : 640+ thousand times
- Largest article archive: ‘Wikipedia:Administrators’ noticeboard/Incidents’ : 117 GB uncompressed
- There are 4 million article edits per month
- 40,000+ registered people made 5+ article edits, 4000+ made 100+ edits
- 10,000+ users made 5+ discussion page edits, 1000+ made 100+ edits
- 10,000+ users made 5+ other page edits, 1000+ made 100+ edits
- 100+ bots edited 5+ times, 7 bots even 10,000+ times
- 3.2 MM article pages
- 4 MM redirect pages
- 1.9 MM links to images (800 k unique images)
- 7 MM links to other Wikimedia projects
- 78 MM links to other pages in this Wikipedia
- 11 MM links to other web sites
- 1.8 billion words
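
For readers who want to check how some of these figures relate to each other, here is a small sketch in Python. It only redoes arithmetic on the rounded numbers quoted in the list; the variable names are my own, and I am assuming decimal units (1 TB = 1000 GB), which the original figures may not strictly follow.

```python
# Rounded figures copied from the list above.
# Assumption: decimal units, so 5.6 TB is taken as 5600 GB.
uncompressed_gb = 5600
bz2_gb = 280
sevenzip_gb = 32

total_edits = 220e6        # edits on articles
anonymous_edits = 70e6     # of which by anonymous users
article_pages = 3.2e6      # article pages


def compression_ratio(original_gb, compressed_gb):
    """How many times smaller the compressed file is than the original."""
    return original_gb / compressed_gb


bz2_ratio = compression_ratio(uncompressed_gb, bz2_gb)
sevenzip_ratio = compression_ratio(uncompressed_gb, sevenzip_gb)
anonymous_share = anonymous_edits / total_edits
edits_per_article = total_edits / article_pages

print(f"bz2 compression ratio: {bz2_ratio:.0f}:1")
print(f"7z compression ratio:  {sevenzip_ratio:.0f}:1")
print(f"anonymous edit share:  {anonymous_share:.0%}")
print(f"edits per article:     {edits_per_article:.0f}")
```

With the rounded inputs this gives roughly 20:1 for bz2 and 175:1 for 7z, and it reproduces the 32% anonymous share and the average of about 70 edits per article quoted above.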