Full archive dumps are being processed again, first since 2010

There is no Wikistats issue about which I have received more mail than this one. Since 2010 several metrics on article content were no longer updated: word count, articles above 200 chars, mean size in bytes, percentage above 0.5 or 2 KB, database size, images, and links (internal, interwiki, external). Word count in particular was often mentioned.

Table: example for the Polish Wikipedia

All these metrics need to be collected from the ‘full archive dumps’, the dumps which contain the full raw content of every revision of every page. The sheer amount of data that needs to be processed made it no longer feasible to process those full dumps on a monthly basis. It didn’t help that I do rather ambitious cleaning up of the raw page content before counts are generated, e.g. so that word count approaches ‘readable body text’.
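For readers who want a feel for what such content metrics involve, here is a minimal Python sketch. It is not the Wikistats code: the function names, the crude markup stripping, and the choice to apply the 200-character threshold to cleaned text are all assumptions made for illustration, and real cleanup is far more involved.

```python
import re

def strip_markup(wikitext: str) -> str:
    """Very rough approximation of 'readable body text' (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", wikitext)                 # HTML-like tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)              # non-nested templates
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)                        # bold/italic quote runs
    return text

def content_metrics(articles: list[str]) -> dict:
    """Compute a few content metrics over the latest text of each article."""
    n = len(articles)
    sizes = [len(a.encode("utf-8")) for a in articles]       # raw size in bytes
    cleaned = [strip_markup(a) for a in articles]
    return {
        "articles": n,
        # Assumption: the threshold is applied to cleaned text, not raw wikitext.
        "articles_over_200_chars": sum(len(c.strip()) > 200 for c in cleaned),
        "mean_size_bytes": sum(sizes) / n if n else 0.0,
        "pct_over_512_bytes": 100.0 * sum(s > 512 for s in sizes) / n if n else 0.0,
        "pct_over_2048_bytes": 100.0 * sum(s > 2048 for s in sizes) / n if n else 0.0,
        "words": sum(len(c.split()) for c in cleaned),
    }
```

Even this toy version has to touch the full text of every page, which is why these metrics cannot be derived from dumps that carry only metadata.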

So in 2010 I switched to processing stub dumps for most Wikipedias. These contain all metadata for every revision, but not the raw page content. For sister projects with much smaller dumps I continued processing full archive dumps.

Now I can finally announce that I applied a fix which makes it possible to update those missing metrics on a roughly quarterly cycle. Full archive dumps are now processed on a different server, running as a continuous low-priority job, and the reporting process combines metrics from both servers.
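The combining step can be pictured roughly as follows. This is a hypothetical sketch, not the actual reporting code: the dictionary layout, the `(wiki, month)` keys, and the field names are assumptions chosen for illustration.

```python
# Hypothetical names for the metrics that only full archive dumps can supply.
CONTENT_FIELDS = ("words", "mean_size_bytes")

def combine_metrics(stub_metrics: dict, full_metrics: dict) -> dict:
    """Merge per-(wiki, month) rows from the stub-dump and full-dump jobs.

    Stub-dump rows exist for every month; full-dump rows arrive less often,
    so content fields are None for months not yet processed.
    """
    combined = {}
    for key, stub_row in stub_metrics.items():
        row = dict(stub_row)                      # copy, leave the input untouched
        full_row = full_metrics.get(key, {})
        for field in CONTENT_FIELDS:
            row[field] = full_row.get(field)      # None when not available yet
        combined[key] = row
    return combined
```

A report built on such a merge can show the monthly metrics as soon as they exist and fill in the content columns whenever the slower job catches up.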

In the last two weeks some 260 wikis were processed. Only 10 large wikis remain to be done: Arabic, English, French, German, Hebrew, Italian, Japanese, Spanish, Swedish, Russian. I expect that in a month’s time all but English will be ready. English may arrive (fingers crossed) a month later.

 

 

 


2 Responses to Full archive dumps are being processed again, first since 2010

  1. Nemo says:

    Thank you, Erik, I’m speechless. I think this should be circulated more widely: how about making a CentralNotice for all users above a certain edit count, or leaving a message on some strategic page (like Special:Statistics itself) on all Wikimedia wikis? There will be hundreds, no, thousands of users crunching tables and numbers for a while!

  2. Erik says:

    Nemo, I’m glad to be of service, and regret that it took so long. Rather than a CentralNotice I’d prefer people to be pleasantly surprised when they stroll past that page. Kind of fits the season. :-) People who really need to know asap may have read this blog.
