Wikimedia’s almost perfect memory

Many editors and readers of Wikimedia projects (and other MediaWiki wikis) know that for every article the full history of prior revisions is available online, for static linking, edit comparisons and browsing. But even many of these better informed people will not realize that for every edit of any article, a full copy of the whole article text is stored in an archive table.

This approach was chosen deliberately by the MediaWiki developers, in order to keep the PHP code base manageable and the retrieval and maintenance of older revisions efficient, at the cost of larger storage requirements for online databases and dumps. Consider the alternative of storing only diffs between revisions: rebuilding a historic version of an article with 5,000 prior revisions would be practically infeasible in real time. A mixed approach, storing a full copy every so often between diffs (say every 10th or 100th revision), would considerably complicate the (rare) deletion of an article revision from the database.
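To illustrate the tradeoff, here is a minimal sketch contrasting the two strategies. The helper names and the toy "patch" representation are my own invention, not MediaWiki's actual code: with full copies, fetching any revision is a single lookup; with a diff chain, rebuilding revision N means replaying every patch before it.

```python
# Sketch: full-copy storage vs. diff-chain storage.
# Illustrative only -- not MediaWiki's actual implementation.

# Full-copy storage: every revision is stored whole.
full_copies = ["v1 text", "v1 text plus edit", "v1 text plus edit, more"]

def get_revision_full(revisions, n):
    # O(1): a single lookup, regardless of history length.
    return revisions[n]

# Diff storage: only the base revision is whole; later revisions exist
# only as patches. Here a "patch" is simply a function that rebuilds
# the next text from the previous one, to keep the sketch short.
base = "v1 text"
patches = [lambda t: t + " plus edit", lambda t: t + ", more"]

def get_revision_diffs(base, patches, n):
    # O(n): every earlier patch must be replayed to reach revision n.
    text = base
    for patch in patches[:n]:
        text = patch(text)
    return text

# Both strategies yield the same text; only the cost profile differs.
assert get_revision_full(full_copies, 2) == get_revision_diffs(base, patches, 2)
```

With 5,000 prior revisions the loop in `get_revision_diffs` runs 5,000 times per page view, which is exactly the real-time cost the developers chose to avoid.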

The developers have taken great care to store article histories in such a manner that compression works very efficiently, both in online databases and dumps. Still, any approach to optimizing compression has its limits.
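The reason histories compress so well is that successive revisions of a page are nearly identical, so a stream compressor finds long repeats. The toy experiment below demonstrates the effect with zlib; it is only an illustration, and MediaWiki's own scheme differs in detail.

```python
import zlib

# Sketch: simulate a revision history where each revision is the same
# base text plus a small edit, then compress the concatenated whole.
# Illustrative only; MediaWiki's actual storage scheme differs.
base = "Lorem ipsum dolor sit amet. " * 50
revisions = [base + f"Edit number {i}." for i in range(200)]

raw = "".join(revisions).encode("utf-8")
compressed = zlib.compress(raw, 9)

ratio = len(raw) / len(compressed)
print(f"{len(raw)} bytes raw -> {len(compressed)} bytes compressed ({ratio:.0f}x)")
```

On near-identical revisions the ratio is dramatic, which is how 19 GB of uncompressed history can fit in a dump of a few hundred megabytes, and also why the compressible redundancy has limits once edits diverge.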

I thought it might be useful to give some stats showing how severe the problem is.

10 Wikimedia articles with largest uncompressed archive size

Here are two new weekly reports: Top 100 Wikimedia articles in any project ranked by uncompressed archive size and Top 100 Wikimedia articles in any project ranked by total edit count.

In both rankings one and the same bot is the top scorer: COIbot managed to produce 19 gigabytes of uncompressed article history for its LinkReports page on Meta, in well over 200,000 edits. Perhaps another place to store these results (e.g. a private wiki, or a site that does not keep all history forever), or a lower edit frequency, could be a solution.

Given that the full archive dump of Meta comprises only 250 MB, this article history compresses incredibly well. But it may be part of the explanation why producing dumps in two formats for Meta took 28 hours.

But the majority of articles in both lists are project housekeeping or discussion pages, and presumably many or even most edits there are manual. For example, Wikipedia:De kroeg is the Dutch equivalent of the English Wikipedia:Village_Pump, and the discussions in De Kroeg (‘The Pub’) are numerous and often very long. The English Village Pump is split into four separate sections, which are actually separate pages; this reduces the storage overhead somewhat, though not by a factor of four, as not all four sections are edited equally often.

Note: these stats are based on Wikimedia dumps, so the usual caveat applies: figures for the English Wikipedia lag behind, as no recent dump has been produced for quite a while. A redesign of the database dump scripts has been commissioned.

This entry was posted in Observations, Wikimedia Edit(or)s, Wikistats Reports.

7 Responses to Wikimedia’s almost perfect memory

  1. Graham87 says:

    There are stats for pages with the highest edit count on the English Wikipedia from April 2008 at Wikipedia:Pages with the most revisions. One can figure out the number of edits in a page by taking a diff between the edit with the lowest rev ID (usually the first one, except for early pages), and the most recent edit. Add 2 to the given number of intermediate revisions, and you’ll find that at the time of writing, Wikipedia:Administrator intervention against vandalism has 492,079 edits. I wouldn’t be surprised to find out that Wikipedia:Administrators’ noticeboard/Incidents takes up more space than the COIBot report on Meta. Speaking of the title of this post about an “*almost* perfect memory”, I would appreciate any comments you have about my user subpage, <a href=”” User:Graham87/Page history observations.

  2. Graham87 says:

    Oops, the last link needs an extra ““. The link is: User:Graham87/Page history observations. Is there a way to preview text on this blog that I’ve missed?

  3. Erik says:

    Thanks for the tips.

    As you say, the Wikipedia:Pages with the most revisions list is also not very up to date, albeit much better than the history dump.

    Rather than spending more time on this quick report, I’d rather await a new English dump after the dump process has been fixed, which is work in progress.

    “Is there a way to preview text on this blog that I’ve missed?”

    That would be a good idea.

    I’ll check your observations page tomorrow. Cheers, Erik.

  4. Erik says:

    I activated the WordPress ‘Live Comment Preview’ plugin. It seems to do what you asked for.

  5. Pingback: The AboutUs Weblog » Blog Archive » Tutorial: Reading Page Histories

  6. Graham87 says:

    I’m very, very late, but it turns out that an updated list of pages with the most edits on the English Wikipedia was released a few days before your post at Wikipedia:Database reports/Pages with the most revisions. I’ll make a redirect from the old page to the new location.

  7. Darrick says:

    Enough statistics; everything I am interested in I look for in Wikipedia, because only there can you find all the information you need.
