Many editors and readers of Wikimedia projects (and other MediaWiki wikis) will know that for every article the full history of prior revisions is available online, for permanent linking, edit comparisons and browsing. But even most of these better informed people will not realize that for every edit of any article a full copy of the whole article text is stored in an archive table.
This approach was chosen deliberately by MediaWiki developers in order to keep the PHP code base manageable, and retrieval and maintenance of older revisions efficient, at the cost of larger storage requirements for online databases and dumps. Consider the alternative, where only diffs between revisions are stored: rebuilding a historic version of an article with 5,000 prior revisions would be practically infeasible in real time. A mixed approach, where a full copy is stored every so often between diffs (say every 10th or 100th revision), considerably complicates the (rare) deletion of an article revision from the database.
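To make the trade-off concrete, here is a minimal sketch, in Python rather than MediaWiki's actual PHP and schema, of the mixed diff-plus-checkpoint scheme described above. All names (`History`, `make_patch`, the checkpoint interval) are hypothetical; the point is only the shape of the bookkeeping: a full copy every Nth revision, diffs in between, and reconstruction by replaying diffs forward from the nearest checkpoint.

```python
import difflib

def make_patch(old, new):
    """Toy character-level patch that rebuilds `new` from `old`."""
    ops = difflib.SequenceMatcher(a=old, b=new).get_opcodes()
    # For 'equal' spans only the offsets are kept, so the patch stays small.
    return [(tag, i1, i2, "" if tag == "equal" else new[j1:j2])
            for tag, i1, i2, j1, j2 in ops]

def apply_patch(old, patch):
    return "".join(old[i1:i2] if tag == "equal" else repl
                   for tag, i1, i2, repl in patch)

class History:
    """Store a full copy every `every` revisions, diffs in between."""
    def __init__(self, every=10):
        self.every = every
        self.revs = []  # each entry: ('full', text) or ('diff', patch)

    def add(self, text):
        n = len(self.revs)
        if n % self.every == 0:
            self.revs.append(("full", text))          # checkpoint revision
        else:
            self.revs.append(("diff", make_patch(self.get(n - 1), text)))

    def get(self, n):
        base = n - n % self.every          # nearest checkpoint at or before n
        text = self.revs[base][1]
        for k in range(base + 1, n + 1):   # replay diffs forward
            text = apply_patch(text, self.revs[k][1])
        return text
```

With `every=1` this degenerates into MediaWiki's actual choice: a full copy per edit, so `get()` is a single lookup. With a large interval, `get()` must replay a long diff chain, and deleting a revision in the middle of a chain forces the next diff to be recomputed against its new predecessor, which is exactly the complication mentioned above.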
The developers have taken great care to store article history in such a manner that compression works very efficiently, both in online databases and in dumps. Still, any approach to optimizing compression has its limits.
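A toy illustration (Python with zlib; the article and revision counts are invented, not Wikimedia data) of why this works so well: when near-identical revisions are compressed together, each revision beyond the first costs little more than its changes, whereas compressing each revision separately gains far less.

```python
import zlib

# Simulate an article of ~500 distinct words, plus 100 revisions
# that each change only a small trailing note.
base = " ".join(f"word{i}" for i in range(500))
revisions = [f"{base}\nEdit number {i} by some user." for i in range(100)]

# Compress every revision on its own, then all revisions concatenated.
separate = sum(len(zlib.compress(r.encode(), 9)) for r in revisions)
together = len(zlib.compress("".join(revisions).encode(), 9))

print(f"separately: {separate} bytes, together: {together} bytes")
```

Wikimedia's full-history dumps benefit in the same way, because consecutive revisions of a page sit next to each other in the dump and the compressor can encode each one largely as references to the previous one.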
I thought it might be useful to give some statistics on how severe the problem is.
Here are two new weekly reports: Top 100 Wikimedia articles in any project ranked by uncompressed archive size and Top 100 Wikimedia articles in any project ranked by total edit count.
In both rankings one and the same bot is the top scorer: COIbot managed to produce 19 gigabytes of uncompressed article history for its LinkReports page on Meta, in well over 200,000 edits. Perhaps another place to store these results (e.g. a private wiki, or a site that does not keep all history forever), or a lower edit frequency, could be a solution.
Given that the full archive dump of Meta comprises only 250 MB, this article history compresses incredibly well. But it may be part of the explanation why producing dumps in two formats for Meta took 28 hours.
But the majority of articles in both lists are project housekeeping or discussion pages, and presumably many or even most edits there are manual. For example, Wikipedia:De kroeg is the Dutch equivalent of the English Wikipedia:Village_Pump, and the discussions in De Kroeg ('The Pub') are numerous and often very long. The English Village Pump is split into four sections which are actually separate pages; this makes the storage overhead somewhat smaller, though not four times smaller, as not all four sections are edited equally often.
Note: these stats are based on Wikimedia dumps, so the usual caveat applies: figures for the English Wikipedia lag behind, as no recent dump has been produced for quite a while. A redesign of the database dump scripts has been commissioned.