The compaction of hourly page request files into daily, then daily into monthly is now operational. I fixed and simplified earlier scripts.
For Dec 2012, data reduction is as follows:
| 744 hourly files: | 65 GB compressed |
| 31 daily files: | 16 GB compressed |
| 1 monthly file: | 5 GB compressed |
Space is saved as follows:
1) each article title occurs only once instead of up to 744 times
2) bz2 compression
3) threshold of 5+ requests per month in final monthly file
Still, all versions retain hourly resolution.
Each file starts with a description of the file format. In a nutshell: the monthly total is followed by hourly counts, sparsely indexed; each count is preceded by two letters, one for day of month and one for hour of day.
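To make the space savings concrete, here is a minimal sketch of the merge idea: each title is stored once, with its hourly counts attached, and titles below the monthly threshold are dropped. The function name `merge_hourly` and the `(day, hour, title, count)` record layout are assumptions made for illustration; they are not taken from the actual compaction scripts, and the bz2 step is left out.

```python
from collections import defaultdict

def merge_hourly(records, threshold=5):
    """Merge (day, hour, title, count) records into one entry per title.

    Hypothetical helper: the record layout and name are illustrative only.
    """
    merged = defaultdict(dict)
    for day, hour, title, count in records:
        slots = merged[title]
        # Each title is kept once; hourly resolution lives in the inner dict.
        slots[(day, hour)] = slots.get((day, hour), 0) + count
    # Keep only titles that reach the monthly threshold (5+ requests).
    return {
        title: slots
        for title, slots in merged.items()
        if sum(slots.values()) >= threshold
    }
```

Because the inner dictionary only holds hours that actually had requests, rarely viewed titles cost very little space.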
http://dumps.wikimedia.org/other/pagecounts-ez/merged/
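For readers who want to unpack the hourly counts, here is a rough sketch of a decoder for the two-letter scheme described above. The exact letter mapping is an assumption here (day letter `A` = day 1, hour letter `A` = hour 0, and days past 26 continuing in ASCII order after `Z`); the description at the top of each file is the authority, so check it before relying on this. The name `parse_hourly` is made up for the example.

```python
import re

# Assumed encoding, to be verified against the header of the file:
#   day letter:  'A' = day 1, continuing in ASCII order past 'Z' up to day 31
#   hour letter: 'A' = hour 0 ... 'X' = hour 23
#   count:       decimal digits following the two letters
_TOKEN = re.compile(r"([A-Z\[\\\]^_])([A-X])(\d+)")

def parse_hourly(encoded):
    """Return {(day, hour): count} for a sparsely indexed hourly string."""
    counts = {}
    for day_ch, hour_ch, digits in _TOKEN.findall(encoded):
        day = ord(day_ch) - ord("A") + 1
        hour = ord(hour_ch) - ord("A")
        counts[(day, hour)] = int(digits)
    return counts
```

Under these assumptions, summing the returned values should reproduce the monthly total given on the same line, which makes a handy sanity check.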
As a spin-off, the new data stream is also used for new monthly page request reports, e.g. English Wikipedia. The full index contains about 800 wikis (alas, no friendly front-end yet).
One of the benefits of these archives is easy external archiving (e.g. on the Internet Archive). Like the tweet archive of the Library of Congress, this will be a rich Zeitgeist dataset for posterity.
The most requested non-existent files are split into three sections: articles, content in other namespaces, and binary files.
Caveat: redirects are not resolved. Encoded and unencoded URLs are counted separately.
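Anyone who wants to work around the second caveat can fold the two spellings together after the fact. The sketch below percent-decodes each title and sums the counts that end up under the same decoded title. The `fold_encoded_titles` name and the plain `{title: count}` input are assumptions for illustration, and redirects are still not resolved by this.

```python
from collections import Counter
from urllib.parse import unquote

def fold_encoded_titles(counts):
    """Sum counts of titles that differ only by percent-encoding.

    Hypothetical helper: `counts` maps a title as it appears in the
    file to its request count.
    """
    folded = Counter()
    for title, n in counts.items():
        # unquote() decodes %XX escapes as UTF-8 and leaves plain text alone.
        folded[unquote(title)] += n
    return dict(folded)
```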


Wikipedia and sister projects.




Hi Erik,
I know in the greater scheme of things 5GB is not very much, but for processing simplicity, is there a dump with daily or even monthly resolution available? Of course I could just take the first two columns of the 5GB download and discard the rest, but…
Hi Erik,
Nice work. But when counting page views, adding the views of redirect pages to their target pages would give more accurate data. This inaccuracy matters especially in the Chinese projects, because Traditional Chinese and Simplified Chinese titles are different pages that correspond to the same content. For example, “Germany” in Chinese is “德国” and “德國”. The former is the main article, with about 760~1000 daily views, and the latter is a redirect page with about 200~250 daily views. Summing them would give more accurate data.
If you look closely at the charts above, there is a note: “Wiktionary data minus excessive spam”. In recent months half of Wiktionary traffic came from one IP address, all requesting endless random pages. Some oversized test setup?
@Harry Burt, if you look at http://dumps.wikimedia.org/other/pagecounts-ez/merged/ there are now also files without hourly details, monthly totals only. They are indeed much smaller.