Monthly page requests, new archives and reports

Compaction of hourly page request files into daily files, and daily files into monthly files, is now operational. I fixed and simplified the earlier scripts.

For December 2012 the data reduction is as follows:

744 hourly files: 65 GB compressed
31 daily files: 16 GB compressed
1 monthly file: 5 GB compressed

Space is saved in three ways:

1) each article title occurs only once instead of up to 744 times
2) bzip2 compression
3) a threshold of 5+ requests per month in the final monthly file

Still, all versions retain hourly resolution.

Each file starts with a description of the file format. (In a nutshell: after the monthly total follow the hourly counts, sparsely indexed: each count is preceded by two letters, one for the day of month and one for the hour of day.)
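To illustrate the sparse indexing, here is a minimal decoder sketch. It assumes a plausible letter mapping (day 1 = 'A', hour 0 = 'A') and a made-up sample field; the authoritative mapping is in the format description at the top of each file, so treat this only as an illustration of the idea.

```python
import re

def decode_hourly(field):
    """Decode a sparse hourly-counts field such as 'AB3AC5BX7'.

    Each entry is: one letter for day of month (assumed A=1, B=2, ...),
    one letter for hour of day (assumed A=0, B=1, ...), then the count.
    Returns a list of (day, hour, count) tuples.
    """
    entries = []
    for day_c, hour_c, count in re.findall(r"([A-Za-z])([A-Za-z])(\d+)", field):
        day = ord(day_c.upper()) - ord("A") + 1
        hour = ord(hour_c.upper()) - ord("A")
        entries.append((day, hour, int(count)))
    return entries

# Hypothetical field: day 1 hour 1 -> 3 hits, day 1 hour 2 -> 5, day 2 hour 23 -> 7
print(decode_hourly("AB3AC5BX7"))
```

Note that only hours with nonzero counts appear, which is what makes the encoding compact for rarely requested pages.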

http://dumps.wikimedia.org/other/pagecounts-ez/merged/

As a spin-off, the new data stream is also used for new monthly page request reports, e.g. for the English Wikipedia. The full index covers about 800 wikis (alas, no friendly front-end yet).

One of the benefits of these archives is easy external archiving (e.g. on the Internet Archive), similar to the Library of Congress's tweet archive, which will make a rich Zeitgeist dataset for posterity.

Monthly page requests

The most requested non-existing files are split into three sections: articles, content in other namespaces, and binary files.

Missing files

Caveat: redirects are not resolved. Encoded and unencoded URLs are counted separately.

This entry was posted in Wikimedia View(er)s, Wikistats Reports.

5 Responses to Monthly page requests, new archives and reports

  1. Harry Burt says:

    Hi Erik,

    I know in the greater scheme of things 5GB is not very much, but for processing simplicity is there a dump with a daily or even monthly resolution available? Of course I could just take the first two columns of the 5GB download and discard the rest, but… :P

  2. Pingback: Wikimedia Research Newsletter, January 2013 — Wikimedia blog

  3. Alexchris says:

    Hi Erik,
    Nice work. But when counting page views, adding the views of redirect pages to their target pages would give more accurate data. This inaccuracy is especially large in the Chinese projects, because Traditional Chinese and Simplified Chinese are different pages corresponding to the same content. For example, “Germany” in Chinese is “??” and “??”. The former is the main article, with about 760–1000 daily views; the latter is a redirect page, with about 200–250 daily views. Summing them together would give more accurate data.

  4. Suseth says:

    If you look closely at the charts above, there is a note “Wiktionary data minus excessive spam”. In recent months half of Wiktionary's traffic came from one IP address, requesting endless random pages. Some oversized test setup?

  5. Erik says:

    @Harry Burt, if you look at http://dumps.wikimedia.org/other/pagecounts-ez/merged/ there are now also files without hourly details, monthly totals only. They are indeed much smaller.
