The compaction of hourly page request files into daily files, and of daily files into monthly files, is now operational. I fixed and simplified the earlier scripts.
For Dec 2012 the data reduction is as follows:

  744 hourly files: 65 GB compressed
   31 daily files:  16 GB compressed
    1 monthly file:  5 GB compressed
Space is saved in three ways:
1) each article title occurs only once, instead of up to 744 times
2) bz2 compression
3) a threshold of 5+ requests per month in the final monthly file
Still, all versions retain full hourly resolution, as the sketch below illustrates.
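To make the merge step concrete, here is a minimal Python sketch of the daily-to-monthly compaction. The line layout (title, total, sparse hourly counts, space-separated) and all names are assumptions for illustration only; the actual scripts differ.

```python
import bz2
from collections import defaultdict

THRESHOLD = 5  # keep only titles with 5+ requests in the month

def compact_month(daily_paths, out_path):
    """Merge per-title counts from daily files into one monthly file.
    Assumed line layout: 'title total sparse_hourly_counts'."""
    totals = defaultdict(int)    # title -> monthly total
    hourly = defaultdict(list)   # title -> sparse hourly count codes
    for path in daily_paths:
        with bz2.open(path, 'rt', encoding='utf-8') as f:
            for line in f:
                if line.startswith('#'):
                    continue  # skip the embedded format description (assumed '#'-prefixed)
                title, total, counts = line.rstrip('\n').split(' ', 2)
                totals[title] += int(total)
                hourly[title].append(counts)
    with bz2.open(out_path, 'wt', encoding='utf-8') as out:
        for title in sorted(totals):
            if totals[title] >= THRESHOLD:
                # each title is written once; hourly resolution is preserved
                out.write(f"{title} {totals[title]} {''.join(hourly[title])}\n")
```

This illustrates all three savings at once: one line per title, the monthly threshold, and bz2 on the output.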
Each file starts with a description of the file format. (In a nutshell: after the monthly total follow the hourly counts, sparsely indexed: each count is preceded by two letters, one for the day of the month, one for the hour of the day.)
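For readers who want to unpack the counts programmatically, here is a small decoder sketch. The exact character mapping below is an assumption (days 1-31 as consecutive ASCII characters from 'A', hours 0-23 as 'A' through 'X'); the authoritative mapping is in the format description at the top of each file.

```python
import re

# Assumed encoding (verify against the header in the files themselves):
#   day of month 1..31 -> consecutive ASCII from 'A' ('A'=1 ... '_'=31)
#   hour of day  0..23 -> 'A'=0 ... 'X'=23
COUNT_RE = re.compile(r'([A-_])([A-X])(\d+)')

def decode_counts(field):
    """Yield (day, hour, count) triples from a sparse counts field."""
    for day_ch, hour_ch, count in COUNT_RE.findall(field):
        yield ord(day_ch) - ord('A') + 1, ord(hour_ch) - ord('A'), int(count)
```

For example, decode_counts('AB5AC3') yields (1, 1, 5) and (1, 2, 3): 5 requests on day 1 in hour 1, and 3 more in hour 2.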
One of the benefits of these archives is easy external archiving (e.g. on the Internet Archive), similar to the Library of Congress's tweet archive; this will be a rich Zeitgeist dataset for posterity.
The lists of most requested non-existing files are split into three sections: articles, content in other namespaces, and binary files.
Caveat: redirects are not resolved, and encoded and unencoded URLs are counted separately.
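A consumer who wants combined counts for encoded and unencoded variants could normalize titles after the fact. A rough sketch (real title normalization has more cases, e.g. spaces vs. underscores):

```python
from collections import Counter
from urllib.parse import unquote

def merge_encoded(counts):
    """Merge counts for percent-encoded and plain variants of a title,
    e.g. 'Caf%C3%A9' and 'Café' collapse into one key."""
    merged = Counter()
    for title, n in counts.items():
        merged[unquote(title)] += n
    return merged

# Example: both spellings end up under 'Café'
print(merge_encoded({'Caf%C3%A9': 3, 'Café': 2}))  # Counter({'Café': 5})
```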