Note: the low figures for March 3 and 4 are an artifact: no data were collected for several hours on both days.
Since December 2007 all page requests to all Wikimedia projects have been counted, with hourly totals per article title written to file, thanks to Domas Mituzas. The amount of data collected this way is huge: over 100 million lines per day! For this reason Domas preserves only a couple of weeks of detailed history (though aggregate counts are kept for all time). To my delight Mathias Schindler told me at Alexandria that he had downloaded and preserved all data for posterity, and had even brought it with him on a portable hard drive, which I could borrow. Lars Aronsson had studied the files as well, and gave me useful background information.
My first action was to write a script that consolidates 24 hourly files into one daily file, without losing hourly granularity, but with about 50-60% file shrinkage, which eases the storage and download burden. There is much more to say about interpretation, validation and optimization of the files, which I’ll reserve for some other time.
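To give an idea of what such a consolidation involves, here is a minimal Python sketch. It is not the actual script. It assumes each hourly line has the form `project title count` separated by whitespace, and the output layout (daily total followed by `hour:count` pairs) is a hypothetical choice made for this illustration only. The function name `consolidate` is likewise invented.

```python
from collections import defaultdict


def consolidate(hourly_lines_by_hour):
    """Merge hourly page counts into one line per page per day.

    hourly_lines_by_hour: dict mapping hour (0-23) to an iterable of
    lines in the ASSUMED format 'project title count'.
    Returns a sorted list of lines in the HYPOTHETICAL format
    'project title daily_total hour:count,hour:count,...'.
    """
    per_page = defaultdict(dict)  # (project, title) -> {hour: count}
    for hour, lines in hourly_lines_by_hour.items():
        for line in lines:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip malformed lines
            project, title = parts[0], parts[1]
            try:
                count = int(parts[2])
            except ValueError:
                continue  # skip lines with a non-numeric count
            hours = per_page[(project, title)]
            hours[hour] = hours.get(hour, 0) + count

    merged = []
    for (project, title), hours in sorted(per_page.items()):
        total = sum(hours.values())
        # Only hours with requests are stored, so no granularity is lost.
        detail = ",".join(f"{h}:{c}" for h, c in sorted(hours.items()))
        merged.append(f"{project} {title} {total} {detail}")
    return merged


if __name__ == "__main__":
    sample = {0: ["en Main_Page 5", "ar Foo 1"], 13: ["en Main_Page 7"]}
    for line in consolidate(sample):
        print(line)
```

In a layout like this the project and title appear once per day instead of once per hour, which is where the saving comes from. How much is saved depends on the real data, so the sketch makes no claim about the shrinkage percentage.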
The charts on this page have been generated and tweaked manually using Excel. I’ll post other examples soon (distribution patterns per day of the week and time of day).
In the coming weeks I will continue to develop scripts that extract meaningful information from the current files, and start to produce server-generated charts for all projects and languages, using Ploticus. Once the promised wikistats server finally materializes, these charts could be refreshed at least weekly, perhaps daily.
After that I will see, in cooperation with others who may already be working on it, how to extend the scripts in order to collect other page view statistics (unique visitor counts and counts per country of origin would be prime examples). The sheer volume of data to process will pose a challenge. Whether sampling will be needed in some cases remains to be seen.
Frank Schulenburg asked me to visualize visitor patterns for the Arabic Wikipedia. The last figure shows something interesting. Contrary to what I expected, the ratio of page requests on the Arabic Wikipedia to the total for all Wikipedias also shows a weekly pattern. On closer inspection it turned out that all high points in the chart fall on Saturdays. I never knew that Arabic countries observe the weekend on different days, as explained e.g. in this article in Arab News. Forgive my cultural ignorance. By the way, the highest point in the chart falls on December 25.
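The check described above (one project's share of all requests per day, then the weekday on which the peaks fall) can be sketched in a few lines of Python. This is only an illustration: the function names `daily_share` and `peak_weekdays` are invented here, and the numbers in the usage example are made up, not taken from the real files.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def daily_share(project_counts, total_counts):
    """Return {day: project requests / total requests} for each day
    that has a non-zero total."""
    return {
        day: project_counts[day] / total_counts[day]
        for day in project_counts
        if total_counts.get(day)
    }


def peak_weekdays(share, top_n):
    """Return the weekday names on which the top_n highest shares fall."""
    top_days = sorted(share, key=share.get, reverse=True)[:top_n]
    return sorted({WEEKDAYS[day.weekday()] for day in top_days})


if __name__ == "__main__":
    # Invented figures, for illustration only.
    arabic = {date(2008, 1, 4): 10, date(2008, 1, 5): 30,
              date(2008, 1, 6): 10, date(2008, 1, 12): 30}
    everything = {day: 1000 for day in arabic}
    print(peak_weekdays(daily_share(arabic, everything), top_n=2))
```

Dividing by the total for all Wikipedias removes the overall weekly rhythm, so whatever pattern remains is specific to the one language.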
Again, to be sure, I will produce similar charts for other languages and projects. My focus will be on consolidated statistics, at least in the beginning. For particulars there are already several very fine reports:
- Daily view counts per article title (and monthly top 1000) by Henrik (temporarily halted).
- Daily averages and most accessed pages (per project) by Melancholie.