Today I released four new wikistats reports, which all focus on our global readership.
For years, a report has provided insight into how many pages each Wikipedia serves monthly.
More recent reports provide detailed cross sections of that traffic.
But our understanding of where in the world those pages are read could be improved.
The following reports, with varying levels of detail, aim to fill this gap.
- Wikipedia Page Views By Country – Overview
- Wikipedia Page Views By Country – Breakdown
- Wikipedia Page Views By Country – Trends
- Page Views Per Wikipedia Language – Breakdown
Like the other reports mentioned, these are based on the sampled server logs. One in a thousand server accesses is logged and stored for a limited time for analysis purposes. A company called MaxMind publishes a database that translates IP addresses into geographic information (and is even free at country level). Thanks to Mark Bergsma for building the MaxMind lookup filter. See below the screenshot for the dirty details.
The 1:1000 sampled Squid logs are scanned one day at a time. Counts per day are added to monthly totals. Since each record stands for a thousand original server accesses, totals are multiplied by 1000.
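A minimal sketch of that aggregation step, assuming the daily scan has already produced a count of sampled records per country. The function name, the data layout and the numbers are hypothetical and are not taken from the actual wikistats scripts.

```python
from collections import Counter

# Each logged record stands for a thousand original server accesses.
SAMPLING_FACTOR = 1000


def monthly_totals(daily_counts):
    """Add per-day sampled counts per country, then scale to estimated page views."""
    sampled = Counter()
    for day in daily_counts:
        sampled.update(day)  # adds this day's counts to the running totals
    return {country: n * SAMPLING_FACTOR for country, n in sampled.items()}


# Made-up sampled counts for two days, keyed by country code.
days = [{"NL": 12, "US": 340}, {"NL": 9, "US": 310, "KE": 1}]
totals = monthly_totals(days)
```

Scaling once at the end rather than per record keeps the daily counters small and makes the sampling factor easy to change in one place.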
The counts do not include page requests by bots (crawlers, spiders, etc.). Many bots make themselves known by sending identifying information with their page requests. Filtering those is easy. Unfortunately, quite a few bots choose to stay anonymous.
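Filtering self-identified bots could look roughly like the sketch below. It assumes the identifying information is the user agent string of the request and that a simple substring match is good enough. The marker list is illustrative only and is not the list used for the reports.

```python
# Illustrative substrings only; a real filter would need a longer, maintained list.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_self_identified_bot(user_agent):
    """Return True if the user agent string announces itself as a bot."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


# Hypothetical user agent strings for illustration.
requests = ["Googlebot/2.1", "Mozilla/5.0 (Windows NT 6.1) Firefox/3.6"]
human_requests = [ua for ua in requests if not is_self_identified_bot(ua)]
```

A filter like this catches only bots that announce themselves; anonymous bots pass through unchanged.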
On average, for every thousand pages viewed by one person on one day, just one server request ends up in the log (on average, meaning it can be zero on one day and two on another). But a thousand page views is a huge amount for an ordinary person. The vast majority of IP addresses that occur more than once on a single day belong to bots, which can easily issue many thousands of page requests in a day. Therefore all IP addresses that occur more than once are skipped. A few very avid readers may thus be missing (false negatives). A huge school that sends all outgoing traffic over one IP address may also produce a handful of log records, which are discarded (again false negatives). On the other hand, anonymous bots with modest activity may appear only once in the log and be counted (false positives). I am confident that these minor, unavoidable imprecisions have little influence on the final results.
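The rule of skipping every IP address that occurs more than once on a day can be sketched as follows. The record layout (an IP address paired with a country code) and the function name are assumptions for illustration; the addresses come from ranges reserved for documentation.

```python
from collections import Counter


def records_to_count(day_records):
    """Keep only the records whose IP address appears exactly once that day."""
    hits_per_ip = Counter(ip for ip, _country in day_records)
    return [(ip, country) for ip, country in day_records if hits_per_ip[ip] == 1]


# Made-up sampled records for one day: (IP address, country code).
day = [
    ("192.0.2.1", "NL"),
    ("198.51.100.7", "US"),  # appears twice, so both records are dropped
    ("198.51.100.7", "US"),
    ("203.0.113.9", "KE"),
]
kept = records_to_count(day)
```

Note that all records of a repeated address are dropped, not just the duplicates, which is what turns a busy shared address into a false negative.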
Note: do not mistake the page views presented in the reports for visits or even unique visitors. Those are entirely different metrics that cannot be deduced from the sampled log.
There is at least one clear anomaly in the presented numbers: 19% of traffic originating in Australia seems to go to the Japanese Wikipedia. This is almost certainly incorrect. Five large organizations (RIRs), organized roughly by continent, are responsible for handing out large ranges of IP addresses to internet service providers. Ronald Beelaard (who has a lot of experience in hunting down mischievous bots) was able to explain this: a large chunk of IP addresses had recently been reassigned by APNIC, and in the absence of proper administration that network defaults to country code AU. Time will cure this.
Also rather unexpected is that 85% of the page views to the Swahili Wikipedia originated from the USA. This may well be caused by the Google contest of recent months, in which nice prizes are offered for new content on the Swahili Wikipedia. Also keep in mind that access to the Swahili Wikipedia from Kenya and Tanzania has so far been very limited, with just tens of thousands of monthly views.