Wikipedia page views, a global perspective (2)

Today I released four new wikistats reports, which all focus on our global readership.
For years a report has provided insight into how many pages each Wikipedia serves monthly.
More recent reports provide detailed cross sections through traffic.

But our understanding of where in the world those pages are read could be improved.
The following reports, with varying levels of detail, aim to fill this gap.

Like the other reports mentioned, these are based on the sampled server logs. One in a thousand server accesses is logged and stored for a limited time for analysis purposes. A company called MaxMind publishes a database for translating IP addresses into geographic information (which is even free at country level). Thanks to Mark Bergsma for building the MaxMind lookup filter. See below the screenshot for the dirty details.

overview

Dirty details

The 1:1000 sampled squid logs are scanned one day at a time. Counts per day are added to monthly totals. Since each record stands for a thousand original server accesses, totals are multiplied by 1000.
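
The aggregation described above can be sketched roughly as follows. This is not the actual wikistats code: the record layout (a date, wiki, country tuple) and the function name are assumptions made purely for illustration.

```python
from collections import Counter

SAMPLE_RATE = 1000  # each logged record stands for 1000 server accesses


def monthly_totals(records):
    """Sum sampled records per (month, wiki, country) and scale them up.

    `records` is an iterable of (date, wiki, country) tuples, where date
    is an ISO string such as '2009-10-03'. This layout is hypothetical.
    """
    # Step 1: count sampled records one day at a time.
    daily = Counter()
    for date, wiki, country in records:
        daily[(date, wiki, country)] += 1

    # Step 2: add the daily counts to monthly totals ('YYYY-MM').
    monthly = Counter()
    for (date, wiki, country), count in daily.items():
        monthly[(date[:7], wiki, country)] += count

    # Step 3: multiply by the sampling rate to estimate real accesses.
    return {key: count * SAMPLE_RATE for key, count in monthly.items()}
```

Three sampled records for one wiki and country in one month would thus be reported as an estimated 3000 page requests.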

The counts do not include page requests by bots (crawlers, spiders, etc.). Many bots make themselves known by sending identifying information with the page requests. Filtering those is easy. Unfortunately quite a few bots choose to stay anonymous.
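
A minimal sketch of filtering self-identified bots, assuming the identifying information is the user agent string of the request. The pattern list and the function name are hypothetical, and a real filter would need many more identifying strings.

```python
import re

# Hypothetical pattern; real-world bots identify themselves in many more ways.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def is_self_identified_bot(user_agent):
    """Return True if the user agent string announces itself as a bot."""
    return bool(BOT_PATTERN.search(user_agent or ""))
```

Anonymous bots send nothing recognizable and pass this check untouched.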

On average, for every thousand pages viewed on one day by one person, just one server request ends up in the log (on average meaning it can be zero on one day, two on another). But a thousand page views is a huge amount for an ordinary person. By far most IP addresses that occur more than once on a single day are bots, which can easily issue many thousands of page requests in a day. Therefore all IP addresses that occur more than once are skipped. A few very avid readers may thus be missing (false negatives). A huge school that sends all outgoing traffic over one IP address may also produce a handful of log records, which are discarded (again false negatives). On the other hand, anonymous bots with modest activity may appear only once in the log and be counted (false positives). I am confident that these minor, unavoidable imprecisions have little influence on the final results.
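
The rule of skipping repeated addresses could look roughly like this, assuming the check is applied to one day of sampled records at a time. The (ip, wiki) record layout and the function name are illustrative assumptions, not the actual wikistats code.

```python
from collections import Counter


def drop_repeated_ips(day_records):
    """Keep only records whose IP address occurs exactly once that day.

    `day_records` is a list of (ip, wiki) tuples for a single day
    (a hypothetical layout). Addresses seen more than once in the sample
    are treated as probable bots and all of their records are dropped.
    """
    ip_counts = Counter(ip for ip, _ in day_records)
    return [(ip, wiki) for ip, wiki in day_records if ip_counts[ip] == 1]
```

An address that shows up twice in the 1:1000 sample loses both of its records, which is where the false negatives for avid readers and shared addresses come from.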

Note: do not mistake the page views presented in the reports for visits or even unique visitors. Those are entirely different metrics that cannot be deduced from the sampled log.

Anomalies

There is at least one clear anomaly in the presented numbers: 19% of traffic originating in Australia seems to go to the Japanese Wikipedia. This is almost certainly incorrect. Five large organizations (RIRs), organized roughly by continent, are responsible for handing out large ranges of IP addresses to internet service providers. Ronald Beelaard (who has a lot of experience in hunting down mischievous bots) was able to explain this: a large chunk of IP addresses had recently been reassigned by APNIC, and in the absence of proper administration that network defaults to country code AU. Time will cure this.

Also rather unexpected is that 85% of the page views to the Swahili Wikipedia originated from the USA. This may well be caused by the Google contest in recent months, in which nice prizes are offered for new content on the Swahili Wikipedia. Also keep in mind that access to the Swahili Wikipedia from Kenya and Tanzania has so far been very limited, with just tens of thousands of monthly views.


15 Responses to Wikipedia page views, a global perspective (2)

  1. Pingback: Infodisiac » Wikipedia page views, a global perspective

  2. Ivan Lanin says:

    Thank you, Erik. The data mean a lot for us to analyse the trend for our other projects (id, jv, su, ace, etc).

  3. ???? ????? says:

    It seems that there are too many requests for bs: and mk: coming from Slovenia. Maybe a similar case as with jp:/Australia (at a much smaller scale, of course).

  4. Waldir says:

    I find it interesting how Vatican City is way ahead of all others regarding monthly page views per capita (58, while the 2nd, Luxembourg, has only 19). I wonder what this means :)

  5. Darkoneko says:

    it probably means many tourists accessing Wikipédia from there :)

  6. Circeus says:

    Is there any possibility of making a weighted version against Internet access by country?

  7. Pingback: Wikipedia Page Views Per Country with Internet users « ??????? ????

  8. Pingback: Wikipedia Page Views By Country – Breakdown with Wikipedia Size and Quality « ??????? ????

  9. Pingback: Wikipedia Stats « My Tech Life

  10. Waldir says:

    @Circeus: Nice idea. Thanks to the WP Signpost, I just learned that this analysis was already made by Nikola Smolenski in his post Wikipedia Page Views Per Country with Internet users.

  11. Wikiuser says:

    Thanks for the time and patience to wade through this. As usual, poor administration (laziness) is part of the problem (re: IPs defaulting to AU). Let’s hope time will cure it.

  12. Pingback: WikiStu » Blog Archive » Wikipedia usage by country

  13. Stu West says:

    I continued this analysis and turned it into a map of Wikipedia usage worldwide. See http://wikistu.org/2010/02/wikipedia-usage-by-country/.

  14. Le Deluge says:

    I’ve just kicked around the numbers at the abovelinked smolenski.rs post – by my reckoning, en.wiki accounted for 52.1% of total hits, and servers reporting as US-based accounted for 53.1% of those en.wiki hits.

    On the basis that global usage is growing faster than US usage, those numbers should both soon (have already?) drop below 50%, which will be quite a big deal IMO. Also worth mentioning that aside from the Aus/Japan issue, a lot of multinationals have US IP’s, or VPN’s that appear to the internet in the US, so the US figure is overstated too.

  15. Pingback: Infodisiac » Wikipedia page edits, a global perspective
