What Wikipedia readers devour

Recently I wrote a new tool for Wikipedia which makes good use of the consolidated page request files and the page category system, to rank thematic sets of Wikipedia pages by popularity. Tool and request files are both hobby projects.


Visually challenged Satyan master demonstrating reading Malayalam Wikipedia using free software (E-speak screen reader). Author: Fotokannan, license: CC BY-SA 3.0

For selected languages you can browse the top 2500 or even top 10,000 most requested articles within a certain category or one of its subcategories. You can also browse the category hierarchy used for selection of these pages. Reports are grouped by language and month.

Sometimes entries in these lists seem oddly out of place. Any Wikipedia article can have tens of categories assigned to it. A popular article will rank high in any list where it’s featured, regardless of the category under review. Thus a well-known singer may be top ranking in a list about politicians, because he/she also played a minor or brief role in politics.
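To make the effect concrete, here is a minimal sketch of how such a ranking could be computed. It is not the actual tool: the function names, the dictionaries `subcats`, `members` and `requests`, and the toy data are all assumptions made for illustration.

```python
from collections import deque

def pages_in_subtree(root, subcats, members):
    """Collect all pages in `root` or any of its subcategories (breadth-first)."""
    seen, pages = {root}, set()
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        pages.update(members.get(cat, ()))
        for child in subcats.get(cat, ()):
            if child not in seen:  # category graphs can contain cycles
                seen.add(child)
                queue.append(child)
    return pages

def top_pages(root, subcats, members, requests, n=2500):
    """Rank the pages of a category subtree by request count, highest first."""
    pages = pages_in_subtree(root, subcats, members)
    return sorted(pages, key=lambda p: (-requests.get(p, 0), p))[:n]

# Hypothetical toy data: the singer is also categorized as a politician.
subcats = {"Politicians": ["Senators"], "Senators": ["Politicians"]}
members = {
    "Politicians": ["Famous Singer", "Career Politician"],
    "Senators": ["Some Senator"],
    "Singers": ["Famous Singer"],
}
requests = {"Famous Singer": 900000, "Career Politician": 40000, "Some Senator": 15000}

ranking = top_pages("Politicians", subcats, members, requests)
print(ranking)
```

Because the ranking only looks at total requests per page, the singer tops the politicians list even though politics is a minor part of the article.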

For a selected set of categories these stats will be refreshed monthly. Some popular languages, like Russian and Japanese, will be added once Unicode support is complete.

Michael Hale published a video the same day I first tweeted about this, which demonstrates a related, more interactive tool. I highly recommend watching that demo as well. The two approaches are quite different, each with its own merits.

We could do much more with article request counts. For instance we could weight the likelihood of a page popping up at Random article by its popularity. Purists may object, as the selection would no longer be really random, but we could rename the button, e.g. to ‘Feeling lucky?’.
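A minimal sketch of such popularity-weighted selection, assuming request counts are available as a simple title-to-count mapping (the function name and the toy numbers are made up for illustration):

```python
import random

def pick_weighted_article(requests, rng=random):
    """Pick one title, with probability proportional to its request count."""
    titles = list(requests)
    weights = [requests[title] for title in titles]
    return rng.choices(titles, weights=weights, k=1)[0]

# Hypothetical request counts for three articles.
requests = {"Main topic": 9000, "Niche topic": 900, "Obscure stub": 100}

rng = random.Random(42)  # fixed seed so the demo is repeatable
draws = [pick_weighted_article(requests, rng) for _ in range(10000)]
share = draws.count("Main topic") / len(draws)
print(f"'Main topic' was drawn in {share:.0%} of cases")
```

With these numbers the most popular article should come up in roughly 90% of draws. A real button would probably dampen the weights (e.g. take a square root or logarithm) so that less popular pages still get a fair chance.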

Examples

Politicians by country (and politics, and in some languages celebrities with party affiliations, as these are in the same category hierarchy): 
American (en), Brazilian (pt), British (en), Dutch (nl), French (fr), German (de), 
Indonesian (id), Italian (it), Polish (pl), Portuguese (pt), Spanish (es), Swedish (sv)

Museums by country: Australia (en), France (fr), Germany (de), India (en),
Indonesia (id), Mexico (es), Netherlands (nl), Poland (pl), Portugal (pt), Spain (es), 
Sweden (sv), UK (en), US (en)

Outreach: Bookshelf (en), Education (en), GLAM (en)

Misc: GLAM on wp:en (en), Lists on wp:enMeta (en)

Posted in uncategorized | 5 Comments

Boston Marathon bombings

On April 15, 2013 at 2:49 PM local time (EDT) two bombs exploded near the finish line of the Boston Marathon. At 3:19 PM the tragedy was first mentioned on the English Wikipedia, in the article Boston Marathon. At 3:27 PM a separate article about the attack was started, which received over 4000 updates in the next 4 days.

This chart shows how many hourly page views were registered in the next 4 days for related pages on the English Wikipedia: 2.5 million, including 287 thousand image views (not counting pictures embedded in an article), and 96 thousand image views on Commons.

WikipediaPageViewsBostonMarathonBombings

Although this is impressive, and testimony to Wikipedia’s relevance as a news aggregator during major events, earlier major news events drew even more readers. After Michael Jackson’s death Wikipedia servers received over 1.2 million requests for the English article about him in the first hour alone, and 8.7 million in the first 24 hours. John McCain’s announcement of Sarah Palin as running mate drew over 500 thousand page views in the first hour.

See also most viewed articles about terrorist incidents in the US by year and most viewed articles on terrorism, both on English Wikipedia, in March 2013.

Posted in Wikimedia View(er)s | 1 Comment

Monthly edits on Wikimedia wikis still on the rise

This chart shows that the overall volume of manual edits by registered users on all Wikimedia wikis combined is still increasing, slowly but steadily. When we subtract rookie edits (arbitrarily defined as a user’s first 100 edits) the number of highly effective edits (red line) rises even a bit more sharply.
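The split between rookie edits and the rest can be sketched as follows. This is an illustration only, not the Wikistats code: it assumes a chronological log with one user name per edit, and the names `split_edits` and `ROOKIE_EDITS` are invented here.

```python
from collections import Counter

ROOKIE_EDITS = 100  # the arbitrary cut-off mentioned above

def split_edits(edit_log, rookie_limit=ROOKIE_EDITS):
    """Count rookie vs. later edits in a chronological list of user names."""
    edits_so_far = Counter()
    rookie = seasoned = 0
    for user in edit_log:
        edits_so_far[user] += 1
        if edits_so_far[user] <= rookie_limit:
            rookie += 1      # one of this user's first `rookie_limit` edits
        else:
            seasoned += 1    # every edit after that
    return rookie, seasoned

# Hypothetical log: alice made 250 edits, bob only 40.
edit_log = ["alice"] * 250 + ["bob"] * 40
rookie, seasoned = split_edits(edit_log)
print(rookie, seasoned)
```

For a monthly chart the per-user counter would have to carry over from month to month, so that an editor stops counting as a rookie once and for all.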

EditsOnAllProjectsTillDec2012

Posted in Nice Charts, Wikimedia Edit(or)s | 9 Comments

Which single Wikimedia metric would inspire you most?

The statistics we maintain within the Wikimedia movement broadly fall into two categories. Part is there to inspire us, with limited feedback to our daily activities. Part is there to reveal patterns or to signal mishaps, both meant to lead to actionable knowledge. Of course the boundary is not strict. Your qualification may depend on how you judge your own capacity to influence trends. Broadly speaking much of Wikistats currently falls inside the first category, I would say: numbers to inspire.

I do not mean to belittle the importance of general metrics. I would rate the superb talks by web celebrity Hans Rosling as foremost inspirational, and less actionable, for all of us normal earthlings, who seldom make decisions with global impact (except at the ballot). If you’re a dedicated contributor to one of our projects it can brighten your day when trends are favorable. It can awe the press and the public at large, it feeds into our fundraiser, and it even helps to open doors (be it to photo shoot opportunities, or to GLAM institutes). But other than that it is nice to know.

Much as I welcome more actionable metrics, my focus in this blog post is on that inspirational aspect. Forget research, forget operations. As a thought experiment I want to raise this hypothetical question: If for one year we could only update one metric about any aspect of the Wikimedia movement, which one would best help to inspire you? It could be a complicated metric, but just one. Let me try to answer that myself. I’m sidestepping issues of data gathering. For the sake of argument let’s assume any metric is spot on. The issue is more how unambiguous each metric is and how much it tells about our success.

Would I choose page counts? Definitely not. We use this metric too often to impress our audience. Too often some of our projects play one-upmanship (or so it seems) with article creation bots. Also our definition of article is very permissive, and the barrier for importing data collections seems to lower steadily. Pardon the hyperbole: we could add millions of ‘articles’ from a star database any day. See also next metric.

Number of languages? We cite this everywhere so this must be a number which inspires. You may want to know that 83 of our wikis (of which 10 Wikipedias) are locked for updates, due to lack of editors. 61 of our 281 Wikipedias have fewer than 1000 articles. Many of those wikis have existed for half a decade. And even some wikis with 2K+ articles may consist mostly of year stubs; at least that happened a lot in early years. There is a huge wikistats table showing article counts per wiki per month. Note how many wikis grew by 1K or 2K articles at once, early in their history, often through bot-induced stubs. That still leaves a whopping number of Wikipedias which are an unbelievable success. No need to stretch the imagination I would say.

Editors? Likely candidate, but the number is somewhat ambiguous. We could gain in editors but lose faster in average editor activity, for instance because of ever increasing competition from other web initiatives (social sites). Even a large influx of new editors by itself could overtax our veteran users, and ultimately cause burn-out and estrangement. While this seems to be a problem we can only dream of in many projects, it is already realistic in some others.

Edits? Attractive option. We already count edits by registered users on primary content (mostly namespace 0). Ideally we would subtract reverts and reverted content (vandal fighters are arguably our most important lifeline, but if they needed to step up their activity due to an increase in vandalism that would not exactly put smiles on our faces). Although edits on talk pages etc. are very important, if they don’t lead to extra encyclopedic content, they are irrelevant, so let’s exclude those (like wikistats already does). This leads to ‘non reverting/reverted edits, by registered and anonymous users (thus excluding bots) on primary content’.

Page views? Better navigation tools could have a negative effect on our page view count, and we’d still welcome that. It has been said that Google strives to get people from their site asap, as that signals they found what they’re after faster. Only part of our traffic comes from intentional searches. There is also lots of random browsing, which is great fun and educational, but still I would rate that differently (see below).

Unique visitors? If for some reason the average number of visits per person dropped faster than the number of visitors rose, our usage would decline and from this one metric alone we wouldn’t know.

Completion of targets? I said our metric could be a complicated one. So what about a (weighted?) average of the percentage fulfillment for each of our strategic targets for 2015? While useful for general guidance, I personally think targets in many cases are more about how optimistic we were at the time they were set, and less about realistic expectations (to some degree the most realistic expectations about complex social movements are: no expectations at all, trends might be distinctly non-linear). Goals define a direction, targets pretend travel speed can be predicted, which works better in explored than in new territory. Also, for some of our targets we still have no operational definition, or method to measure them reliably (content quality, gender ratio).

Total amount donated? If we raised funds (with lower intensity) all year round, the amount donated monthly could seem a nice metric to measure appreciation by our readers, and thus our impact on the world. But the average amount donated could fluctuate because of external influences. Also the message to the public, and how this is delivered, is always in flux, banners and stories evolve, so the playing field to compare consecutive measurements is not level.

Number of donations? That comes closer. I assume external (economic) factors will influence the average amount donated more than the decision to donate at all. Just my guess.

Gender ratio of our editors? If that improved it would be hugely inspiring. But it is not my first choice, as long as we can peek so little from under our blindfolds. (It could even mean more male editors dropped out.)

Total visits? This would be my candidate of choice today. I imagine most people visit Wikipedia foremost to get a specific question answered. After their initial curiosity has been satisfied they might browse further and learn more, but their initial question is what drove them to visit. If someone visits us once a month Wikipedia clearly has a limited role in their lives. For that share of our visitors who visit us several times a week or even more (*) something essential has changed. They have learned that the answer to many of their questions is within reach, and affordable in terms of time invested. They might even have become more inquisitive in general. That I find very inspirational.

*According to comScore we received 2.45 M visits to all our projects in January 2012, and 2.50 M a year later, with somewhat lower values in between (mobile access may be underrepresented). On average that would be slightly over 5 visits per unique visitor per month. Like with other metrics above, there are complications: seasonality, uneven distribution, socially or geographically, and external influences (mainly rise of mobile, which actually will amplify this metric). 

Note these are my personal opinions, some based on limited anecdotal evidence.

I would love to hear your feedback. Which metric would you choose?

Posted in Musings | 13 Comments

Monthly page requests, new archives and reports

The compaction of hourly page request files into daily files, and of daily files into monthly files, is now operational. I fixed and simplified earlier scripts.

For Dec 2012 the data reduction is as follows:

744 hourly files: 65 GB compressed
31 daily files: 16 GB compressed
1 monthly file: 5 GB compressed

Space is saved as follows:

1) each article title occurs only once instead of up to 744 times
2) bz2 compression
3) threshold of 5+ requests per month in final monthly file
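
Steps 1 and 3 can be sketched as below. This is not the actual compaction script: the record layout `(title, day, hour, count)`, the function names and the sample data are assumptions for illustration, and bz2 compression is left out.

```python
from collections import defaultdict

MIN_MONTHLY_REQUESTS = 5  # threshold for the final monthly file

def merge_hourly(records):
    """Store each title once, with its counts keyed by (day, hour)."""
    merged = defaultdict(dict)
    for title, day, hour, count in records:
        hours = merged[title]
        hours[(day, hour)] = hours.get((day, hour), 0) + count
    return dict(merged)

def apply_threshold(merged, threshold=MIN_MONTHLY_REQUESTS):
    """Drop titles whose monthly total stays below the threshold."""
    return {
        title: hours
        for title, hours in merged.items()
        if sum(hours.values()) >= threshold
    }

# Hypothetical hourly records: (title, day of month, hour of day, requests)
records = [
    ("Boston_Marathon", 15, 19, 4),
    ("Boston_Marathon", 15, 20, 9),
    ("Rare_page", 3, 7, 1),
    ("Rare_page", 9, 2, 2),
]
monthly = apply_threshold(merge_hourly(records))
print(monthly)
```

In this toy run the rarely requested page (3 requests in the month) is dropped, while the other title keeps both of its hourly counts.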

Still, all versions retain hourly resolution.

Each file starts with a description of the file format. (In a nutshell: the monthly total is followed by hourly counts, sparsely indexed: each count is preceded by two letters, one for day of month, one for hour of day.)
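
A rough sketch of how such a sparse encoding could be written and read back. The exact letter mapping is an assumption here ('A' for day 1 and 'A' for hour 0, continuing through the ASCII table for days beyond 26), as are the function names; the description at the top of each file is authoritative.

```python
import re

def encode_counts(hourly):
    """Encode {(day, hour): count} as 'total' plus letter-prefixed counts."""
    parts = []
    for (day, hour), count in sorted(hourly.items()):
        day_letter = chr(ord("A") + day - 1)   # assumed: 'A' = day 1
        hour_letter = chr(ord("A") + hour)     # assumed: 'A' = hour 0
        parts.append(f"{day_letter}{hour_letter}{count}")
    return f"{sum(hourly.values())} {''.join(parts)}"

def decode_counts(field):
    """Inverse of encode_counts: return (total, {(day, hour): count})."""
    total_text, packed = field.split(" ", 1)
    hourly = {}
    for day_letter, hour_letter, digits in re.findall(r"(\D)(\D)(\d+)", packed):
        day = ord(day_letter) - ord("A") + 1
        hour = ord(hour_letter) - ord("A")
        hourly[(day, hour)] = int(digits)
    return int(total_text), hourly

sample = {(1, 0): 12, (1, 13): 7, (31, 23): 3}
encoded = encode_counts(sample)
total, decoded = decode_counts(encoded)
print(encoded)
```

Only hours with at least one request take up space, which is what makes the index sparse.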

http://dumps.wikimedia.org/other/pagecounts-ez/merged/

As a spin-off the new data stream is also used for new monthly page request reports, e.g. English Wikipedia. The full index contains about 800 wikis (alas no friendly front-end yet).

One of the benefits of these archives is easy external archiving (e.g. on Internet Archive), similar to the tweet archive of the Library of Congress, which will be a rich Zeitgeist dataset for posterity.

Monthly page requests

The most requested non-existent files are split into three sections: articles, content in other namespaces, binary files.

Missing files

Caveat: redirects are not resolved. Encoded and unencoded URLs are counted separately.
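
Anyone who wants combined totals for encoded and unencoded titles can merge them afterwards. A minimal sketch, assuming the counts are already loaded into a title-to-count mapping (the function name and sample data are made up); resolving redirects would additionally need a redirect table, which is not covered here.

```python
from collections import Counter
from urllib.parse import unquote

def merge_encoded_titles(counts):
    """Add up counts of titles that differ only in percent-encoding."""
    merged = Counter()
    for title, requests in counts.items():
        merged[unquote(title)] += requests  # 'Caf%C3%A9' becomes 'Café'
    return merged

# Hypothetical counts: the same page requested in encoded and unencoded form.
raw = {"Caf%C3%A9": 30, "Café": 12, "Boston_Marathon": 5}
merged = merge_encoded_titles(raw)
print(merged)
```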

Posted in Wikimedia View(er)s, Wikistats Reports | 5 Comments