Wiki Loves Monuments 2010-2012 – Retention / Stats per Country

A few weeks ago, at the Amsterdam Hackathon, Lodewijk Gelauff (aka Effeietsanders) asked me to look into the Wiki Loves Monuments (WLM) data and provide a list of images, who uploaded them, when, and for which country. The primary goal: to provide accurate data for the submission of the 2012 contest as a Guinness World Record (a record currently held by WLM 2011), and to show that it was indeed a global event (WLM scope in 2010: The Netherlands, 2011: Europe, 2012: Earth). Once these data were extracted, an update on retention rates and a breakdown per country for images and contributors would also be nice.

Retention rates

With cleaner data than half a year ago, we reran the editor retention numbers.

In Sep 2011, 3,497 new users contributed to WLM. Out of these, 397 still contributed in Nov 2011 or later, on any Wikimedia wiki, leading to an awesome retention rate for WLM 2011 of 11.4%!

In Sep 2012, 10,825 new users contributed to WLM, and in Oct 2012 another 412 (WLM Israel ended later). Out of these, 635 still contributed in Nov 2012 or later, on any Wikimedia wiki, leading to a retention rate for WLM 2012 of 5.7%. (This number will still grow, as did the number for 2011.)

Retention_Rate_WLM_2011
Retention_Rate_WLM_2012
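For those who like to verify: these percentages follow directly from the counts above. A quick check:

```python
# Retention = share of new WLM uploaders still contributing two months after the contest.
new_2011, retained_2011 = 3497, 397
new_2012, retained_2012 = 10825 + 412, 635   # Sep 2012 plus Oct 2012 (WLM Israel ended later)

print(f"WLM 2011 retention: {retained_2011 / new_2011:.1%}")   # 11.4%
print(f"WLM 2012 retention: {retained_2012 / new_2012:.1%}")   # 5.7%
```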


Breakdown by country

Here are four charts that show the breakdown per country of contributors and uploads, either for the last year 2012, or in comparison with earlier years (see also notes below).

Click image for full size
WLM Contributors by Country, Comparison
WLM_Contributors_by_Country_2010-2012
WLM Uploads by Country, Comparison
WLM_Uploads_by_Country_2010-2012
WLM Uploads by Country, Cumulative
WLM_Uploads_by_Country_2010-2012_Cumulative
WLM Uploads by Country 2012
WLM_Uploads_by_Country_2012


Notes:
1) Country is about the monument, not the uploader. A person may have contributed images for several countries.
2) Please do not read too much into the chart showing cumulative uploads for three WLM contests, and particularly not into % completed. Some countries may have more monuments than others, or have them more easily accessible (travel time). Some countries have a more active Wikimedia community than others. Some countries have fast internet, in which case uploading multiple images per monument is less of a burden.


Every WLM image in Wikimedia Commons is tagged with a template, which specifies year and country. In a wiki environment there is always some data normalization to be done: most of the data was added automatically, but some image sets were hand-coded (or uploaded with custom scripts), so a couple of variations occurred. For instance: why specify country code ‘de’ if you can also specify ‘{{lc:DE}}’ 🙂 It took a few iterations to detect and fix the anomalies.
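To give an idea of the kind of cleanup involved, here is a minimal sketch of such a normalization step. The raw values and the regex are illustrative assumptions, not the actual template handling used for the real dataset:

```python
import re

# Hypothetical raw country values as they might appear in hand-coded templates.
RAW = ["de", "DE", "{{lc:DE}}", " nl ", "Fr"]

def normalize_country(value: str) -> str:
    """Reduce observed variations to a plain lowercase two-letter country code."""
    value = value.strip()
    # Unwrap constructs like {{lc:DE}} (a lowercase magic word wrapped around the code).
    m = re.fullmatch(r"\{\{\s*lc\s*:\s*([A-Za-z]{2})\s*\}\}", value)
    if m:
        value = m.group(1)
    return value.lower()

print([normalize_country(v) for v in RAW])   # ['de', 'de', 'de', 'nl', 'fr']
```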


There are two raw data files: wlm_images_contributors.zip and wlm_retention.zip (sparsely documented). Let me know if you have questions.

Posted in Nice Charts, Wiki Loves Monuments, Wikimedia Edit(or)s | Tagged | 5 Comments

What Wikipedia readers devour

Recently I wrote a new tool for Wikipedia which makes good use of the consolidated page request files and the page category system to rank thematic sets of Wikipedia pages by popularity. Tool and request files are both hobby projects.


Visually challenged Satyan Master demonstrating reading Malayalam Wikipedia using free software (eSpeak screen reader). Author: Fotokannan, license CC BY-SA 3.0.

For selected languages you can browse the top 2,500 or even top 10,000 most requested articles within a certain category or one of its subcategories. You can also browse the category hierarchy used for the selection of these pages. Reports are grouped by language and month.

Sometimes entries in these lists seem oddly out of place. Any Wikipedia article can have tens of categories assigned to it. A popular article will rank high in any list where it’s featured, regardless of the category under review. Thus a well-known singer may be top ranking in a list about politicians, because he/she also played a minor or brief role in politics.
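For the curious, the ranking idea boils down to something like the sketch below. The category graph, page-to-category mapping and view counts here are made-up stand-ins; the real tool works from the consolidated request files and the live category system:

```python
# Hypothetical inputs: category -> subcategories, page -> categories, page -> monthly requests.
subcats = {"Politicians": ["Politicians by country"],
           "Politicians by country": ["Dutch politicians"]}
page_cats = {"Some Singer": ["Dutch politicians", "Dutch singers"],
             "Some Politician": ["Dutch politicians"]}
views = {"Some Singer": 120_000, "Some Politician": 45_000}

def categories_under(root):
    """All categories in the subtree below (and including) root."""
    seen, stack = set(), [root]
    while stack:
        cat = stack.pop()
        if cat not in seen:
            seen.add(cat)
            stack.extend(subcats.get(cat, []))
    return seen

def top_pages(root, limit=2500):
    """Rank pages by request count if they carry any category under root."""
    wanted = categories_under(root)
    ranked = [(count, page) for page, count in views.items()
              if wanted & set(page_cats.get(page, []))]
    return sorted(ranked, reverse=True)[:limit]

print(top_pages("Politicians"))
# [(120000, 'Some Singer'), (45000, 'Some Politician')] -- the singer outranks the politician
```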

For a selected set of categories these stats will be refreshed monthly. Some popular languages, like Russian and Japanese, will be added once Unicode support is complete.

Michael Hale published a video the same day I first tweeted about this, which demonstrates a related, more interactive tool. I highly recommend also watching that demo. The two approaches are quite different, each with its own merits.

We could do much more with article request counts. For instance, we could weight the likelihood of a page popping up via Random article based on its popularity. Purists may object, as the selection would no longer be truly random, but we could rename the button, e.g. to ‘Feeling lucky?’.
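Such a ‘Feeling lucky?’ button would amount to weighted random sampling over recent request counts; a minimal sketch with made-up numbers:

```python
import random

# Hypothetical monthly request counts per article.
views = {"Amsterdam": 90_000, "Obscure village": 300, "Pop star": 250_000}

def feeling_lucky():
    """Pick an article with probability proportional to its recent popularity."""
    pages, weights = zip(*views.items())
    return random.choices(pages, weights=weights, k=1)[0]

print(feeling_lucky())   # popular pages come up proportionally more often
```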

Examples

Politicians by country (and politics, and in some languages celebrities with party affiliations, as these are in the same category hierarchy): 
American (en), Brazilian (pt), British (en), Dutch (nl), French (fr), German (de), 
Indonesian (id), Italian (it), Polish (pl), Portuguese (pt), Spanish (es), Swedish (sv)

Museums by country: Australia (en), France (fr), Germany (de), India (en),
Indonesia (id), Mexico (es), Netherlands (nl), Poland (pl), Portugal (pt), Spain (es), 
Sweden (sv), UK (en), US (en)

Outreach: Bookshelf (en), Education (en),  GLAM (en)

Misc: GLAM on wp:en (en), Lists on wp:enMeta (en)


Posted in uncategorized | 5 Comments

Boston Marathon bombings

On April 15, 2013 at 2:49 PM local time (EDT), two bombs exploded near the finish line of the Boston Marathon. At 3:19 PM the first mention of the tragedy was made on the English Wikipedia, in the article Boston Marathon. At 3:27 PM a separate article about the attack was started, which received over 4,000 updates in the next 4 days.

This chart shows how many hourly page views were registered in the next 4 days for related pages on the English Wikipedia: 2.5 million, including 287 thousand image views (not counting pictures embedded in an article), and 96 thousand image views on Commons.

WikipediaPageViewsBostonMarathonBombings

Although impressive, and testimony to Wikipedia’s relevance as a news aggregator during major events, earlier major news events drew even more readers. After Michael Jackson’s death, Wikipedia servers received over 1.2 million requests for the English article about him in the first hour alone, and 8.7 million in the first 24 hours. John McCain’s announcement of Sarah Palin as running mate drew over 500 thousand page views in the first hour.

See also most viewed articles about terrorist incidents in the US by year and most viewed articles on terrorism, both on English Wikipedia, in March 2013.


Posted in Wikimedia View(er)s | 1 Comment

Monthly edits on Wikimedia wikis still on the rise

This chart shows that the overall volume of manual edits by registered users on all Wikimedia wikis combined is still increasing, slowly but steadily. When we subtract rookie edits (arbitrarily defined as a user’s first 100 edits), the line for the remaining, more seasoned edits (red line) rises even a bit more sharply.

EditsOnAllProjectsTillDec2012
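For clarity: ‘subtracting rookie edits’ here means discarding each user’s first 100 edits and counting only the rest. A sketch of that definition, with hypothetical per-user lifetime edit counts:

```python
def seasoned_edit_count(edits_per_user, rookie_cutoff=100):
    """Count edits excluding each user's first `rookie_cutoff` edits (the red line)."""
    return sum(max(0, n - rookie_cutoff) for n in edits_per_user.values())

# Toy example: a veteran with 350 edits and a newcomer with 40.
print(seasoned_edit_count({"veteran": 350, "newbie": 40}))   # 250
```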

Posted in Nice Charts, Wikimedia Edit(or)s | 9 Comments

Which single Wikimedia metric would inspire you most?

The statistics we maintain within the Wikimedia movement broadly fall into two categories. Part is there to inspire us, with limited feedback to our daily activities. Part is there to reveal patterns or to signal mishaps, both meant to lead to actionable knowledge. Of course the boundary is not strict: your qualification may depend on how you judge your own capacity to influence trends. Broadly speaking, much of Wikistats currently falls into the first category, I would say: numbers to inspire.

I do not mean to belittle the importance of general metrics. I would rate the superb talks by web celebrity Hans Rosling as foremost inspirational, less actionable, for all of us normal earthlings who seldom make decisions with global impact (except at the ballot box). If you’re a dedicated contributor to one of our projects, it can lighten your day when trends are favorable. It can awe the press and the public at large, it feeds into our fundraiser, it even helps to open doors (be it to photo shoot opportunities, or to GLAM institutes). But other than that it is nice to know.

Much as I welcome more actionable metrics, my focus in this blog post is still on that inspirational aspect. Forget research, forget operations. As a thought experiment I want to raise this hypothetical question: if for one year we could only update one metric about any aspect of the Wikimedia movement, which one would best inspire you? It could be a complicated metric, but just one. Let me try to answer that myself. I’m sidestepping issues of data gathering; for the sake of argument let’s assume any metric is spot on. The issue is more how unambiguous each metric is and how much it tells about our success.

Would I choose page counts? Definitely not. We use this metric too often to impress our audience. Too often some of our projects play one-upmanship (or so it seems) with article creation bots. Also our definition of article is very permissive, and the barrier for importing data collections seems to be lowering steadily. Pardon the hyperbole: we could add millions of ‘articles’ from a star database any day. See also the next metric.

Number of languages? We cite this everywhere, so this must be a number which inspires. You may want to know that 83 of our wikis (of which 10 Wikipedias) are locked for updates, due to lack of editors. 61 of our 281 Wikipedias have less than 1,000 articles. Many of those wikis have existed for half a decade. And even some wikis with 2K+ articles may consist mostly of year stubs; at least that happened a lot in the early years. There is a huge wikistats table showing article counts per wiki per month. Note how many wikis grew by 1K or 2K articles at once, early in their history, often bot-induced stubs. That still leaves a whopping number of Wikipedias which are an unbelievable success. No need to stretch the imagination, I would say.

Editors? A likely candidate, but the number is somewhat ambiguous. We could gain editors but lose faster in average editor activity, for instance because of ever increasing competition from other web initiatives (social sites). Even a large influx of new editors by itself could overtax our veteran users, and ultimately cause burn-out and estrangement. While this seems to be a problem we can only dream of in many projects, it is already realistic in some others.

Edits? An attractive option. We already count edits by registered users on primary content (mostly namespace 0). Ideally we would subtract reverts and reverted content (vandal fighters are arguably our most important lifeline, but if they needed to step up their activity due to an increase in vandalism, that would not exactly put smiles on our faces). Although edits on talk pages etc. are very important, if they don’t lead to extra encyclopedic content they are irrelevant here, so let’s exclude those (like wikistats already does). This leads to ‘non-reverting/reverted edits, by registered and anonymous users (thus excluding bots), on primary content’.
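Spelled out as a filter, that metric would look roughly like this; the edit records below are hypothetical, and the real wikistats bookkeeping differs in detail:

```python
def counts_toward_metric(edit):
    """'Non-reverting/reverted edits, by registered and anonymous users, on primary content'."""
    return (edit["namespace"] == 0          # primary (article) content only
            and not edit["is_bot"]          # exclude bots
            and not edit["is_revert"]       # exclude the reverting edit itself
            and not edit["was_reverted"])   # ... and the edit that got reverted

edits = [
    {"namespace": 0, "is_bot": False, "is_revert": False, "was_reverted": False},  # counts
    {"namespace": 1, "is_bot": False, "is_revert": False, "was_reverted": False},  # talk page
    {"namespace": 0, "is_bot": True,  "is_revert": False, "was_reverted": False},  # bot edit
    {"namespace": 0, "is_bot": False, "is_revert": True,  "was_reverted": False},  # revert
]
print(sum(counts_toward_metric(e) for e in edits))   # 1
```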

Page views? Better navigation tools could have a negative effect on our page view count, and we’d still welcome that. It has been said that Google strives to get people from their site asap, as that signals they found what they’re after faster. Only part of our traffic comes from intentional searches. There is also lots of random browsing, which is great fun and educational, but still I would rate that differently (see below).

Unique visitors? If for some reason the average number of visits per person dropped faster than the number of visitors rose, our usage would decline and from this one metric alone we wouldn’t know.

Completion of targets? I said our metric could be a complicated one. So what about a (weighted?) average of the percentage fulfillment for each of our strategic targets for 2015? While useful for general guidance, I personally think targets in many cases are more about how optimistic we were at the time they were set, and less about realistic expectations (to some degree the most realistic expectation about complex social movements is: no expectations at all, trends might be distinctly non-linear). Goals define a direction; targets pretend travel speed can be predicted, which works better in explored than in new territory. Also, for some of our targets we still have no operational definition, or no method to measure them reliably (content quality, gender ratio).

Total amount donated? If we raised funds (with lower intensity) all year round, the amount donated monthly could seem a nice metric to measure appreciation by our readers, and thus our impact on the world. But the average amount donated could fluctuate because of external influences. Also, the message to the public, and how it is delivered, is always in flux: banners and stories evolve, so the playing field for comparing consecutive measurements is not level.

Number of donations? That comes closer. I assume external (economic) factors will influence the average amount donated more than the decision to donate at all. Just my guess.

Gender ratio of our editors? If that improved it would be hugely inspiring. But it is not my first choice, if we can sneak a peek at so little from under our blindfolds. (It could even mean more male editors dropped out.)

Total visits? This would be my candidate of choice today. I imagine most people visit Wikipedia foremost to get a specific question answered. After their initial curiosity has been satisfied they might browse further and learn more, but their initial question is what drove them to visit. If someone visits us once a month, Wikipedia clearly has a limited role in their life. For that share of our visitors who visit us several times a week or even more (*), something essential has changed. They have learned that the answer to many of their questions is within reach, and affordable in terms of time invested. They might even have become more inquisitive in general. That I find very inspirational.

* According to comScore we received 2.45 billion visits to all our projects in January 2012, and 2.50 billion a year later, with somewhat lower values in between (mobile access may be underrepresented). On average that is slightly over 5 visits per unique visitor per month. As with the other metrics above, there are complications: seasonality, uneven distribution (socially or geographically), and external influences (mainly the rise of mobile, which actually will amplify this metric).

Note these are my personal opinions, some based on limited anecdotal evidence.

I would love to hear your feedback.  Which metric would you choose?


Posted in Musings | 13 Comments

Monthly page requests, new archives and reports

The compaction of hourly page request files into daily files, and daily into monthly, is now operational. I fixed and simplified earlier scripts.

For Dec 2012 the data reduction is as follows:

744 hourly files: 65 GB compressed
31 daily files: 16 GB compressed
1 monthly file: 5 GB compressed

Space is saved as follows:

1) each article title occurs only once instead of up to 744 times
2) bz2 compression
3) threshold of 5+ requests per month in final monthly file

Still, all versions retain hourly resolution.

Each file starts with a description of the file format (in a nutshell: after the monthly total follow the hourly counts, sparsely indexed: each count is preceded by two letters, one for the day of the month and one for the hour of the day).

http://dumps.wikimedia.org/other/pagecounts-ez/merged/
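As an illustration, decoding such a sparsely indexed count string could look like the sketch below. The concrete letter mapping used here (uppercase A = day 1, lowercase a = hour 0) is only an assumption for the example; the authoritative description is in the header of each file.

```python
import re

def decode_hourly(counts: str) -> dict:
    """Decode a sparse string like 'Ab5Ax17Bc3' into {(day, hour): requests}.

    Assumed encoding (illustration only): an uppercase letter for the day of the
    month (A = 1), a lowercase letter for the hour of the day (a = 0), followed
    by the request count for that hour.
    """
    result = {}
    for day_ch, hour_ch, count in re.findall(r"([A-Z])([a-z])(\d+)", counts):
        result[(ord(day_ch) - ord("A") + 1, ord(hour_ch) - ord("a"))] = int(count)
    return result

print(decode_hourly("Ab5Ax17Bc3"))   # {(1, 1): 5, (1, 23): 17, (2, 2): 3}
```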

As a spin-off, the new data stream is also used for new monthly page request reports, e.g. for the English Wikipedia. The full index contains about 800 wikis (alas, no friendly front-end yet).

One of the benefits of these archives is easy external archiving (e.g. on the Internet Archive), similar to the tweet archive at the Library of Congress, which will be a rich Zeitgeist dataset for posterity.

Monthly page requests

Most requested non-existent files are split into three sections: articles, content in other namespaces, and binary files.

Missing files

Caveat: redirects are not resolved. Encoded and unencoded URLs are counted separately.

Posted in Wikimedia View(er)s, Wikistats Reports | 5 Comments

Wikipedia page reads, breakdown by region

As we know, in some regions of the world people have easier access to Wikipedia than in others. The majority of reads come from the so-called Global North (*). Now, is this imbalance between North and South diminishing? It is not so easy to get the needle moving, as a whole range of regional differences come into play: average internet speed and latency differ widely between regions; in some regions internet access is almost ubiquitous, at any time or place, at home and at work, via desktop/mobile/tablet, yet in large parts of the world many can only access the internet via shared computers (schools, cyber cafes). The saying goes that the second billion internet users will use a mobile phone as their main access point, a true game changer. I hope and expect Wikipedia Zero will vastly speed up this development.

WMF’s monthly report card shows trends per region on reach and unique visitors (data from comScore), but those metrics are only part of the story. For one, comScore uses indirect measurement (as a consequence of the strict WMF privacy policy). Also, a metric like unique visitors does not weigh in total activity per user (one page view per month or a thousand both count as one UV).

Fortunately we can also count page views directly, and break these down by region, target wiki, mobile or main site. Wikistats has many reports on this, e.g. page views/edits per region, page views per platform and target wiki.

Here is another set of charts. This time the emphasis is not on absolute trends, but on relative content consumption per region. Again the focus is: do we see a shift in the global distribution of page reads?

Please remember: mobile in the charts is about traffic to the mobile site, not traffic from mobile devices! A considerable part of web access from phones and tablets goes to the main site.

Africa

The chart above shows how Africa still has a long way to go to gain equal access to the internet: with about 15% of the world’s population, 1.4% of Wikipedia page views is low, but still one and a half times as much as 3 years ago.

Asia
Australia
Central America
Europe
North America
Oceania
South America
North vs South (main site vs mobile site)
Sub-Saharan Africa


A page request is defined here as any request for html content (MIME type ‘text/html’). So it includes non-existent pages (e.g. 404s), and maybe other cruft. Unlike in some other reports, we do discern between human and bot page requests here. (**)

The data source for all these charts is one file, extracted from the same 1:1000 sampled server logs we already use for other reports. There is a rudimentary Perl script (***) to extract cross-sections from these data and produce a CSV file ready for import into a spreadsheet, so as to produce charts like the ones above. Over time we may feed some of the results into our monthly report card.
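The real extraction script is Perl; just to show the idea, here is a minimal Python sketch. It assumes a simplified input of tab-separated (region, month, sampled count) records, which is not the actual log format:

```python
import csv
from collections import defaultdict

def region_month_csv(infile, outfile, sampling_factor=1000):
    """Aggregate sampled request records into a region x month table for charting."""
    totals = defaultdict(lambda: defaultdict(int))   # region -> month -> requests
    months = set()
    with open(infile) as f:
        for line in f:
            region, month, count = line.rstrip("\n").split("\t")
            totals[region][month] += int(count) * sampling_factor   # undo 1:1000 sampling
            months.add(month)
    months = sorted(months)
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region"] + months)
        for region in sorted(totals):
            writer.writerow([region] + [totals[region].get(m, 0) for m in months])
```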

Of course our upcoming data beast Kraken will soon take care of data collection, with better resolution than ever, more flexible breakdowns, and faster availability. So consider this data stream not strategic, but rather a way to put legacy to good use to fill a void.

Disclaimer: some of the anomalies that occurred over time in our data have been filtered out (those data points are blanked). And we had some serious data collection mishaps over the years. For this reason data before 2010 are omitted altogether (****).

* = Not the same as geographical north; in fact WMF uses its own breakdown of North vs South.
** = We will overhaul definitions and move to standardized metrics, which by itself causes a new challenge: to somehow integrate old and new metrics into one timeline, until the new metrics have acquired enough history.
*** = Low on documentation, with lots of room for improvement on filter and aggregation options.
**** = We have chosen to include the first half of 2010 despite the major server under-reporting we faced at that time, because these charts are about relative rather than absolute traffic numbers, despite the possibility that server overload affected some times of the day, and hence some regions, somewhat more than others.


Posted in Nice Charts, Wikimedia View(er)s, Wikistats Reports | 2 Comments

Evernote ignores security flaw for months

Usually this blog is about Wikimedia statistics. Today I need to digress. My favorite cross-platform archival system is the hugely popular Evernote. I use Evernote for all kinds of data and images, and love the product. So much so that I stored all kinds of (mildly) personal data there. No longer.

As a paying subscriber I get a few goodies, like setting a pin code. In fact, apart from the higher upload limit there are just a few of these goodies, so this pin code feature is prominently shown on their sign-up page, especially on the iPad.

Evernote subscription benefits

In July I stumbled upon a security flaw, and reported it first to the help desk (after all, as a paid subscriber I get ‘top priority support’). They confirmed the bug quickly and said they had reported it to the engineers. A lively debate on their support site followed, with an Evernote employee participating. 3.5 months, several small updates and one major new release later, the bug still stands.

So what is it about? On iOS devices one can circumvent the pin code simply and quickly. All that is needed is to remove the app and download it again; that takes less than a minute. Since iOS 6 no Apple password is needed for updates. ‘Remember Everything’ Evernote helpfully suggests reusing the existing account, but forgets there was a pin code set. Oops!

Evernote dialog box


An Evernote employee responds as follows (paraphrasing, see the exact response here): it’s Apple’s fault, they changed their system; the iOS device has its own pin code which one needs to bypass first; also, Evernote supports encryption, which makes this less of an issue.

Evernote, if you rely on the general iOS login code, why did you offer an extra pin code in the first place, and brag about it? Maybe some users prefer a short device login code to keep their daily news and amusement within easy reach, but treat Evernote as their trusted vault and use a more solid pin code there. Also, encryption support is minimal: only for plain text, not for scans, PDFs, etc.

As much as I love your product, shouldn’t you care for your clients’ security before adding new goodies? How difficult can it be to either disable the import of stored account data, or remember the pin code as well?!

Posted in uncategorized | 2 Comments

Growth in article count at largest 20 Wikipedias

There is a lot of variation in article growth rate among mature Wikipedias. Growth slows down at some, and is steady or even accelerates at others. Many have tried to model these trends. I have little to offer in explanation, but I can offer an at-a-glance overview of article growth trends for the top wikis. You can switch between small and large charts, and either look at growth trends alone or match those visually with overall editor activity per wiki.



These charts have existed for a while; via the wikistats portal you can find similar charts for all other wikis. Navigate to the sitemap pages, e.g. for Wikipedia, and click the Summary link for any wiki. More links to grouped summary pages per project are at the bottom of the sitemap.

New: trends are now broken down by type of editor: registered editors, anonymous editors, bots.

Some observations

Manual article growth on the English Wikipedia has been slowing down since 2007, but seems to have stabilized in the last two years.

Article growth on the German, French, Italian, and Polish Wikipedias has been pretty stable for many years.

Both observations seem relevant and somewhat opposite to the low-hanging fruit hypothesis, as all of these wikis can be considered fully mature Wikipedias. More about this hypothesis here and here.

Unrecognized bots?

Usually spikes in editor activity are caused by bots. A few charts show spikes in the article creation rate for registered users. My hunch is that these are anomalies, caused by bots not being recognized by name (roughly meaning they do not contain ‘bot’ in their name) and not being registered as a bot either (which I believe is mandatory on many wikis).
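In other words, the classification boils down to roughly this; a simplified sketch (the real Wikistats logic lives in Perl and is more elaborate):

```python
def is_counted_as_bot(user_name: str, flagged_bots: set) -> bool:
    """Treat a user as a bot if flagged as such on the wiki, or if 'bot' occurs in the name."""
    return user_name in flagged_bots or "bot" in user_name.lower()

flagged = {"SieBot"}                      # hypothetical set of flagged bot accounts
for name in ["SieBot", "ArticleRobot", "RapidFiller"]:
    print(name, is_counted_as_bot(name, flagged))
# SieBot True, ArticleRobot True, RapidFiller False  <- the last kind causes the spikes
```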

Any feedback on bots that fall into this category is very welcome. If some of these bots are registered after all, next month’s charts will reflect this, for all history. Likely candidates for mis-attribution are the Spanish spike in 2011, Chinese in 2012, Vietnamese in 2011/2012, Norwegian in 2008, and Czech in 2010.

Portuguese Wikipedia

One particular issue prompted this overview, so let me ask: growth in new articles on the Portuguese Wikipedia dropped significantly early in 2011 (it seems to pick up again recently), while the number of active editors did not change much in recent years. Any thoughts on this in general? Also, it seems the counting methodology changed (not on wikistats), or at least was questioned, in March, according to this discussion (Google translates ‘anexos’ as ‘attachments’ (?)).

Thanks in advance for any insights into these trends.

Update: for further analysis you can download the data files (CSV).


Posted in Wikimedia Edit(or)s | 11 Comments

Wikistats editor counts are broken (update: recovery complete)

Wikistats editor counts are too low for some languages, for all reported months. The issue has made it into the German Wikipedia’s “Kurier” newsletter (18.1), and is discussed at this page.

First, there has not been a definition change or re-evaluation of editor counts. The current counts are wrong. Let me explain why all months display lower counts: on every monthly run of wikistats nearly all data are regenerated from the dumps. This is on purpose: this way new functionality and (rare) bug fixes apply to all months since the creation of the wiki. The flip side is that when stats scripts or dumps are broken, reports will show wrong data for all months, which is what happened now (a hybrid setup with data retention and update runs would complicate processing further and could easily be a source of errors itself). I have put up a notice on the stats pages.

I am investigating. So far test runs have been inconclusive. My first priority is to find out whether this is caused by a change in the dumps, the scripts, or a config change on the server. The latter is most likely: in the past month Wikistats data and scripts were moved to a new server, with a number of modifications to the overall configuration and shell scripts.

I will keep you posted on any new findings. My apologies for the inconvenience and confusion caused by this mishap.

Update June 17

The problem has been analyzed and fixed. A few weeks ago, during a substantial overhaul of the stats scripts (to add new metrics), an error crept in which was not caught during tests: as a result far too many articles were flagged as redirects, resulting in far too low counts for articles and editors. Redirect pages are not counted as articles, and, for consistency, not taken into account for edits and editors (*).
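For context, whether a page is a redirect is decided from its wikitext; a simplified sketch of that test (the actual code also handles localized redirect keywords and other subtleties):

```python
import re

REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)

def is_redirect(wikitext: str) -> bool:
    """Pages whose text starts with #REDIRECT are redirects and are excluded
    from the article count and, for consistency, from edit and editor counts."""
    return bool(REDIRECT_RE.match(wikitext))

print(is_redirect("#REDIRECT [[Boston Marathon bombings]]"))          # True
print(is_redirect("The '''Boston Marathon''' is an annual race ..."))  # False
```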

It will still take about 7-10 days to reprocess all dumps, slightly delayed further by the scheduled maintenance on the stats server, which is still ongoing.

Thanks so much for your patience.

*: This is in itself a point that could be debated, but this is how it works, or is supposed to work. For one, it prevents skewing of the edits-per-article metric.

Update June 26

All 800+ wikis have been updated now.


Posted in uncategorized | 1 Comment