WikiProject Medicine Translation Task Force

At Wikimania 2014 James Heilman, a Canadian emergency room physician, gave a presentation on Wikipedia and medicine.
He explained how leading non-profit health organizations like the Cochrane Collaboration, Cancer Research UK and the National Institutes of Health (NIH) help to improve Wikipedia’s content, and that of its sister projects. In particular James talked about the WikiProject Medicine Translation task force, where medical content on the English Wikipedia is improved and simplified by the non-profit Wiki Project Med Foundation (100 people from 20 countries), then translated into many languages by volunteers from Translators without Borders, co-founded by CEO Lori Thicke.
In good open project fashion the ambitions are stellar: bring 100 medical articles to good/featured article status (GA/FA), and translate those, plus 1000 abbreviated articles, into as many languages as there are Wikipedias. The project page shows their very impressive progress in languages such as Hindi, Chinese, Persian, Indonesian, Turkish and Swahili. But many more language projects have been started, also for languages where overall Wikipedia coverage is very limited, like Quechua and Yoruba, to name just two.

After the conference I reached out to the project and made a script to parse the project page, extract the links to the published articles for all languages, look up the monthly page views in our monthly aggregated page view dump and regularly present the results in a (I hope) informative status page, with a (I’m certain) boring layout. As page views per article are currently only collected for WMF’s non-mobile site, the extra mobile page views were by necessity estimated (we do know the overall percentage of mobile traffic per wiki).
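The mobile estimate described above can be sketched as follows. This is a minimal illustration, not the actual script: the function names are hypothetical, and it assumes each article has the same mobile share as its wiki as a whole.

```python
def estimate_total_views(non_mobile_views: float, mobile_share: float) -> float:
    """Scale non-mobile page views up to an estimated total (mobile + non-mobile).

    mobile_share is the fraction (0 <= share < 1) of the wiki's overall
    traffic that goes to the mobile site. Assumption: the article follows
    the wiki-wide traffic mix.
    """
    if not 0.0 <= mobile_share < 1.0:
        raise ValueError("mobile_share must be in [0, 1)")
    return non_mobile_views / (1.0 - mobile_share)


def estimate_mobile_views(non_mobile_views: float, mobile_share: float) -> float:
    """Estimated mobile views: the part of the total that was not counted."""
    return estimate_total_views(non_mobile_views, mobile_share) - non_mobile_views
```

For example, 750 counted non-mobile views on a wiki with a 25% mobile share would suggest roughly 1000 views in total, 250 of them on the mobile site.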

For many languages the stats are clearly encouraging: on the Japanese Wikipedia the 5 articles get on average almost 25K views each per month, on Spanish 23K, Italian 15K. For some languages, particularly those where Wikipedia is still in the start-up phase after many years, the numbers seem disappointing. Some may indeed be, but there is a technical artifact that also comes into play (*).

Suppose the 23 monthly page views for the article on Dengue Fever on the Farsi Wikipedia (110M speakers) are indeed accurate (it may well be that the technical artifact fools us here, but suppose). What if someone prints 10 copies of the article and puts these up for display at 10 health posts? Wouldn’t that already make it worthwhile?

* Tech details: one issue with the stats is that many of the lowest-scoring ‘page titles’ are actually redirects. I query the API to find the proper article to which the redirect resolves, but as there is no standard encoding for the page titles in the dump file (titles are counted and written in the encoding in which they are received), not all resolved redirects were actually found in the dump file. A good reason to apply standard encoding to page titles, if not before actual counting takes place (which may be too costly for this real-time process), then in the aggregation phase (post-processing).
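To illustrate what standard encoding in the aggregation phase could look like, here is a rough sketch. It is not the code used for the dumps: the function names are hypothetical, and the chosen canonical form (percent-decoded UTF-8, underscores for spaces, Unicode NFC) is an assumption.

```python
import unicodedata
from urllib.parse import unquote


def normalize_title(raw: str) -> str:
    """Bring a page title into one canonical form (an assumed convention)."""
    # Decode percent-escapes such as %C3%A9; invalid bytes become U+FFFD.
    title = unquote(raw, encoding="utf-8", errors="replace")
    # Treat spaces and underscores as the same character.
    title = title.replace(" ", "_")
    # Compose characters, so 'e' + combining accent equals the single 'é'.
    return unicodedata.normalize("NFC", title)


def aggregate_views(rows):
    """Sum view counts per normalized title; rows are (raw_title, views) pairs."""
    totals = {}
    for raw_title, views in rows:
        key = normalize_title(raw_title)
        totals[key] = totals.get(key, 0) + views
    return totals
```

With titles merged this way, a redirect target resolved via the API only needs to be normalized once before the lookup. One caveat: a title that legitimately contains a literal percent sign would be altered by the decoding step, so a real implementation would need to know which encoding each request arrived in.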

Posted in uncategorized | Leave a comment

Setting the standard for a new unit of measurement can be tedious and even hazardous

A small essay for your amusement: I am reading a book about how some of today’s most fundamental units of measurement were defined and calibrated.

The more precision one requires, the harder a measurement becomes (this was a known adage even before Heisenberg). In 1790 a new length unit was proposed: the meter. It would consist of 1/1000 of 1/10,000 of the distance from the equator to the poles. It only remained to measure that distance as precisely as possible.

The French took the lead here, and thus the requirement for which meridian to use as a base line was stated as: “the one with the longest stretch over land, being well charted territory, which ends at both sides at sea level”. By sheer coincidence this happened to be in France.


So what was needed was to measure the total distance between both ends (near Dunkirk and Barcelona) by triangulation, and measure the length of one side of one of those triangles in terms of a provisional meter (using a platinum bar, compensating for temperature at each position, and for the curvature of the road).

Two expeditions, one headed by Delambre, the other by Méchain, were formed to do the triangulation first, then measure the exact latitude of the end-points. They were to report findings within a year. It would take six years to arrive at acceptable results.

The triangulation itself took several months, as many mountains lay on the path. One of numerous perils encountered was an angry mob in Paris. People discovered the esoteric instruments the team carried in their baggage, and suspicion arose that these might be spying tools, to aid reactionary forces opposing the Revolution. An angry mob gathered and demanded an explanation. That explanation had better not fall on deaf ears, because at that time a mob, when in doubt, tended to regard the guillotine as the safer option. Fortunately the team leader was an experienced teacher, so he knew how to balance between saying too little and saying too much (and thus losing most of the audience), but the ‘trial’ still lasted for many hours.

After the triangulation was done all that remained was getting the exact positioning of the end points. A new device, called the Borda Circle (with two telescopes), made it possible to determine the angle between two stars with twice as much precision as before. All that remained was to repeat the measurement 10,000 times to reduce the human reading error by averaging the outcomes.

Unfortunately the expedition leader suffered an almost fatal accident and recovery took months, during which the expedition could not travel. So the team settled down in a place just 2 (provisional) kilometers away from the end-point of the meridian. Biding their time, they took it upon themselves to repeat the initial measurements and do another batch of 10,000. To their dismay the two averages diverged noticeably. They knew that the earth is not a perfect sphere (the Earth’s radius at the equator is about 6,378 km, roughly 21 km more than the radius at the poles) [3]. What they didn’t know yet (and which brought them to despair) was that every meridian has a different length, as the earth’s curvature isn’t even uniform for every place at the same latitude. It took years to get this sorted out.


This story came from an excellent book by Michiel van Straten, called “Tien verdwenen dagen” (“Ten lost days”, alas in Dutch only), about humanity’s struggle to define good units of measurement.

There is a story about the difficult transition from the Julian to the Gregorian calendar (which has still not been completed, as this Wikipedia chart shows); a story about how, before the invention of time zones, not only did every town have its own unique time, but with the emergence of railroads different railroad companies used a different time for the same town (and we think planning a trip with stop-overs today is time consuming); a story about Napoleon’s new metric calendar, and how the Catholic Church made him retract it after several years (10-day weeks with only one day off didn’t help either); stories about long debates to establish which meridian to make the Prime Meridian, and where to draw the international date line; and many more.



Isotype diagrams are now easier to build on Wikipedia

Did you ever study a table with many large numbers, where the moment you put it away you realized nothing of what you had just seen had stuck? I guess most of us suffer from this handicap: large numbers are difficult to absorb or evaluate.

In the 1930s Gerd Arntz created a coherent set of 4000 pictograms and together with Otto Neurath built from these ‘words’ a ‘language’ called Isotype.

Wouldn’t it be nice if we could use a similar method to convert numbers into symbols on Wikipedia? Now we can:

I created a small Lua script (my first) called ‘Repeat_symbols’ to translate numbers into icons. Within 20 minutes the entire script had been replaced by a better version. Thanks again, Jackmcbarn.
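The idea of turning a number into a row of repeated icons can be sketched in a few lines. This is a Python illustration only, not the Lua module itself: the function name, its parameters and the cap on the number of symbols are all assumptions made for the example.

```python
def repeat_symbols(value: float, per_symbol: float, symbol: str = "*",
                   max_symbols: int = 50) -> str:
    """Return `symbol` repeated once for every `per_symbol` units in `value`.

    The count is rounded to the nearest whole symbol and capped at
    `max_symbols` so one huge value cannot blow up a table row.
    """
    if per_symbol <= 0:
        raise ValueError("per_symbol must be positive")
    # Round half up (Python's built-in round() rounds halves to even).
    count = int(max(value, 0) / per_symbol + 0.5)
    return symbol * min(count, max_symbols)
```

Called as `repeat_symbols(17, 5)`, this yields three symbols, each standing for 5 units. On a wiki page the symbol would be an image rather than a character.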




Using fast and slow traffic to symbolize land size was a compromise. I tried many symbols but most didn’t work at this small size. Suggestions welcome.

The following image is a tiny section from a huge table, that compares all European countries (and then some), in terms of population, land area, GDP. You can see at a glance that Russia is much larger in land area than my country: the Netherlands. Who would have thought 🙂

What did surprise me was that Russia’s GDP is not even 3 times as much as that of the Netherlands. I told several people and they hardly believed it. (click to zoom)

Also, instead of presenting the raw numbers I show all metrics as percentages of the EU total, which makes it much easier to evaluate and even remember some of them, especially when you see the full table with European countries (and a few more).




Traffic to Wikipedia’s mobile site is growing fast

Since 2008 WMF has counted monthly page views for the non-mobile site.
Since June 2010 it has also counted them for the mobile site.

From the respective monthly totals we can calculate which share of the traffic goes to the mobile site. Evidently this share has grown dramatically over recent years.
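As a small worked example of that calculation (the function name and the sample numbers are made up for illustration):

```python
def mobile_share_percent(mobile_views: int, non_mobile_views: int) -> float:
    """Percentage of all page views that went to the mobile site."""
    total = mobile_views + non_mobile_views
    if total == 0:
        return 0.0  # no traffic at all: report 0% rather than divide by zero
    return 100.0 * mobile_views / total


# Hypothetical monthly totals for one wiki: (mobile, non-mobile)
monthly_totals = {"2013-01": (20, 180), "2014-01": (50, 150)}
trend = {month: mobile_share_percent(m, d) for month, (m, d) in monthly_totals.items()}
```

Here `trend` maps each month to its mobile share, 10% and 25% for the two sample months.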

The first chart shows the trend for the eight most read Wikipedias.


The second chart shows the same trends, now for the nine ‘most mobile’ Wikipedias
(which are also above a popularity threshold of 1 million views a month).




Please don’t confuse traffic to the mobile site with traffic from mobile devices. One can choose to visit the non-mobile site from a phone or tablet. One can choose to visit the mobile site from a desktop computer.

These numbers have been collected with webstatscollector. There are a number of issues with that tool. My colleagues Christian Aistleitner and Andrew Otto are working on a new version of the tool, which will be more robust, more precise in which messages to count, and will draw data from the new Kafka infrastructure instead of direct messages from each server (via udp2log). Later on, with that new infrastructure, we will also be able to do a more complete breakdown, by country, and hence by region.

Data files

The following data files are available for offline analysis:

Pageview reports

The Wikipedia pageview reports now also show % mobile for the last 24 months. Example: pageviews for Wikipedia, all platforms, normalized.

Breakdown by region (sort of)

Here, for what it’s worth, is a breakdown by region, but languages spoken in several regions are listed separately. So please take these regional results with a grain of salt.

region: Africa

languages:aa:Afar, af:Afrikaans, ak:Akan, am:Amharic, arz:Egyptian Arabic, bm:Bambara, ee:Ewe, ff:Fulfulde, ha:Hausa, hz:Herero, ig:Igbo, kab:Kabyle, kg:Kongo, ki:Kikuyu, kj:Kuanyama, kr:Kanuri, lg:Ganda, ln:Lingala, mg:Malagasy, ng:Ndonga, nso:Northern Sotho, ny:Chichewa, om:Oromo, rn:Kirundi, rw:Kinyarwanda, sg:Sango, sn:Shona, so:Somali, ss:Siswati, st:Sesotho, sw:Swahili, ti:Tigrinya, tn:Setswana, ts:Tsonga, tum:Tumbuka, tw:Twi, ve:Venda, wo:Wolof, xh:Xhosa, yo:Yoruba, zu:Zulu
perc mobile: 22.5%

regions: Africa/Asia
perc mobile: 37.8%

region: Artificial
languages:eo:Esperanto, ia:Interlingua, ie:Interlingue, io:Ido, jbo:Lojban, nov:Novial, vo:Volapük
perc mobile: 13.1%

region: Asia
languages:ab:Abkhazian, ace:Acehnese, arc:Aramaic, as:Assamese, az:Azeri, ba:Bashkir, bcl:Central Bicolano, bh:Bihari, bjn:Banjar, bn:Bengali, bo:Tibetan, bpy:Bishnupriya Manipuri, bug:Buginese, bxr:Buryat, cbk-zam:Chavacano, cdo:Min Dong, ceb:Cebuano, ckb:Sorani, cv:Chuvash, diq:Zazaki, dv:Divehi, dz:Dzongkha, fa:Persian, gan:Gan, glk:Gilaki, gu:Gujarati, hak:Hakka, he:Hebrew, hi:Hindi, hy:Armenian, id:Indonesian, ii:Yi, ilo:Ilokano, ja:Japanese, jv:Javanese, kaa:Karakalpak, kbd:Kabardian, kk:Kazakh, km:Khmer, kn:Kannada, ko:Korean, krc:Karachay-Balkar, ks:Kashmiri, ku:Kurdish, ky:Kirghiz, lad:Ladino, lbe:Lak, lo:Laotian, map-bms:Banyumasan, min:Minangkabau, ml:Malayalam, mn:Mongolian, mr:Marathi, mrj:Western Mari, ms:Malay, my:Burmese, myv:Erzya, mzn:Mazandarani, ne:Nepali, new:Nepal Bhasa, or:Oriya, os:Ossetic, pa:Punjabi, pag:Pangasinan, pam:Kapampangan, pi:Pali, pnb:Western Panjabi, ps:Pashto, sa:Sanskrit, sah:Sakha, sd:Sindhi, si:Sinhala, su:Sundanese, ta:Tamil, te:Telugu, tet:Tetum, tg:Tajik, th:Thai, tk:Turkmen, tl:Tagalog, tpi:Tok Pisin, tt:Tatar, tyv:Tuvan, udm:Udmurt, ug:Uyghur, ur:Urdu, uz:Uzbek, vi:Vietnamese, war:Waray-Waray, wuu:Wu, za:Zhuang, zh:Chinese, zh-classical:Classical Chinese, zh-min-nan:Min Nan, zh-yue:Cantonese
perc mobile: 33.0%

region: Europe
languages:als:Alemannic, an:Aragonese, ang:Anglo-Saxon, ast:Asturian, av:Avar, bar:Bavarian, bat-smg:Samogitian, be:Belarusian, be-x-old:Belarusian (Taraškievica), bg:Bulgarian, br:Breton, bs:Bosnian, ca:Catalan, ce:Chechen, co:Corsican, cs:Czech, csb:Cassubian, cu:Old Church Slavonic, cy:Welsh, da:Danish, de:German, dsb:Lower Sorbian, el:Greek, eml:Emilian-Romagnol, et:Estonian, eu:Basque, ext:Extremaduran, fi:Finnish, fiu-vro:Voro, fo:Faroese, frp:Arpitan, frr:North Frisian, fur:Friulian, fy:Frisian, ga:Irish, gd:Scots Gaelic, gl:Galician, got:Gothic, gv:Manx, hr:Croatian, hsb:Upper Sorbian, hu:Hungarian, is:Icelandic, it:Italian, ka:Georgian, koi:Komi-Permyak, ksh:Ripuarian, kv:Komi, kw:Cornish, lb:Luxembourgish, lez:Lezgian, li:Limburgish, lij:Ligurian, lmo:Lombard, lt:Lithuanian, ltg:Latgalian, lv:Latvian, mdf:Moksha, mhr:Eastern Mari, mk:Macedonian, mo:Moldavian, mt:Maltese, mwl:Mirandese, nap:Neapolitan, nds:Low Saxon, nds-nl:Dutch Low Saxon, nn:Nynorsk, no:Norwegian, nrm:Norman, oc:Occitan, pcd:Picard, pl:Polish, pms:Piedmontese, pnt:Pontic, rm:Romansh, rmy:Romani, ro:Romanian, roa-rup:Aromanian, roa-tara:Tarantino, rue:Rusyn, sc:Sardinian, scn:Sicilian, sco:Scots, se:Northern Sami, sh:Serbo-Croatian, sk:Slovak, sl:Slovene, sq:Albanian, sr:Serbian, stq:Saterland Frisian, sv:Swedish, szl:Silesian, uk:Ukrainian, vec:Venetian, vep:Vepsian, vls:West Flemish, wa:Walloon, xal:Kalmyk, zea:Zealandic
perc mobile: 25.9%

regions: Europe/Asia
languages:crh:Crimean Tatar, ru:Russian, tr:Turkish
perc mobile: 20.6%

regions: Europe/North-America/Oceania/Asia/Africa
languages:en:English, simple:Simple English
perc mobile: 31.5%

regions: Europe/North-America/South-America/Asia/Africa
perc mobile: 31.9%

regions: Europe/North-America/South-America/Asia/Africa/Oceania
perc mobile: 28.0%

regions: Europe/South-America
perc mobile: 27.4%

regions: Europe/South-America/Africa/Asia
perc mobile: 25.0%

region: North-America
languages:cho:Choctaw, chr:Cherokee, chy:Cheyenne, cr:Cree, ht:Haitian, ik:Inupiak, iu:Inuktitut, kl:Greenlandic, mus:Muskogee, nah:Nahuatl, nv:Navajo, pdc:Pennsylvania German
perc mobile: 14.6%

region: Oceania
languages:bi:Bislama, ch:Chamorro, fj:Fijian, haw:Hawai’ian, hif:Fiji Hindi, ho:Hiri Motu, mh:Marshallese, mi:Maori, na:Nauruan, pih:Norfolk, sm:Samoan, to:Tongan, ty:Tahitian
perc mobile: 16.6%

region: South-America
languages:ay:Aymara, gn:Guarani, pap:Papiamentu, qu:Quechua, srn:Sranan
perc mobile: 13.8%

region: World
languages:la:Latin, yi:Yiddish
perc mobile: 11.5%


Wiki Loves Monuments 2013


Here are some charts on the breakdown by country of contributions and contributors to Wiki Loves Monuments 2013. Better late than never. I meant to publish this together with retention stats, but those are still in the pipeline, and may …


Reassessment of active editors

Yesterday I discovered a bug in wikistats which affects our editor counts for the last 2 years.

Wikistats flags users as ‘anonymous’ based on pattern recognition, rather than relying on the <ip> tag. Reason: in early years many anonymous users whose address followed a pattern other than just 4 numeric triplets ended up in the <username> tag. To my dismay I realized yesterday that this recognition code was never adapted for ipv6 addresses. Hence those anonymous ipv6 addresses were counted as normal registered users in many reports. Especially for the last 12 months this visibly affected our totals for active editors (5+ edits a month), but hardly those for very active editors (100+ edits a month).
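For illustration, here is one way to recognize both ipv4 and ipv6 addresses in a user name field. This is a sketch in Python using the standard `ipaddress` module, not the actual Wikistats code; the function name is hypothetical, and it does not cover the other legacy patterns mentioned above.

```python
import ipaddress


def looks_anonymous(username: str) -> bool:
    """Return True if the user name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username.strip())
    except ValueError:
        return False  # not an IP address, so presumably a registered name
    return True
```

A check like this treats `192.0.2.17` and `2001:DB8:0:0:0:0:0:1` alike as anonymous, while a name such as `ExampleUser` is left alone.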

Today I fixed this for our report on total unique (aka deduplicated) registered users for all Wikimedia wikis combined. The chart below shows how much the counts were lowered because of this. Other reports will be fixed after June’s dump processing cycle.

My sincere apologies for any confusion or inconvenience caused by this.










Update: Here is a second chart which shows the effect on our active editors in absolute terms. For very active editors the difference is negligible and cannot be shown in such a plot.


(for comparison here is the old version of the report)


Portal can now be searched

The Wikimedia stats portal now features more tools and reports than ever (57 and growing). An often heard complaint was that the portal was a bit overwhelming and hard to navigate.

Two changes will hopefully help you find what you need with more ease. First, all entries are now in one huge list, with no artificial breakdown between internal and external tools. By itself this list may be even more daunting in size, but the new search feature aims to address just that.

You can now filter entries by keywords. Descriptions and search tags will be scanned. The search then returns a table of contents, followed by the qualifying full entries.
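The filtering just described could look roughly like this. It is a sketch, not the portal's code: the `Entry` fields, the function names, and the choice to require every keyword to match (rather than any) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    title: str
    description: str
    tags: List[str] = field(default_factory=list)


def search(entries: List[Entry], query: str) -> List[Entry]:
    """Keep entries whose description or tags contain every keyword."""
    keywords = query.lower().split()
    hits = []
    for entry in entries:
        # Scan the description and the search tags, case-insensitively.
        haystack = " ".join([entry.description, *entry.tags]).lower()
        if all(keyword in haystack for keyword in keywords):
            hits.append(entry)
    return hits


def render(hits: List[Entry]) -> str:
    """Table of contents first, then the full qualifying entries."""
    toc = [f"- {entry.title}" for entry in hits]
    bodies = [f"{entry.title}\n{entry.description}" for entry in hits]
    return "\n".join(toc + [""] + bodies)
```

Matching on substrings keeps the sketch short; it also means a keyword like `edit` would match `editors`, which may or may not be what a real search should do.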

Like before, each entry briefly describes a few highlights of the tool, and features a rather small screenshot. This screenshot is not meant to explain the tool or report in detail (it may even be hard to read). Its function is twofold: primarily it can help you find again a report which you used earlier, and which you may still recognize from its visual appearance. It also lets you see at a glance what type of output to expect, e.g. tables vs charts.



* Primary objective was to make the current portal easier to use with limited coding effort, short payback time. Any more substantial overhaul is not ruled out, but currently not on the agenda of the Wikimedia Analytics Team.
* Any feedback is of course welcome: suggestions for functional improvement, for entries to add, for keywords to add, for fixing minor layout quirks.
* Current focus is on publicly accessible tools and reports. None of the entries leads to a page which requires log in.
* You’ll find an entry for Wikipedia visualizations, but those can’t be searched individually (yet).
* Even some defunct reports are listed (but clearly marked as such), partly because some of these are dearly missed and can serve as inspiration for future replacements.

Posted in Wikistats Reports | Leave a comment

Full archive dumps are being processed again, first since 2010

There is no Wikistats issue about which I received more mail than this: since 2010 some metrics on article content were no longer updated: articles above 200 chars, mean size in bytes, percentage above 0.5 or 2 Kb, database size, word count, images and links (internal, interwiki, external). Word count in particular was often mentioned.








Example: Polish Wikipedia

All these metrics need to be collected from the ‘full archive dumps’, the dumps which contain the full raw content of every revision of every page. The sheer amount of data that needs to be processed made it no longer feasible to process those full dumps on a monthly basis. It didn’t help that I do rather ambitious cleaning up of the raw page content before counts are generated (e.g. for word count to approach ‘readable body text’).

So in 2010 for most Wikipedias I switched to processing stub dumps, which contain all meta data for every revision, but not the raw page content. For sister projects with much smaller dumps I continued processing full archive dumps.

Now finally I can announce that I applied a fix which makes it possible to update those missing metrics roughly on a quarterly cycle. Full archive dumps are now processed on a different server, running as a continuous low-priority job, and the reporting process combines metrics from both servers.

In the last two weeks some 260 wikis were processed. Only 10 large wikis remain to be done: Arabic, English, French, German, Hebrew, Italian, Japanese, Spanish, Swedish, Russian. I expect that in a month’s time all but English will be ready. English may arrive (fingers crossed) a month later.





Wikimedia editor trends broken down by project

For a few years now we have presented monthly deduplicated totals for active and very active editors. Deduplicated means: every editor counts only once, regardless of the number of wikis edited. We never collected similar trends on a per-project basis. So to make up for this, last week I ran some special iterations of Wikistats to collect active editor trends per project.
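To make ‘deduplicated’ concrete, here is a small sketch. It is not how Wikistats is implemented: the function name and data layout are hypothetical, and it assumes both that the same user name on different wikis is the same person and that the 5-edit threshold applies to an editor's edits summed over all wikis.

```python
from collections import Counter


def deduplicated_active_editors(edits_per_wiki, threshold=5):
    """Return the set of editors with at least `threshold` edits in a month.

    edits_per_wiki maps wiki name -> {user name: edit count} for one month.
    Each editor appears once in the result, however many wikis they edited.
    """
    totals = Counter()
    for per_user in edits_per_wiki.values():
        totals.update(per_user)  # adds this wiki's counts to the running totals
    return {user for user, count in totals.items() if count >= threshold}


month = {
    "enwiki": {"Alice": 3, "Bob": 6},
    "commonswiki": {"Alice": 2, "Bob": 1, "Carol": 1},
}
active = deduplicated_active_editors(month)
```

In this made-up month Alice and Bob each count once as active editors, even though both edited two wikis, and Carol stays below the threshold.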

I want to share with you four charts, as they were presented at today’s Metrics Meeting. There will be a follow-up study, but here are a few quick observations:

1) First chart is the big picture,

  • where English editor community is still somewhat shrinking (but most of that happened earlier)
  • where all non-English Wikipedias combined are fairly stable
  • where non-Wikipedias combined show significant growth especially in 2013

2) Second chart focuses on the two largest non-Wikipedia projects: Commons and the new project Wikidata (together these make up most of the orange line in the first chart).

Note how the large peaks in Commons editorship in September are the result of the hugely successful Wiki Loves Monuments contests.

3) Third chart shows smaller Wikimedia projects which are stable or growing

4) Fourth chart shows smaller projects which are slightly or significantly shrinking

Thanks to Dario Taraborelli for inquiring about these metrics. He and I will look into this further, possibly checking correlation with page view trends.







Total editors on Wikipedia compared with same on all Wikimedia wikis


Posted in Nice Charts, Wikimedia Edit(or)s | 4 Comments