Full archive dumps are being processed again, for the first time since 2010

There is no Wikistats issue about which I received more mail than this one: since 2010 some metrics on article content were no longer updated. These are word count, articles above 200 chars, mean size in bytes, percentage above 0.5 or 2 Kb, database size, images and links (internal, interwiki, external). Word count in particular was often mentioned.

Table (example: Polish Wikipedia)

All these metrics need to be collected from the ‘full archive dumps’, the dumps which contain the full raw content of every revision of every page. The sheer amount of data that needs to be processed made it no longer feasible to process those full dumps on a monthly basis. It didn’t help that I do rather ambitious cleaning up of the raw page content before counts are generated (e.g. for word count to approach ‘readable body text’).

So in 2010 I switched to processing stub dumps for most Wikipedias. These contain all metadata for every revision, but not the raw page content. For sister projects with much smaller dumps I continued processing full archive dumps.

Now I can finally announce that I applied a fix which makes it possible to update those missing metrics on a roughly quarterly cycle. Full archive dumps are now processed on a different server, running as a continuous low-priority job, and the reporting process combines metrics from both servers.

In the last two weeks some 260 wikis were processed. Only 10 large wikis remain to be done: Arabic, English, French, German, Hebrew, Italian, Japanese, Spanish, Swedish, Russian. I expect that in a month's time all but English will be ready. English may arrive (fingers crossed) a month later.

Posted in uncategorized | 2 Comments

Wikimedia editor trends broken down by project

For a few years now we have presented monthly deduplicated totals for active and very active editors. Deduplicated means that every editor counts only once, regardless of the number of wikis edited. We never collected similar trends on a per-project basis. So to make up for this, last week I ran some special iterations of Wikistats to collect active editor trends per project.
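To illustrate what ‘deduplicated’ means here, a minimal Python sketch follows. It is not the Wikistats implementation: the wiki names, the user names and the idea that each wiki yields a ready-made set of active editors are all assumptions made for the example.

```python
# Hypothetical sample data: for one month, the set of active editors per wiki.
# Assumes user names are global, so the same name on two wikis is one person.
active_by_wiki = {
    "enwiki": {"Alice", "Bob", "Carol"},
    "commonswiki": {"Alice", "Dave"},
    "wikidatawiki": {"Alice", "Bob", "Erin"},
}


def naive_total(active_by_wiki):
    """Sum of per-wiki counts: an editor active on three wikis counts three times."""
    return sum(len(users) for users in active_by_wiki.values())


def deduplicated_total(active_by_wiki):
    """Union of per-wiki sets: every editor counts once, however many wikis were edited."""
    return len(set().union(*active_by_wiki.values()))


print("naive total:       ", naive_total(active_by_wiki))
print("deduplicated total:", deduplicated_total(active_by_wiki))
```

The per-project trends in the charts correspond to the individual sets, while the overall total corresponds to the union. That is why the per-project numbers do not simply add up to the deduplicated total.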

I want to share with you four charts, as they were presented at today’s Metrics Meeting. There will be a follow-up study, but here are a few quick observations:

1) The first chart is the big picture,

  • where the English Wikipedia editor community is still shrinking somewhat (but most of that happened earlier)
  • where all non-English Wikipedias combined are fairly stable
  • where non-Wikipedias combined show significant growth, especially in 2013

2) The second chart focuses on the two largest non-Wikipedia projects: Commons and the new project Wikidata (together these make up most of the orange line in the first chart).

Note how the large peaks in Commons editorship in September are the result of the hugely successful Wiki Loves Monuments contests.

3) The third chart shows smaller Wikimedia projects which are stable or growing.

4) The fourth chart shows smaller projects which are slightly or significantly shrinking.

Thanks to Dario Taraborelli for inquiring about these metrics. He and I will look into this further, possibly checking correlation with page view trends.

Chart 1: UniqueActiveEditorsOnAllWikimediaWikis
Chart 2: UniqueActiveEditorsOnLargestNonWikipedias
Chart 3: UniqueActiveEditorsOnSmallProjects-Growth
Chart 4: UniqueActiveEditorsOnSmallProjects-Decline

Posted in uncategorized | 4 Comments

Total editors on Wikipedia compared with same on all Wikimedia wikis

TAEWikipedia_share_TAE

Posted in Nice Charts, Wikimedia Edit(or)s | 3 Comments

New Edit and Revert Stats

Finally, by popular request, edit/revert stats will now be generated each month for all Wikimedia projects. For the largest Wikipedias some new charts are already online: English, Japanese, German, Spanish, Russian, French, Dutch. Other wikis will follow in the coming weeks.

For each wiki there will be a separate page with charts and tables. Charts come in two variations: raw data and trend lines (this may seem overkill, but on some wikis one variation is more readable than the other, depending on line patterns). Tables tell you what kind of content is reverted most, for which users, and by which users and bots.

Some examples:

Edits and reverts on English Wikipedia
The sharp peak in bot edits in 2013 is caused by the migration of interwiki links to Wikidata. You will see in the coming months that on many wikis bot edit counts will decline to far below the Dec 2012 level, as most interwiki bots stopped working.
Revert ratio on Dutch Wikipedia

On many wikis there is a very distinct seasonal pattern in the revert ratio. Here you see how on the Dutch Wikipedia far fewer anonymous edits are reverted in summer, and slightly fewer around Christmas. Most probably there is less vandalism in those periods, as schools are closed. Perhaps there are also fewer edit patrollers on duty during vacations, and more bad edits slip through?
Revert ratio on Spanish Wikipedia

On the Spanish Wikipedia the dip in revert ratio occurs every half year. The same goes for the Portuguese Wikipedia (new charts not yet available). There is a simple explanation: both wikis are edited intensely in both the northern and the southern hemisphere, which have different holiday seasons.
Reverts by actor and acted upon
Breakdown of reverts on English Wikipedia
Most active reverters

Who reverts most? Who and what is reverted most? Here for the English Wikipedia.
Posted in uncategorized | 2 Comments

Wiki Loves Monuments 2010-2012 – Retention / Stats per Country

A few weeks ago, at the Amsterdam Hackathon, Lodewijk Gelauff (aka Effeietsanders) asked me to look into the Wiki Loves Monuments (WLM) data and provide a list of images, who uploaded them, when, and for which country. Primary goal: to provide accurate data for the submission of the 2012 contest as a Guinness World Record (the record is currently held by WLM 2011). The 2012 contest was indeed a global event (WLM scope in 2010: the Netherlands, 2011: Europe, 2012: Earth). Once these data were extracted, an update on retention rates and a breakdown per country for images and contributors would also be nice.

Retention rates

With cleaner data than half a year ago, we reran the editor retention numbers.

In Sep 2011, 3,497 new users contributed to WLM. Out of these, 397 still contributed in Nov 2011 or later, on any Wikimedia wiki, which gives an awesome retention rate for WLM 2011 of 11.4%!!

In Sep 2012, 10,825 new users contributed to WLM, and in Oct another 412 (WLM Israel ended later). Out of these, 635 still contributed in Nov 2012 or later, on any Wikimedia wiki, which gives a retention rate for WLM 2012 of 5.7%. (It will grow, as did the number for 2011.)
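The arithmetic behind both percentages can be checked with a few lines of Python. This is only a sketch of the division: the function name is made up for the example, and the counts are the ones quoted above.

```python
def retention_rate(new_users, retained_users):
    """Percentage of new contest contributors who were still editing later."""
    if new_users <= 0:
        raise ValueError("new_users must be positive")
    return 100.0 * retained_users / new_users


# WLM 2011: 3,497 new users in Sep 2011, 397 still active in Nov 2011 or later.
wlm_2011 = retention_rate(3497, 397)

# WLM 2012: 10,825 new users in Sep plus 412 in Oct, 635 still active in Nov 2012 or later.
wlm_2012 = retention_rate(10825 + 412, 635)

print(f"WLM 2011: {wlm_2011:.1f}%")  # 11.4%
print(f"WLM 2012: {wlm_2012:.1f}%")  # 5.7%
```

Note that the 2012 denominator is the sum of the September and October newcomers (11,237).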

Retention_Rate_WLM_2011
Retention_Rate_WLM_2012

 

Breakdown by country

Here are four charts that show the breakdown per country of contributors and uploads, either for 2012 alone or in comparison with earlier years (see also the notes below).

WLM Contributors by Country, Comparison
WLM_Contributors_by_Country_2010-2012
WLM Uploads by Country, Comparison
WLM_Uploads_by_Country_2010-2012
WLM Uploads by Country, Cumulative
WLM_Uploads_by_Country_2010-2012_Cumulative
WLM Uploads by Country 2012
WLM_Uploads_by_Country_2012


Notes:
1) Country is about the monument, not the uploader. A person may have contributed images for several countries.
2) Please do not read too much into the chart showing cumulative uploads for the three WLM contests, and particularly not into %completed. Some countries may have more monuments than others, or have them more easily accessible (travel time). Some countries have a more active Wikimedia community than others. Some countries have fast internet, in which case uploading multiple images per monument is less of a burden.


Every WLM image in Wikimedia Commons is tagged with a template. This template specifies year and country. In a wiki environment there is always some data normalization to be done, as most of the data was added automatically but some image sets were hand-coded (or uploaded with custom scripts). So a couple of variations occurred. For instance: why specify country code ‘de’ if you can also specify ‘{{lc:DE}}’? :-) So it took a few iterations to detect and fix the anomalies.
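As an illustration of this kind of clean-up, here is a small Python sketch that maps such variations onto one canonical country code. It is not the script that was actually used. The function name and the set of variations handled (extra whitespace, upper case, and the MediaWiki `{{lc:…}}` / `{{uc:…}}` parser functions) are assumptions, and the real data contained whatever anomalies turned up during those iterations.

```python
import re

# Matches a value wrapped in the MediaWiki lc:/uc: parser functions, e.g. {{lc:DE}}.
_CASE_FUNCTION = re.compile(
    r"^\{\{\s*(lc|uc)\s*:\s*([^{}|]*?)\s*\}\}$", re.IGNORECASE
)


def normalize_country_code(raw):
    """Return a lower-case country code for the raw template parameter value."""
    value = raw.strip()
    match = _CASE_FUNCTION.match(value)
    if match:
        # Keep only the argument of the parser function.
        value = match.group(2)
    # Assumption: canonical codes are lower case, as in the 'de' example.
    return value.strip().lower()


for sample in ["de", "DE", " de ", "{{lc:DE}}", "{{uc:de}}"]:
    print(repr(sample), "->", normalize_country_code(sample))
```

All five sample values end up as ‘de’, so the images can be grouped under one country.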


There are two raw data files: wlm_images_contributors.zip and wlm_retention.zip (sparingly documented). Let me know if you have questions.

Posted in Nice Charts, Wikimedia Edit(or)s | 1 Comment