At Wikimania Gdansk I talked about the newest addition to Wikistats: article revert trends.
This week the new set of tables and charts was released for almost 800 Wikimedia wikis. For example, here is the report for the English Wikipedia.
In recent years there has been much talk within the Wikimedia community about edit reverts. Is there a growing tendency to revert (undo) new contributions, especially edits by anonymous users? If so, does this discourage new users from contributing? How large is the share of reverts anyway? Is an initiative like Flagged Revisions (FlagRevs), which seeks to shield the general public from deliberate but mostly short-lived lapses in article quality, effective? Is it also non-disruptive? The latter questions are also relevant to the current Pending Changes initiative, which was inspired by FlagRevs.
Particularly on the German Wikipedia, where FlagRevs first took off, there is a set of relevant statistics already, notably these trend charts by user ParaDox. Also some months ago Felipe Ortega was asked by the German Chapter to look at the effect of FlagRevs. Felipe also presented at Wikimania Gdansk. Although there are differences in our approaches, and our research only partially overlapped, I am glad our conclusions are roughly the same (see our presentations).
As always, I looked for an approach that is language independent, and therefore applicable to all our wikis, yet yields solid results. In this case the crucial factor was how to recognize reverts in our archives (XML dumps).
One way is to look for keywords in the edit comments that correlate well with reverts. The advantage is that it can detect incomplete reverts, where a user manually undoes most of the last changes but does not quite restore the previous state of the article. The drawbacks are that many false positives are inevitable, and of course it does not scale well to 280 languages.
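To make the keyword approach concrete, here is a minimal sketch in Python. The keyword list and the function name `looks_like_revert` are assumptions made up for illustration, not the list used for Wikistats. A real list would have to be tuned per wiki.

```python
import re

# Hypothetical keyword list for an English-language wiki (illustrative only).
REVERT_PATTERN = re.compile(
    r"\b(revert(s|ed|ing)?|rv|rvv|undid|undo)\b",
    re.IGNORECASE,
)

def looks_like_revert(comment):
    """Return True if the edit comment contains a revert-like keyword."""
    return bool(REVERT_PATTERN.search(comment or ""))

comments = [
    "Reverted edits by 192.0.2.1 to last version",  # a genuine revert
    "rv vandalism",                                 # a genuine revert
    "Added section on the 2009 revert war",         # false positive
    "Fixed typo in infobox",                        # not a revert
]
for comment in comments:
    print(looks_like_revert(comment), comment)
```

The third comment shows why false positives are hard to avoid: the keyword appears, but the edit itself restores nothing.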
There is a way to detect complete reverts efficiently: calculate a checksum for each revision, and compare it with the checksums of earlier revisions of the same article. With the MD5 checksum algorithm it is all but certain that equal checksums imply equal input, byte for byte, in this case the full revision content. Another important advantage: it is completely irrelevant what language or encoding is used in any wiki. For a few large wikis I tried both approaches: a wiki-specific set of keywords in comments, and MD5 checksums. However, the tables and charts in the new reports are almost solely based on the more reliable MD5 approach.
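The checksum idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual Wikistats code. The function name, the input format (a chronological list of revision id and text pairs for one article) and the rule that an exact repeat of the immediately preceding revision does not count as a revert are all assumptions.

```python
import hashlib

def find_complete_reverts(revisions):
    """Detect complete reverts in one article's history.

    revisions: list of (rev_id, text) tuples in chronological order.
    Returns a list of (reverting_rev_id, restored_rev_id, reverted_rev_ids).
    """
    last_seen = {}   # checksum -> index of the latest revision with that content
    rev_ids = []     # revision ids in chronological order
    reverts = []
    for index, (rev_id, text) in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        earlier = last_seen.get(digest)
        # Assumption: a repeat of the directly preceding revision undoes
        # nothing, so only matches further back count as reverts.
        if earlier is not None and earlier < index - 1:
            reverted = rev_ids[earlier + 1:index]
            reverts.append((rev_id, rev_ids[earlier], reverted))
        last_seen[digest] = index
        rev_ids.append(rev_id)
    return reverts

history = [
    (101, "Gdansk is a city in Poland."),
    (102, "Gdansk is a city in Poland. LOL"),
    (103, "Gdansk is a city in Poland."),
    (104, "Gdansk is a city on the Baltic coast of Poland."),
]
print(find_complete_reverts(history))
```

In this sample history, revision 103 restores the content of revision 101, so revision 102 is reported as reverted. Only checksums are compared, so the same code works regardless of the language of the wiki.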
Here is a summary of the tables and charts that together comprise the new report. Tables: rankings of the most reverted users, the most reverting users, and the most reverted articles. Charts: edit trends (some published earlier) and revert trends. Revert trends come in two variations: revert ratio per class of editor (registered editor, anonymous editor, or bot) and a breakdown of reverted anonymous edits by class of reverting editor. All charts come in two variations: one shows the raw numbers, the other shows trends (raw data minus seasonal patterns and random variations).
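As an illustration of what "raw data minus seasonal patterns and random variations" can mean, here is a small Python sketch that smooths a monthly series with a centered 12-month moving average. This is a common textbook technique and an assumption on my part. It is not necessarily the method used to produce the charts in the reports, and the function name and sample numbers are invented.

```python
def centered_moving_average(values, window=12):
    """Smooth a monthly series with a centered moving average.

    window must be even. The two end points of each span get half weight,
    so every calendar month contributes equally. Months too close to either
    end of the series get None.
    """
    half = window // 2
    trend = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            trend.append(None)
            continue
        total = (
            0.5 * values[i - half]
            + sum(values[i - half + 1:i + half])
            + 0.5 * values[i + half]
        )
        trend.append(total / window)
    return trend

# Invented sample: a flat level of 100 plus a repeating yearly pattern.
seasonal = [5, -5, 3, -3, 8, -8, 2, -2, 6, -6, 4, -4]
monthly = [100 + seasonal[month % 12] for month in range(36)]
print(centered_moving_average(monthly)[6:10])
```

Because the seasonal pattern repeats every 12 months and sums to zero, the smoothed values in the middle of the series all come out at the underlying level of 100.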
Surely there is ample risk of information overload. But I am convinced that for many of our wikis some people will be motivated to dig into these stats and come up with a context and some explanations, a story which will have wider implications beyond that particular wiki.
Hopefully the new stats can bring some solid numbers to discussions about our revert policies. Of course these numbers are influenced by methodology, are selective (meaning they don’t address all issues), and probably raise some new questions. The data files generated in this project can be reused in further research, e.g. to determine which share of bad edits is detected and undone before an article revision is released to the public.
Disclaimer: complex social phenomena will never be explained by quantitative data alone, and they seldom yield to what-if questions, nor do they lend themselves easily to double-blind experiments.
See also the slides for my Wikimania Gdansk presentation.
Note: in the new reports some language links are missing. Until this is fixed, please use this table to access all reports.
Update: anyone interested in studying revert patterns is advised to look into the work of Palo Alto Research Center scientist Ed Chi and his colleagues. Ed has been studying Wikipedia and its growth patterns for many years now. Vandalism and reverts were part of that research. Several of his many publications focus on Wikipedia. Two recent ones relevant to this topic are The Singularity is Not Near: Slowing Growth of Wikipedia (2009) and What’s in Wikipedia? Mapping Topics and Conflict Using Socially Annotated Category Structure (2009).