PARC study on Wikipedia’s topical distribution

Today I learned through Wikipedia’s Signpost of a new study published by PARC researchers titled “What’s in Wikipedia? Mapping Topics and Conflict Using Socially Annotated Category Structure”. Signpost: “Based on the categories assigned to each article, coverage is sorted according to eleven broad categories used in Wikipedia’s Categorical index”. “To explain their methodology of sorting content, the researchers use the example of Albert Einstein. In January 2008 (the last available full data dump of English Wikipedia), the Einstein article had 26 categories. Each category can be broken down according to its proportional relevance to the 11 top-level categories based on the shortest paths through the category system to the different top-level categories.”

An excellent idea. How could I say otherwise? I proposed almost the same study (the topical distribution part, not the analysis of conflicts) as part of a (thankfully) much longer list sent to WMF staff in March 2008 regarding research on Wikipedia. I quote from that proposal: “Analyse topical distribution of Wikipedia […] We all feel that en: Wikipedia is less geeky in its content than 4 years ago. But that is a very rough statement. More granular observations would be neat. There is ample room for study here. Is balance in topical content still shifting? In what direction? Does breadth of coverage of fine arts follow the same trend as for physics or geometry, only 4 years behind? Is topical distribution quite different from one Wikipedia to another? […] Build a more refined internal representation of category hierarchy. Many articles belong to several categories and are therefore featured in several branches of the ‘full tree of knowledge’. Perhaps importance of categories can be weighed: […] An earlier idea of mine was to somehow weigh the importance of categories by their place in the tree. Say a topic is assigned a category which features on level 2 in the tree (e.g. ‘British artists’) and a category which features on level 5 (‘Councilmen and Aldermen of the City of London’) then that might be used to weigh the importance of the categories for the article at hand.”

Actually reading the study, I see that the Signpost’s synopsis is not exactly right. PARC does not trace “the shortest paths through the category system [emphasis mine] to the different top-level categories.” From the study: “In the above approach, we determine semantic relatedness for Wikipedia category nodes through link distance metrics. The simplest path-based method of calculating semantic relatedness is edge counting [8], in which semantic distance is the length of the shortest path between two nodes. Other measures include normalizing the distance by taxonomy depth, or additionally including the depth of the least common subsumer of the two nodes [11]. After investigating the performance of more complex metrics (such as normalizing by taxonomy depth), we found little substantial difference in the results [emphasis mine] when applied to Wikipedia data. In the following analyses we use the simple shortest-path metric as our distance measure, which is also consistent with prior work [10].”
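To make the edge-counting metric from the quote concrete, here is a minimal sketch in Python. It is not PARC's code. The toy category graph, the function names shortest_distances and top_level_shares, and above all the rule for turning distances into shares (the nearest top-level categories split the weight equally) are my own assumptions for illustration; the passages quoted here do not spell out the study's exact weighting.

```python
from collections import deque

def shortest_distances(parents, start):
    """Edge-count distance from `start` to every category reachable via parent links (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in dist:  # first visit in BFS is the shortest path
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist

def top_level_shares(parents, category, top_levels):
    """Assumed rule: the closest top-level categories share the weight equally."""
    dist = shortest_distances(parents, category)
    reachable = {t: dist[t] for t in top_levels if t in dist}
    if not reachable:
        return {}
    nearest = min(reachable.values())
    winners = [t for t, d in reachable.items() if d == nearest]
    return {t: 1 / len(winners) for t in winners}

# Hypothetical toy graph: each category maps to its parent categories.
parents = {
    "Theoretical physicists": ["Physicists"],
    "Physicists": ["Physics", "Scientists"],
    "Physics": ["Natural sciences"],
    "Scientists": ["People"],
}
top_levels = ["Natural sciences", "People"]

print(shortest_distances(parents, "Theoretical physicists"))
print(top_level_shares(parents, "Theoretical physicists", top_levels))
print(top_level_shares(parents, "Physics", top_levels))
```

In this toy graph “Theoretical physicists” sits three edges away from both top-level categories, so it is split evenly between them, while “Physics” is attributed entirely to “Natural sciences”. Normalizing by taxonomy depth, the alternative the study mentions, would replace the raw edge count with a depth-adjusted distance.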

For now I have to take PARC’s word for it that weighing categories via the link mesh is comparable to analysing the category taxonomy at greater depth. Hopefully, when time permits, I can still follow the taxonomy approach some day so that we can compare results.

For now I congratulate Aniket Kittur, Ed H. Chi, and Bongwon Suh on a fine study, one which sheds light on one of the most intriguing questions to be asked about Wikipedia: “To what extent has Wikipedia evolved from a nerd manual towards an evenly balanced ontology of human knowledge?”

Lastly, it would be interesting to see the topical distribution one or more levels deeper. The study shows that the category “Culture and the Arts” has gained the most prominence in recent years: +210% from July 2006 to January 2008, and it now makes up 30% of the contents of the English Wikipedia. Many people would be interested to learn how much of that is for action figures and card collectibles and how much for the fine arts.

  Sorry for the mistake. Even after re-reading the relevant passage you highlight, I still don't understand what I got wrong. Are you saying that they used the shortest paths through Wikipedia articles and other pages, and not just the category system? That seems compatible with what the paper says, but I thought it was implicit that they were tracing links within the category system.

    Anyhow, feel free to correct the Signpost.

