JWPL

Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.

JWPL (Java Wikipedia Library) is an open-source, Java-based application programming interface that provides access to all information contained in Wikipedia. This high-performance Wikipedia API offers structured access to information such as redirects, categories, articles, and the link structure. It is described in our LREC 2008 paper.

JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone on other texts written in MediaWiki markup.
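
To give a rough idea of what analyzing MediaWiki markup involves, the following minimal sketch extracts the targets of internal links from a piece of wikitext. It is not the JWPL parser and does not use its API: the class `WikiLinkSketch` and the method `linkTargets` are hypothetical names chosen for this illustration, and the regular expression only handles the simple forms `[[Target]]` and `[[Target|label]]`, without templates, nested links, or namespace handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not the JWPL MediaWiki parser. */
final class WikiLinkSketch {

    // Matches [[Target]] and [[Target|label]]; group 1 captures the link target.
    private static final Pattern INTERNAL_LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|[^\\[\\]]*)?\\]\\]");

    /** Returns the targets of all simple internal links, in order of appearance. */
    static List<String> linkTargets(String wikitext) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = INTERNAL_LINK.matcher(wikitext);
        while (matcher.find()) {
            targets.add(matcher.group(1).trim());
        }
        return targets;
    }

    public static void main(String[] args) {
        String text = "'''JWPL''' accesses [[Wikipedia]] via "
                + "[[Java (programming language)|Java]].";
        // Prints: [Wikipedia, Java (programming language)]
        System.out.println(linkTargets(text));
    }
}
```

A full markup parser also has to deal with templates, tables, and nested structures, which a single regular expression cannot cover.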

JWPL also contains the JWPLDataMachine tool, which creates JWPL dumps from the publicly available dumps at download.wikimedia.org.

In addition, JWPL now contains the Wikipedia Revision Toolkit, which consists of two tools: the TimeMachine and the RevisionMachine. The TimeMachine can reconstruct a snapshot of Wikipedia as of a specific date, or create multiple snapshots across a time span. The RevisionMachine offers efficient access to the edit history of Wikipedia articles and stores the revisions in a dedicated format that reduces the required storage space by 98%. The toolkit is described in our ACL system demonstration paper.
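
This page does not specify how the dedicated storage format works. As a rough illustration of the two ideas, the sketch below assumes a simple delta encoding: each revision stores only the text that differs from its predecessor (after stripping the shared prefix and suffix), and a snapshot for a given date is rebuilt by replaying the deltas up to that date. The class `RevisionHistorySketch` and all of its members are hypothetical and are not part of the Wikipedia Revision Toolkit.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration only; not the RevisionMachine format or API. */
final class RevisionHistorySketch {

    /** One revision, stored as a delta against the previous revision. */
    private static final class Delta {
        final Instant timestamp;
        final int prefixLen;  // characters shared with the start of the predecessor
        final int suffixLen;  // characters shared with the end of the predecessor
        final String middle;  // the only text that is actually stored

        Delta(Instant timestamp, int prefixLen, int suffixLen, String middle) {
            this.timestamp = timestamp;
            this.prefixLen = prefixLen;
            this.suffixLen = suffixLen;
            this.middle = middle;
        }
    }

    private final List<Delta> deltas = new ArrayList<>();
    private String latest = "";

    /** Appends a revision; revisions are assumed to arrive in chronological order. */
    void add(Instant timestamp, String text) {
        int maxShared = Math.min(latest.length(), text.length());
        int prefix = 0;
        while (prefix < maxShared && latest.charAt(prefix) == text.charAt(prefix)) {
            prefix++;
        }
        int suffix = 0;
        while (suffix < maxShared - prefix
                && latest.charAt(latest.length() - 1 - suffix)
                        == text.charAt(text.length() - 1 - suffix)) {
            suffix++;
        }
        String middle = text.substring(prefix, text.length() - suffix);
        deltas.add(new Delta(timestamp, prefix, suffix, middle));
        latest = text;
    }

    /** Rebuilds the text as of the given instant; empty if no revision existed yet. */
    Optional<String> snapshotAt(Instant when) {
        String text = null;
        String previous = "";
        for (Delta delta : deltas) {
            if (delta.timestamp.isAfter(when)) {
                break;
            }
            text = previous.substring(0, delta.prefixLen)
                    + delta.middle
                    + previous.substring(previous.length() - delta.suffixLen);
            previous = text;
        }
        return Optional.ofNullable(text);
    }

    /** Total number of characters held in the stored deltas. */
    int storedChars() {
        int total = 0;
        for (Delta delta : deltas) {
            total += delta.middle.length();
        }
        return total;
    }

    public static void main(String[] args) {
        RevisionHistorySketch history = new RevisionHistorySketch();
        history.add(Instant.parse("2010-01-01T00:00:00Z"), "JWPL is a library.");
        history.add(Instant.parse("2011-01-01T00:00:00Z"), "JWPL is a Java library.");
        System.out.println(history.snapshotAt(Instant.parse("2010-06-01T00:00:00Z"))
                .orElse("(no revision yet)"));
        System.out.println(history.storedChars() + " characters stored");
    }
}
```

Because consecutive revisions of an article usually differ only slightly, storing differences instead of full texts is one plausible way to obtain large space savings; the 98% figure above refers to the toolkit itself, not to this sketch.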

Downloads

Downloads and instructions on how to add JWPL or the Wikipedia Revision Toolkit to your project as Maven dependencies can be found on the GitHub project site.

The source code is provided under the LGPL v3.

Publications

The Writing Process in Online Mass Collaboration: NLP-Supported Approaches to Analyzing Collaborative Revision and User Interaction

Author Johannes Daxenberger
Date January 2016
Kind PhD thesis
Location Darmstadt
Key TUD-CS-2015-1318
Research Areas Ubiquitous Knowledge Processing, reviewed, UKP_p_TextAsProcess, UKP_s_DKPro_TC, UKP_s_JWPL, UKP_reviewed, UKP_a_LangTech4eHum, UKP_a_TexMinAn
Abstract In the past 15 years, the rapid development of web technologies has created novel ways of collaborative editing. Open online platforms have attracted millions of users from all over the world. The open encyclopedia Wikipedia, started in 2001, has become a very prominent example of a largely successful platform for collaborative editing and knowledge creation. The wiki model has enabled collaboration at a new scale, with more than 30,000 monthly active users on the English Wikipedia. Traditional writing research deals with questions concerning revision and the writing process itself. The analysis of collaborative writing additionally raises questions about the interaction of the involved authors. Interaction takes place when authors write on the same document (indirect interaction), or when they coordinate the collaborative writing process by means of communication (direct interaction). The study of collaborative writing in online mass collaboration poses several interesting challenges. First and foremost, the writing process in open online collaboration is typically characterized by a large number of revisions from many different authors. Therefore, it is important to understand the interplay and the sequences of different revision categories. As the quality of documents produced in a collaborative writing process varies greatly, the relationship between collaborative revision and document quality is an important field of study. Furthermore, the impact of direct user interaction through background discussions on the collaborative writing process is largely unknown. In this thesis, we tackle these challenges in the context of online mass collaboration, using one of the largest collaboratively created resources, Wikipedia, as our data source. We will also discuss to which extent our conclusions are valid beyond Wikipedia. We will be dealing with three aspects of collaborative writing in Wikipedia. 
First, we carry out a content-oriented analysis of revisions in the Wikipedia revision history. This includes the segmentation of article revisions into human-interpretable edits. We develop a taxonomy of edit categories such as spelling error corrections, vandalism or information adding, and verify our taxonomy in an annotation study on a corpus of edits from the English and German Wikipedia. We use the annotated corpora as training data to create models which enable the automatic classification of edits. To show that our model is able to generalize beyond our own data, we train and test it on a second corpus of English Wikipedia revisions. We analyze the distribution of edit categories and frequent patterns in edit sequences within a larger set of article revisions. We also assess the relationship between edit categories and article quality, finding that the information content in high-quality articles tends to become more stable after their promotion and that high-quality articles show a higher degree of homogeneity with respect to frequent collaboration patterns as compared to random articles. Second, we investigate activity-based roles of users in Wikipedia and how they relate to the collaborative writing process. We automatically classify all revisions in a representative sample of Wikipedia articles and cluster users in this sample into seven intuitive roles. The roles are based on the editing behavior of the users. We find roles such as Vandals, Watchdogs, or All-round Contributors. We also analyze the stability of our discovered roles across time and analyze role transitions. The results show that although the nature of roles remains stable across time, more than half of the users in our sample changed their role between two time periods. Third, we analyze the correspondence between indirect user interaction through collaborative editing and direct user interaction through background discussion. 
We analyze direct user interaction using the notion of turns, which has been established in previous work. Turns are snippets from Wikipedia discussion pages. We introduce the notion of corresponding edit-turn-pairs. A corresponding edit-turn-pair consists of a turn and an edit from the same Wikipedia article; the turn forms an explicit performative and the edit corresponds to this performative. This happens, for example, when a user complains about a missing reference in the discussion about an article, and another user adds an appropriate reference to the article itself. We identify the distinctive properties of corresponding edit-turn-pairs and use them to create a model for the automatic detection of corresponding and non-corresponding edit-turn-pairs. We show that the percentage of corresponding edit-turn-pairs in a corpus of flawed English Wikipedia articles is typically below 5% and varies considerably across different articles. The thesis is concluded with a summary of our main contributions and findings. The growing number of collaborative platforms in commercial applications and education, e.g. in massive open online learning courses, demonstrates the need to understand the collaborative writing process and to support collaborating authors. We also discuss several open issues with respect to the questions addressed in the main parts of the thesis and point out possible directions for future work. Many of the experiments we carried out in the course of this thesis rely on supervised text classification. In the appendix, we explain the concepts and technologies underlying these experiments. We also introduce the DKPro TC framework, which was substantially extended as part of this thesis.
Website http://nbn-resolving.de/urn:nbn:de:tuda-tuprints-52259

Important Copyright Notice:

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
