JWPL

In recent years, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge it contains is required.

JWPL (Java Wikipedia Library) is an open-source, Java-based application programming interface that provides access to all the information contained in Wikipedia. This high-performance Wikipedia API offers structured access to information such as redirects, categories, articles, and the link structure. It is described in our LREC 2008 paper.
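
A minimal sketch of what such access typically looks like is shown below; the database host, schema name, and credentials are placeholders, and the class and method names follow the JWPL API but may vary slightly between versions:

    import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
    import de.tudarmstadt.ukp.wikipedia.api.Page;
    import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;
    import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;

    public class JwplAccessExample {
        public static void main(String[] args) throws Exception {
            // Connection settings for a JWPL database created with the JWPLDataMachine.
            // Host, database name, user, and password below are placeholders.
            DatabaseConfiguration dbConfig = new DatabaseConfiguration();
            dbConfig.setHost("localhost");
            dbConfig.setDatabase("wikipedia_en");
            dbConfig.setUser("user");
            dbConfig.setPassword("password");
            dbConfig.setLanguage(Language.english);

            // The Wikipedia object is the entry point to the structured data.
            Wikipedia wiki = new Wikipedia(dbConfig);

            // Retrieve an article and print its title and plain text.
            Page page = wiki.getPage("Natural language processing");
            System.out.println(page.getTitle());
            System.out.println(page.getPlainText());
        }
    }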

JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone with other texts that use MediaWiki markup.
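
A minimal stand-alone usage sketch is given below; the markup string is invented for illustration, and the parser class names follow the JWPL parser package but may vary between versions:

    import de.tudarmstadt.ukp.wikipedia.parser.ParsedPage;
    import de.tudarmstadt.ukp.wikipedia.parser.Section;
    import de.tudarmstadt.ukp.wikipedia.parser.mediawiki.MediaWikiParser;
    import de.tudarmstadt.ukp.wikipedia.parser.mediawiki.MediaWikiParserFactory;

    public class ParserExample {
        public static void main(String[] args) {
            // Any text in MediaWiki markup; in practice this could come from a
            // Wikipedia page retrieved via the JWPL API or from another source.
            String markup = "== History ==\nSome text with a [[Wikipedia|wiki link]] and ''italics''.";

            // Create a parser via the factory and parse the markup into a structured page.
            MediaWikiParserFactory factory = new MediaWikiParserFactory();
            MediaWikiParser parser = factory.createParser();
            ParsedPage parsedPage = parser.parse(markup);

            // Access the parsed structure, e.g. the plain text and the sections.
            System.out.println(parsedPage.getText());
            for (Section section : parsedPage.getSections()) {
                System.out.println("Section: " + section.getTitle());
            }
        }
    }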

Further, JWPL contains the JWPLDataMachine tool, which can be used to create JWPL dumps from the publicly available dumps at download.wikimedia.org.

In addition, JWPL now contains the Wikipedia Revision Toolkit, which consists of two tools: the TimeMachine and the RevisionMachine. The TimeMachine can be used to reconstruct a snapshot of Wikipedia from a specific date, or to create multiple snapshots from a time span. The RevisionMachine offers efficient access to the edit history of Wikipedia articles, storing the revisions in a dedicated format that reduces the required storage space by 98%. The toolkit is described in our ACL system demonstration paper.
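
The sketch below illustrates how the stored edit history could be queried; it is an assumption-laden illustration only: the connection settings, article title, and timestamp are placeholders, and the RevisionApi method names reflect our reading of the Revision Toolkit API and may differ between versions.

    import java.sql.Timestamp;

    import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
    import de.tudarmstadt.ukp.wikipedia.api.Page;
    import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;
    import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;
    import de.tudarmstadt.ukp.wikipedia.revisionmachine.api.Revision;
    import de.tudarmstadt.ukp.wikipedia.revisionmachine.api.RevisionApi;

    public class RevisionExample {
        public static void main(String[] args) throws Exception {
            // Placeholder settings for a database that also contains the revision
            // tables produced by the RevisionMachine.
            DatabaseConfiguration dbConfig = new DatabaseConfiguration();
            dbConfig.setHost("localhost");
            dbConfig.setDatabase("wikipedia_en_rev");
            dbConfig.setUser("user");
            dbConfig.setPassword("password");
            dbConfig.setLanguage(Language.english);

            // Look up the article whose edit history we want to inspect.
            Wikipedia wiki = new Wikipedia(dbConfig);
            Page page = wiki.getPage("Natural language processing");
            int articleId = page.getPageId();

            // Query the edit history through the RevisionMachine API.
            RevisionApi revisionApi = new RevisionApi(dbConfig);
            System.out.println("Revisions: " + revisionApi.getNumberOfRevisions(articleId));

            // Fetch the revision that was current at an (arbitrarily chosen) point
            // in time and print its reconstructed text.
            Timestamp timestamp = Timestamp.valueOf("2010-01-01 00:00:00");
            Revision revision = revisionApi.getRevision(articleId, timestamp);
            System.out.println(revision.getRevisionText());
        }
    }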

Downloads

Downloads and further instructions on how to add JWPL or the Wikipedia Revision Toolkit as Maven dependencies to your project can be found on the GitHub project site.

The source code is provided under the LGPL v3.

Publications

The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia

Author Oliver Ferschke
Date 2014
Kind PhD thesis
Location Darmstadt
Key TUD-CS-2014-0866
Research Areas Ubiquitous Knowledge Processing, reviewed, UKP_p_TextAsProcess, UKP_a_LangTech4eHum, UKP_s_DKPro_TC, UKP_s_JWPL, UKP_a_NLP4Wikis, UKP_a_TexMinAn
Abstract Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The main properties of this user generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous quality of traditional web content. Wikipedia is the prime example for a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm. Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source. A key prerequisite to reaching this goal is a quality management strategy that can cope both with the massive scale of Wikipedia and its open and almost anarchic nature. This includes an efficient communication platform for work coordination among the collaborators as well as techniques for monitoring quality problems across the encyclopedia. This dissertation shows how natural language processing approaches can be used to assist information quality management on a massive scale.

In the first part of this thesis, we establish the theoretical foundations for our work. We first introduce the relatively new concept of open online collaboration with a particular focus on collaborative writing and proceed with a detailed discussion of Wikipedia and its role as an encyclopedia, a community, an online collaboration platform, and a knowledge resource for language technology applications. We then proceed with the three main contributions of this thesis.

Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no quality model has yet incorporated writing quality as a central factor. Since Wikipedia is not only a repository of mere facts but rather consists of full text articles, the writing quality of these articles has to be taken into consideration when judging article quality. As the first main contribution of this thesis, we therefore define a comprehensive article quality model that aims to consolidate both the quality of writing and the quality criteria defined in multiple Wikipedia guidelines and policies into a single model. The model comprises 23 dimensions segmented into the four layers of intrinsic quality, contextual quality, writing quality and organizational quality.

As a second main contribution, we present an approach for automatically identifying quality flaws in Wikipedia articles. Even though the general idea of quality detection has been introduced in previous work, we dissect the approach to find that the task is inherently prone to a topic bias which results in unrealistically high cross-validated evaluation results that do not reflect the classifier’s real performance on real world data. We solve this problem with a novel data sampling approach based on the full article revision history that is able to avoid this bias. It furthermore allows us not only to identify flawed articles but also to find reliable counterexamples that do not exhibit the respective quality flaws. For automatically detecting quality flaws in unseen articles, we present FlawFinder, a modular system for supervised text classification. We evaluate the system on a novel corpus of Wikipedia articles with neutrality and style flaws. The results confirm the initial hypothesis that the reliable classifiers tend to exhibit a lower cross-validated performance than the biased ones but the scores more closely resemble their actual performance in the wild.

As a third main contribution, we present an approach for automatically segmenting and tagging the user contributions on article Talk pages to improve the work coordination among Wikipedians. These unstructured discussion pages are not easy to navigate and information is likely to get lost over time in the discussion archives. By automatically identifying the quality problems that have been discussed in the past and the solutions that have been proposed, we can help users to make informed decisions in the future. Our contribution in this area is threefold: (i) We describe a novel algorithm for segmenting the unstructured dialog on Wikipedia Talk pages using their revision history. In contrast to related work, which mainly relies on the rudimentary markup, this new algorithm can reliably extract meta data, such as the identity of a user, and is moreover able to handle discontinuous turns. (ii) We introduce a novel scheme for annotating the turns in article discussions with dialog act labels for capturing the coordination efforts of article improvement. The labels reflect the types of criticism discussed in a turn, for example missing information or inappropriate language, as well as any actions proposed for solving the quality problems. (iii) Based on this scheme, we created two automatically segmented and manually annotated discussion corpora extracted from the Simple English Wikipedia (SEWD) and the English Wikipedia (EWD). We evaluate how well text classification approaches can learn to assign the dialog act labels from our scheme to unseen discussion pages and achieve a cross-validated performance of F1 = 0.82 on the SEWD corpus while we obtain an average performance of F1 = 0.78 on the larger and more complex EWD corpus.
Website http://tuprints.ulb.tu-darmstadt.de/4092/

Important Copyright Notice:

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
