Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.

JWPL (Java Wikipedia Library) is an open-source, Java-based application programming interface that provides access to all information contained in Wikipedia. The high-performance Wikipedia API provides structured access to information nuggets such as redirects, categories, articles, and link structure. It is described in our LREC 2008 paper.

JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone on other texts that use MediaWiki markup.
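
To illustrate the kind of analysis that parsing MediaWiki markup enables, the following minimal sketch extracts the targets of internal links from a markup string. It is not the JWPL parser and does not use the JWPL API; the class name WikiLinkSketch and the method extractLinkTargets are hypothetical names chosen for this example, and the regular expression covers only the simple [[Target]], [[Target|label]], and [[Target#Section]] forms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the JWPL MediaWiki parser.
class WikiLinkSketch {

    // Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
    // Group 1 captures the link target without section anchor or label.
    // Note: namespace links such as [[Category:X]] are matched as well.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|#]+)(?:#[^\\]|]*)?(?:\\|[^\\]]*)?\\]\\]");

    // Returns the targets of all internal links in document order.
    static List<String> extractLinkTargets(String markup) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = LINK.matcher(markup);
        while (matcher.find()) {
            targets.add(matcher.group(1).trim());
        }
        return targets;
    }

    public static void main(String[] args) {
        String markup = "'''Java''' is a [[programming language]] "
                + "developed at [[Sun Microsystems|Sun]].";
        // prints [programming language, Sun Microsystems]
        System.out.println(extractLinkTargets(markup));
    }
}
```

A full parser additionally has to handle templates, tables, nested markup, and many edge cases that a single regular expression cannot cover, which is why a dedicated parser component is useful.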

Further, JWPL contains the tool JWPLDataMachine, which can be used to create JWPL dumps from the publicly available dumps at download.wikimedia.org.

In addition, JWPL now contains the Wikipedia Revision Toolkit, which consists of two tools, the TimeMachine and the RevisionMachine. The TimeMachine can be used to reconstruct a snapshot of Wikipedia from a specific date, or to create multiple snapshots from a time span. The RevisionMachine offers efficient access to the edit history of Wikipedia articles while storing the revisions in a dedicated storage format that reduces the required storage space by 98%. The toolkit is described in our ACL system demonstration paper.
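
The idea of reconstructing a snapshot for a given date can be sketched conceptually as follows: for every article, keep the most recent revision whose timestamp is not later than the requested date. This is a simplified in-memory illustration under that assumption, not the TimeMachine implementation; the names SnapshotSketch, Revision, and snapshotAt are hypothetical and not part of the JWPL API.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT how the TimeMachine is implemented.
class SnapshotSketch {

    // A single revision of an article (simplified, assumed data model).
    static final class Revision {
        final String articleTitle;
        final Instant timestamp;
        final String text;

        Revision(String articleTitle, Instant timestamp, String text) {
            this.articleTitle = articleTitle;
            this.timestamp = timestamp;
            this.text = text;
        }
    }

    // For each article, selects the latest revision made at or before the cutoff.
    // Articles that did not exist yet at the cutoff are absent from the result.
    static Map<String, Revision> snapshotAt(List<Revision> history, Instant cutoff) {
        Map<String, Revision> snapshot = new HashMap<>();
        for (Revision rev : history) {
            if (rev.timestamp.isAfter(cutoff)) {
                continue; // revision is newer than the requested snapshot date
            }
            Revision current = snapshot.get(rev.articleTitle);
            if (current == null || rev.timestamp.isAfter(current.timestamp)) {
                snapshot.put(rev.articleTitle, rev);
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        List<Revision> history = Arrays.asList(
                new Revision("Java", Instant.parse("2009-01-01T00:00:00Z"), "first version"),
                new Revision("Java", Instant.parse("2010-03-01T00:00:00Z"), "second version"),
                new Revision("Java", Instant.parse("2011-01-01T00:00:00Z"), "third version"));
        Map<String, Revision> snapshot =
                snapshotAt(history, Instant.parse("2010-06-01T00:00:00Z"));
        System.out.println(snapshot.get("Java").text);
    }
}
```

Creating multiple snapshots from a time span then amounts to repeating this selection for several cutoff dates.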


Downloads and further instructions on how to add JWPL or the Wikipedia Revision Toolkit as Maven dependencies to your project can be found on the GitHub project site.

The source code is provided under the LGPL v3.

Project Publications

The Writing Process in Online Mass Collaboration: NLP-Supported Approaches to Analyzing Collaborative Revision and User Interaction
Johannes Daxenberger

Approaches to Automatic Text Structuring
Nicolai Erbs
September 2015.

The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia
Oliver Ferschke
July 2014.

Hierarchy Identification for Automatically Generating Table-of-Contents
Nicolai Erbs, Iryna Gurevych, Torsten Zesch
In: Galia Angelova and Kalina Bontcheva and Ruslan Mitkov: Proceedings of 9th Conference on Recent Advances in Natural Language Processing (RANLP 2013), p. 252-260, INCOMA Ltd., September 2013. ISSN 1313-8502.

Bringing Order to Digital Libraries: From Keyphrase Extraction to Index Term Assignment
Nicolai Erbs, Iryna Gurevych, Marc Rittberger
In: D-Lib Magazine, vol. 19, no. 9/10, p. 1-16, September 2013.

Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History
Oliver Ferschke, Torsten Zesch, Iryna Gurevych
In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, p. 97-102, June 2011.

Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
Torsten Zesch, Christof Müller, Iryna Gurevych
In: Proceedings of the 6th International Conference on Language Resources and Evaluation, May 2008.

