Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.

JWPL (Java Wikipedia Library) is a open-source, Java-based application programming interface that allows to access all information contained in Wikipedia. The high-performance Wikipedia API provides structured access to information nuggets like redirects, categories, articles and link structure. It is described in our LREC 2008 paper

JWPL contains a Mediawiki Markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone with other texts using MediaWiki markup.

Further, JWPL contains the tool JWPLDataMachine that can be used to create JWPL dumps from the publicly available dumps at download.wikimedia.org

In addition to that, JWPL now contains the Wikipedia Revision Toolkit, which consists of two tools, the TimeMachine and the RevisionMachine. The TimeMachine can be used to reconstruct a snapshot of Wikipedia from a specific date, or to create multiple snapshots from a time span. The RevisionMachine offers efficient access to the edit history of Wikipedia articles while storing the revisions in a dedicated storage format which decreases the demand of storage space by 98%. The toolkit is described in our ACL system demonstration paper.


Downloads and further instructions how to add JWPL or the Wikipedia Revision Toolkit as Maven dependencies to your project can be found on the GitHub Project Site

The source code is provided under the LGPL v3.


Hierarchy Identification for Automatically Generating Table-of-Contents

Author Nicolai Erbs, Iryna Gurevych, Torsten Zesch
Date September 2013
Kind Inproceedings
Editor Galia Angelova, Kalina Bontcheva, Ruslan Mitkov
PublisherINCOMA Ltd.
AddressShoumen, Bulgaria
Book titleProceedings of 9th Conference on Recent Advances in Natural Language Processing (RANLP 2013)
LocationHissar, Bulgaria
Research Areas Ubiquitous Knowledge Processing, Knowledge Discovery in Scientific Literature, UKP_a_NLP4Wikis, UKP_p_WIWEB, UKP_p_WIKULU, reviewed, UKP_s_JWPL, UKP_s_DKPro_Lab, UKP_s_DKPro_Core, UKP_p_openwindow
Abstract A table-of-contents (TOC) provides a quick reference to a document’s content and structure. We present the first study on identifying the hierarchical structure for automatically generating a TOC using only textual features instead of structural hints e.g. from HTML-tags. We create two new datasets to evaluate our approaches for hierarchy identification. We find that our algorithm performs on a level that is sufficient for a fully automated system. For documents without given segment titles, we extend out work by auto matically generating segment titles. We make the datasets and our experimental framework publicly available in order to foster future research in TOC generation.
Full paper (pdf)
[Export this entry to BibTeX]

Important Copyright Notice:

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


A A A | Drucken Print | Impressum Impressum | Sitemap Sitemap | Suche Search | Kontakt Contact | Webseitenanalyse: Mehr Informationen
zum Seitenanfangzum Seitenanfang