Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.
JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that provides access to all information contained in Wikipedia.
The high-performance Wikipedia API provides structured access to information nuggets like redirects, categories, articles and link structure. It is described in our LREC 2008 paper.
JWPL now contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone on other texts that use MediaWiki markup.
JWPL now contains the tool JWPLDataMachine that can be used to create JWPL dumps from the publicly available dumps at http://download.wikimedia.org.
Download
Free non-profit and non-commercial use is granted provided that the license agreement on the download page is accepted. The authors do not assume any responsibility for downloaded software and the use of the software.
To use this software for commercial purposes, please contact Prof. Dr. Iryna Gurevych, head of the Ubiquitous Knowledge Processing (UKP) Lab.
Download — API
07.07.2009 -- There is a bug in the DataMachine that causes the transformation of the current English dump to stop with an exception complaining about "Invalid contributor". We have solved that issue and will include the updated DataMachine in the next release (probably after my thesis defense :). In the meantime, you can get an updated version via email. Many thanks to all who reported the bug.
06.02.2009 -- Version v0.453b is available. This is a bug fix release with a few new features.
07.01.2009 -- Version v0.452b is available. This is a bug fix release.
08.12.2008 -- Version v0.45b is available. This release supersedes the previous releases v0.44beta and v0.3beta.
Download JWPL (Java Wikipedia Library) v0.452beta
Main improvements in v0.453b
- Fixed a bug that, under special circumstances, added non-reference links to the list of inlinks or outlinks.
- Added a method in the Page object to directly retrieve the link anchor texts.
- A couple of minor stability bug fixes.
Main improvements in v0.452b
- Fixed the issue that getParents() and getChildren() in the Category object had inverted functionality.
Main improvements in v0.45b
- The Page object has two new methods, getNumberOfInlinks() and getNumberOfOutlinks(). These are much faster than the old way of writing getInlinks().size().
- The Wikipedia object now contains a method for getting a page by its pageId.
- A lot of bug fixes. Thanks to all who provided feedback.
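One way to picture why a dedicated count method can beat getInlinks().size() is the sketch below. It is not JWPL code: LinkTableSketch, its in-memory map, and the pagesLoaded counter are hypothetical stand-ins for a database-backed link table, and the assumption is that building page objects is the expensive part.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a link table; not JWPL's storage layer. */
class LinkTableSketch {
    // target page id -> ids of the pages that link to it
    private final Map<Integer, List<Integer>> inlinkIds = new HashMap<>();
    // counts how many page objects were materialized (the assumed expensive step)
    int pagesLoaded = 0;

    void addLink(int fromPageId, int toPageId) {
        inlinkIds.computeIfAbsent(toPageId, k -> new ArrayList<>()).add(fromPageId);
    }

    /** Builds one page object per inlink, even if the caller only wants a count. */
    List<String> getInlinks(int pageId) {
        List<String> pages = new ArrayList<>();
        for (int id : inlinkIds.getOrDefault(pageId, Collections.<Integer>emptyList())) {
            pages.add(loadPage(id));
        }
        return pages;
    }

    /** Answers the count from the id list alone; no page objects are built. */
    int getNumberOfInlinks(int pageId) {
        return inlinkIds.getOrDefault(pageId, Collections.<Integer>emptyList()).size();
    }

    private String loadPage(int id) {
        pagesLoaded++;
        return "Page#" + id;
    }

    public static void main(String[] args) {
        LinkTableSketch table = new LinkTableSketch();
        table.addLink(1, 3);
        table.addLink(2, 3);
        System.out.println(table.getNumberOfInlinks(3) + " inlinks, pages loaded: " + table.pagesLoaded);
        System.out.println(table.getInlinks(3).size() + " inlinks, pages loaded: " + table.pagesLoaded);
    }
}
```

Both calls report the same number, but only the second one pays for creating the page objects.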
Main improvements in v0.44b
- The release now contains the tool JWPLDataMachine that can be used to create JWPL dumps from the publicly available dumps at http://download.wikimedia.org.
Main improvements in v0.44beta
- JWPL now contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone on other texts that use MediaWiki markup.
- The performance of iterating through pages and categories has been significantly improved (>10 times faster than before).
- Added support for all currently used Wikipedia languages.
- The method getDescendants() now returns a buffered Iterable instead of a Set. This significantly lowers the memory usage.
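The memory benefit of returning an Iterable rather than a Set can be illustrated with a small lazy traversal. This is only a sketch under stated assumptions, not the JWPL implementation: LazyDescendants is a hypothetical class, the category graph is a plain in-memory map of category ids, and the "buffer" is simply the queue of ids still waiting to be visited.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

/** Hypothetical lazy breadth-first walk over a category graph given as id -> child ids. */
class LazyDescendants implements Iterable<Integer> {
    private final Map<Integer, List<Integer>> children;
    private final int root;

    LazyDescendants(Map<Integer, List<Integer>> children, int root) {
        this.children = children;
        this.root = root;
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            // ids discovered but not yet handed to the caller
            private final Deque<Integer> frontier = new ArrayDeque<>();
            // ids already discovered; guards against cycles in the graph
            private final Set<Integer> seen = new HashSet<>();

            {
                seen.add(root);
                enqueueChildren(root);
            }

            private void enqueueChildren(int id) {
                for (int child : children.getOrDefault(id, Collections.<Integer>emptyList())) {
                    if (seen.add(child)) {
                        frontier.add(child);
                    }
                }
            }

            @Override
            public boolean hasNext() {
                return !frontier.isEmpty();
            }

            @Override
            public Integer next() {
                if (frontier.isEmpty()) {
                    throw new NoSuchElementException();
                }
                int id = frontier.poll();
                enqueueChildren(id); // expand only when the caller asks for more
                return id;
            }
        };
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, Arrays.asList(2, 3));
        graph.put(2, Arrays.asList(4));
        graph.put(3, Arrays.asList(4));
        graph.put(4, Arrays.asList(1)); // cycle back to the root
        for (int id : new LazyDescendants(graph, 1)) {
            System.out.println(id);
        }
    }
}
```

A caller that stops early never expands the rest of the graph. The sketch tracks only ids, so full category objects would be created one at a time as the caller iterates.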
Other changes
- Fixed an error where the existsPage() method sometimes threw an org.hibernate.NonUniqueResultException. This was due to a missing COLLATE clause in the database query.
- Changed the getPage(String), getCategory(String), and existsPage(String) methods of the Wikipedia object to also work with lowercase keyword queries or with underscores instead of spaces.
- Changed the package name.
- Added a new constructor to the database configuration object that sets all necessary parameters.
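The relaxed title lookup mentioned in the list above (lowercase queries, underscores instead of spaces) can be pictured as a normalization step applied before the query. The helper below is a hypothetical sketch, not JWPL's actual code. It assumes the MediaWiki convention that stored titles use underscores and begin with an uppercase letter, and it uppercases only the first character.

```java
/** Hypothetical title normalization; the real JWPL lookup may work differently. */
class TitleNormalizerSketch {

    /** Maps a user query to the assumed stored form: underscores, first letter uppercase. */
    static String normalize(String query) {
        String title = query.trim().replace(' ', '_');
        if (title.isEmpty()) {
            return title;
        }
        return Character.toUpperCase(title.charAt(0)) + title.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(normalize("albert Einstein"));
        System.out.println(normalize("Albert_Einstein"));
    }
}
```

Both calls in main yield the same key, Albert_Einstein, so either spelling would reach the same page in a lookup that normalizes first.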
Download — Data
Starting with release 0.44b JWPL contains the tool JWPLDataMachine that allows users to transform the publicly available dumps into the format required for JWPL.
Already transformed Wikipedia data is available for the following languages. Downloads are offered via BitTorrent.
- English (en)
- German (de)
- Czech (cs)
- Ukrainian (uk)
If you cannot use BitTorrent, you may also use the FTP server as a last resort. Please try the torrents first.
Documentation
See Documentation page and FAQ.
Please use our bugzilla system for reporting bugs and feature requests.
If you use JWPL in scientific work, please cite
@INPROCEEDINGS{ZeschMuellerGurevych2008,
author = {Torsten Zesch and Christof M{\"u}ller and Iryna Gurevych},
title = {{Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary}},
booktitle = {Proceedings of the Conference on Language Resources and Evaluation (LREC)},
year = {2008},
}
Is JWPL for you?
JWPL is for you:
- if you need structured access to Wikipedia in Java
JWPL is not for you:
- if you need to query live data. To use our Wikipedia API, you need an optimized data transformation available from our ftp server, i.e. you are querying a static Wikipedia dump. This gives much better performance and lightens the load on the Wikipedia servers.
- if you need information about page edit history. This feature is not implemented so far.
JWPL is maintained by Torsten Zesch and Elisabeth Wolf.
If you have any technical questions, please write to jwpl@tk.informatik.tu-darmstadt.de





