In the text mining and text analytics research area, we design algorithms to extract information from unstructured text. These algorithms are used in many contexts, e.g. Digital Humanities, educational research, research on the Web 2.0, or information retrieval. We focus in particular on innovative automated approaches that discover structure in textual documents by means of text classification.

The growing text analytics field relies heavily on supervised text classification to offer services such as sentiment analysis, document categorization, or scientific discovery. In a nutshell, supervised text classification extracts relevant information from manually classified documents and learns a model from the extracted information. Machine learning classifiers learn to make decisions autonomously, so there is no need to hand-code the rules that would otherwise be used to make those decisions automatically.
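To make the idea concrete, the following is a minimal sketch of supervised text classification. It is not taken from any of the projects described here. It assumes a toy sentiment dataset, whitespace tokenization as the only feature extraction step, and a multinomial Naive Bayes model with add-one smoothing as the learner. The function names (tokenize, train, classify) and the example sentences are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real feature extraction."""
    return text.lower().split()


def train(labeled_docs):
    """Learn class frequencies and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior: share of training documents carrying this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually labeled training documents.
training_data = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a wonderful story", "positive"),
    ("a dull and boring film", "negative"),
    ("boring story and terrible acting", "negative"),
]

model = train(training_data)
print(classify(model, "a wonderful story"))
print(classify(model, "a boring film"))
```

No classification rule is written by hand in this sketch: the decision for a new document follows entirely from the word statistics gathered from the labeled examples. A realistic system would replace the tokenizer with richer features and evaluate the model on held-out data.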

We apply supervised text classification algorithms to complex language processing problems and novel datasets. In such settings, a textual document is typically enriched with automatic annotations of its grammatical and discourse structure before the information relevant to the given problem is extracted. To reduce the effort of manually creating training data, we are also exploring the use of semi-supervised and unsupervised text mining algorithms.


Beyond supervised text classification for novel language processing tasks, the text mining and analytics area carries out research about:

Current Projects

  • CEDIFOR: This project aims to foster interdisciplinary work between Computer Science and Digital Humanities by providing know-how and research infrastructures for text analytics to humanities researchers in the Rhein-Main area, supporting them in investigating novel research questions. This project is conducted in collaboration with the Goethe-Universität Frankfurt and the German Institute for International Educational Research (DIPF).

  • Audiovisual Content Processing: The goal of this project is the creation of frameworks which facilitate the integration of manual and automatic analysis of audiovisual content, and the identification of the most relevant audiovisual features for different tasks in Digital Humanities. The developed tools will be integrated as audiovisual processing components into the UIMA/DKPro framework.

  • Structuring Story-Chains: Nearly everyone struggles to keep up with ever larger amounts of information, which makes information overload a major problem in today's society. The news domain is no exception. Current search engines retrieve information based on keywords and sort the results by their relevance to the search query, so the large number of returned articles makes it hard to understand how an event evolved. In this project, we aim to develop novel methods for structuring news stories more coherently by discovering and modeling causal connections between articles, in order to present complex news stories in a simpler way and reduce information overload.

Past Projects

  • Personality Profiling in Books: For e-book recommendation systems, it can be very helpful to know the answers to high-level content questions that readers may have, for example "What is the main hero like?", "Is the story complicated?" or "Is the book suitable for children?". The idea of this project is to leverage real-world knowledge resources to help a machine learning system estimate answers to such questions. To reach this goal, the initial research focus lies in identifying suitable approaches for integrating semantic knowledge into text classification algorithms.

  • VISADOC: This project investigates novel textual features for modeling content-related text properties. It aims to develop an interactive feature engineering approach for complex user-defined semantic properties, as well as visual analysis tools that support the exploration of large document collections with respect to a certain text property.

  • IT Forensics/CASED: New forms of communication in the Web 2.0 are increasingly used for preparing and organizing crimes such as sexual harassment or human trafficking. This project aims to create tools that aid in investigating such crimes: finding relevant documents, identifying relevant pieces of information, and analyzing the relations between them.

  • LOEWE TP 2.3: The “Text as Process” research area of the interdisciplinary LOEWE Research Center “Digital Humanities” deals with linguistic properties of collaboratively created texts in the Web 2.0. It focuses on the investigation of mass collaboration in online settings by analyzing the quality of content, the history of documents, background discussions, collaboration patterns, and user roles. To this end, the project develops novel datasets based on article histories and discussion pages from the online encyclopedia Wikipedia.

  • THESEUS TEXO: The THESEUS project strives to develop application-oriented base technologies, technical standards, and products that allow users and companies to access services, content, and knowledge all over the world. TEXO is a use case in the THESEUS program that focuses on the discovery of new services and on combining them to create new business.

Completed PhD Theses



On the "How" and "Why" of Emergent Role Behaviors in Wikipedia

Ofer Arazy, Hila Lifshitz-Assaf, Oded Nov, Johannes Daxenberger, Martina Balestra, Coye Cheshire
In: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, p. 2039-2051, February 2017

Turbulent Stability of Emergent Roles: The Dualistic Nature of Self-Organizing Knowledge Co-Production

Ofer Arazy, Johannes Daxenberger, Hila Lifshitz-Assaf, Oded Nov, Iryna Gurevych
In: Information Systems Research, Vol. 27, p. 792-812, December 2016
[Online-Edition: http://pubsonline.informs.org/doi/abs/10.1287/isre.2016.0647]

A User Interface for the Exploration of Manually and Automatically Coded Scientific Reasoning and Argumentation

Patrick Lerner, Andras Csanadi, Johannes Daxenberger, Lucie Flekova, Christian Ghanem, Ingo Kollar, Frank Fischer, Iryna Gurevych
In: Proceedings of the International Conference of the Learning Sciences (ICLS) 2016, p. 938-941, June 2016
International Society of the Learning Sciences
[Online-Edition: https://reason.ukp.informatik.tu-darmstadt.de:9443/]

Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets

Emily Jamison, Iryna Gurevych
In: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, p. 244-253, December 2014
Department of Linguistics, Chulalongkorn University
[Online-Edition: http://www.arts.chula.ac.th/~ling/paclic28/]

DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data

Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych, Torsten Zesch
In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, p. 61-66, June 2014
Association for Computational Linguistics
[Online-Edition: https://github.com/dkpro/dkpro-tc]

Automatically Detecting Corresponding Edit-Turn-Pairs in Wikipedia

Johannes Daxenberger, Iryna Gurevych
In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), p. 187-192, June 2014
Association for Computational Linguistics
[Online-Edition: https://www.ukp.tu-darmstadt.de/data/discourse-analysis/wikipedia-edit-turn-pair-corpus/]

What Makes a Good Biography? Multidimensional Quality Analysis Based on Wikipedia Article Feedback Data

Lucie Flekova, Oliver Ferschke, Iryna Gurevych
In: Proceedings of the 23rd International World Wide Web Conference (WWW 2014), p. 855-866, April 2014
International World Wide Web Conferences Steering Committee
[Online-Edition: https://www.ukp.tu-darmstadt.de/data/quality-assessment/wikipedia-article-feedback/]

Automatically Classifying Edit Categories in Wikipedia Revisions

Johannes Daxenberger, Iryna Gurevych
In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), p. 578-589, October 2013
Association for Computational Linguistics
[Online-Edition: www.ukp.tu-darmstadt.de/data/textual-revisions/edit-category-classification/]

Headerless, Quoteless, but not Hopeless? Using Pairwise Email Classification to Disentangle Email Threads

Emily Jamison, Iryna Gurevych
In: Proceedings of 9th Conference on Recent Advances in Natural Language Processing (RANLP 2013), p. 327-335, September 2013

The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

Oliver Ferschke, Iryna Gurevych, Marc Rittberger
In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Vol. 1, p. 721-730, August 2013
Association for Computational Linguistics
[Online-Edition: http://www.ukp.tu-darmstadt.de/data/wiki-flaws/]

Primary Contact

Dr. Johannes Daxenberger
