In the text mining and text analytics research area, we design algorithms to extract information from unstructured text. These algorithms are used in many contexts, e.g. Digital Humanities, educational research, research on the Web 2.0, or information retrieval. We particularly focus on innovative automated approaches to discovering structure in textual documents by means of text classification.

The growing text analytics field relies heavily on supervised text classification to offer services such as sentiment analysis, document categorization, or scientific discovery. In a nutshell, supervised text classification extracts relevant information from manually classified documents and learns a model from the extracted information. Machine learning classifiers learn to make decisions autonomously, so there is no need to hand-code the rules that would otherwise drive those automatic decisions.
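
The two steps named above (extracting information from manually classified documents, then learning a model from it) can be illustrated with a minimal sketch. It is not the method used in any of the projects on this page: it assumes a plain bag-of-words representation and a multinomial Naive Bayes classifier with add-one smoothing, written in standard-library Python. The toy documents, the labels, and the function names `tokenize`, `train`, and `predict` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a document and split it on whitespace (the feature extraction step)."""
    return text.lower().split()


def train(labeled_docs):
    """Learn per-class word counts and class frequencies from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter()            # label -> number of training documents
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability under the learned model."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior: how frequent the class is in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually labeled toy training data.
training_data = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("a boring and predictable plot", "negative"),
    ("terrible acting and a dull story", "negative"),
]

model = train(training_data)
print(predict(model, "a great and moving story"))
print(predict(model, "dull and boring"))
```

No classification rule is written by hand here: the decision for a new document follows entirely from the word statistics of the labeled examples. Real systems typically replace the whitespace tokenizer with richer linguistic features and train on far larger datasets.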

We apply supervised text classification algorithms to complex language processing problems and novel datasets. In such settings, a textual document is typically enhanced with automatic annotations about grammatical and discourse structure, before the information relevant to the given problem is extracted. To reduce the effort of manually creating training data, we are currently also exploring the use of semi-supervised and unsupervised text mining algorithms.


Beyond supervised text classification for novel language processing tasks, the text mining and analytics area carries out research about:

Current Projects

  • CEDIFOR: This project aims to foster interdisciplinary work between Computer Science and Digital Humanities by providing know-how and research infrastructures for text analytics to humanities researchers in the Rhein-Main area, supporting them in investigating novel research questions. This project is conducted in collaboration with the Goethe-Universität Frankfurt and the German Institute for International Educational Research (DIPF).

  • Audiovisual Content Processing: The goal of this project is the creation of frameworks which facilitate the integration of manual and automatic analysis of audiovisual content, and the identification of the most relevant audiovisual features for different tasks in Digital Humanities. The developed tools will be integrated as audiovisual processing components into the UIMA/DKPro framework.

  • Structuring Story-Chains: Nearly everyone struggles to keep up with ever-growing amounts of information, which makes information overload a major problem in today's society. The news domain is no exception. Current search engines retrieve information based on keywords and sort the results by their relevance to the search query, so the large number of returned articles makes it hard to understand how an event has evolved. In this project, we aim to develop novel methods for structuring news stories more coherently by discovering and modeling causal connections between articles, in order to present complex news stories in a simpler way and reduce information overload.

Past Projects

  • Personality Profiling in Books: For e-book recommendation systems, it can be very helpful to know the answers to high-level content questions that readers may have, for example "What is the main hero like?", "Is the story complicated?" or "Is the book suitable for children?". The idea of this project is to leverage real-world knowledge resources to help a machine learning system estimate answers to such questions. To reach this goal, the initial research focus lies in identifying suitable approaches for integrating semantic knowledge into text classification algorithms.

  • VISADOC: This project investigates novel textual features for modeling content-related text properties. It aims to develop an interactive feature engineering approach for complex user-defined semantic properties, as well as visual analysis tools that support the exploration of large document collections with respect to a certain text property.

  • IT Forensics/CASED: New forms of communication in the Web 2.0 are increasingly used for preparing and organizing crimes such as sexual harassment or human trafficking. This project aims to create tools that help investigate such crimes by finding relevant documents, identifying relevant pieces of information, and analyzing the relations between them.

  • LOEWE TP 2.3: The “Text as Process” research area of the interdisciplinary LOEWE Research Center “Digital Humanities” deals with linguistic properties of collaboratively created texts in the Web 2.0. It focuses on the investigation of mass collaboration in online settings by analyzing the quality of content, the history of documents, background discussions, collaboration patterns, and user roles. To this end, the project develops novel datasets based on article histories and discussion pages from the online encyclopedia Wikipedia.

  • THESEUS TEXO: The THESEUS project strives to develop application-oriented base technologies, technical standards, and products that allow users and companies to access services, content, and knowledge all over the world. TEXO is a use case in the THESEUS program that focuses on the discovery of new services and on combining them to create new business.

Completed PhD Theses


Additional Attributes


UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, Iryna Gurevych
In: Proceedings of the 27th Conference of the German Society for Computational Linguistics (GSCL 2017), p. (to appear), September 2017

On the "How" and "Why" of Emergent Role Behaviors in Wikipedia

Ofer Arazy, Hila Lifshitz-Assaf, Oded Nov, Johannes Daxenberger, Martina Balestra, Coye Cheshire
In: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, p. 2039-2051, 2017

Turbulent Stability of Emergent Roles: The Dualistic Nature of Self-Organizing Knowledge Co-Production

Ofer Arazy, Johannes Daxenberger, Hila Lifshitz-Assaf, Oded Nov, Iryna Gurevych
In: Information Systems Research, Vol. 27, p. 792-812, December 2016
[Online-Edition: http://pubsonline.informs.org/doi/abs/10.1287/isre.2016.0647]

Automated Text Classification to Capture Scientific Reasoning and Argumentation Processes in Different Professional Problem Solving Contexts

Andras Csanadi, Johannes Daxenberger, Christian Ghanem, Ingo Kollar, Frank Fischer, Iryna Gurevych
July 2016

A User Interface for the Exploration of Manually and Automatically Coded Scientific Reasoning and Argumentation

Patrick Lerner, Andras Csanadi, Johannes Daxenberger, Lucie Flekova, Christian Ghanem, Ingo Kollar, Frank Fischer, Iryna Gurevych
In: Proceedings of the International Conference of the Learning Sciences (ICLS) 2016, p. 938-941, June 2016
International Society of the Learning Sciences
[Online-Edition: https://reason.ukp.informatik.tu-darmstadt.de:9443/]

Emergent Roles in Self-Organizing Knowledge Co-Production: Turbulence and Stability

Ofer Arazy, Johannes Daxenberger, Hila Lifshitz-Assaf, Oded Nov, Iryna Gurevych
June 2016
[Online-Edition: https://sites.google.com/a/stern.nyu.edu/collective-intelligence-conference/]

Mass Collaboration on the Web: Textual Content Analysis by Means of Natural Language Processing

Ivan Habernal, Johannes Daxenberger, Iryna Gurevych
In: Mass Collaboration and Education, Vol. 16, p. 367-390, February 2016
Springer International Publishing
[Online-Edition: http://doi.org/10.1007/978-3-319-13536-6_18]

Adjacency Pair Recognition in Wikipedia Discussions using Lexical Pairs

Emily Jamison, Iryna Gurevych
In: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, p. 479-488, December 2014
Department of Linguistics, Chulalongkorn University
[Online-Edition: http://www.arts.chula.ac.th/~ling/paclic28/]

Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets

Emily Jamison, Iryna Gurevych
In: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, p. 244-253, December 2014
Department of Linguistics, Chulalongkorn University
[Online-Edition: http://www.arts.chula.ac.th/~ling/paclic28/]

Primary Contact

Dr. Johannes Daxenberger
