Language Technologies for eHumanities

Objectives

  • Enable automatic processing of historical texts, which often survive only as fragments and do not have standardized orthography
  • Develop semantic techniques supporting large-scale content analysis of unstructured text data in humanities and social sciences
  • Provide the techniques we develop as services to researchers in social sciences and humanities

Projects in Language Technologies for eHumanities

  • LOEWE Research Center for Digital Humanities at Goethe University Frankfurt and Technische Universität Darmstadt

    • Text as a Product – Machine learning for linguistic text analysis

      Machine learning methods are widely used in automatic document classification and information retrieval. This subproject investigates the use of topic models in the automatic analysis of corpora. Topic models are generative probabilistic models that identify the main topics of a document collection, a task analogous to analyzing the cohesion and coherence of individual documents. As part of this subproject, a syntactically and semantically annotated corpus will be extended with lexical cohesion information. Statistical models such as topic models are then applied in order to examine the usefulness of statistical-semantics approaches with regard to the existing linguistic features.
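To make the notion of a topic model concrete, the following is a minimal collapsed Gibbs sampler for Latent Dirichlet Allocation, the best-known topic model. The toy corpus, hyperparameters, and number of topics are invented for illustration and are not data or settings from the project.

```python
import random

def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, unoptimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]  # token-topic assignments
    ndk = [[0] * n_topics for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # tokens per topic
    for di, d in enumerate(docs):
        for ti, w in enumerate(d):
            k = z[di][ti]
            ndk[di][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
    for _ in range(n_iter):
        for di, d in enumerate(docs):
            for ti, w in enumerate(d):
                k, wi = z[di][ti], widx[w]
                ndk[di][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # full conditional p(z = k | all other assignments)
                weights = [(ndk[di][j] + alpha) * (nkw[j][wi] + beta) / (nk[j] + V * beta)
                           for j in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[di][ti] = k
                ndk[di][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    # three most frequent words per topic
    top_words = [[vocab[wi] for wi in sorted(range(V), key=lambda i: -nkw[k][i])[:3]]
                 for k in range(n_topics)]
    return top_words, ndk

# Invented toy corpus: two poetry-themed and two economy-themed "documents"
docs = [["ballad", "rhyme", "stanza"], ["stanza", "rhyme", "meter"],
        ["tax", "trade", "guild"], ["guild", "trade", "market"]]
topics, doc_topic = lda_gibbs(docs)
```

In a real setting one would use an established implementation instead of a hand-rolled sampler; the point here is only the model's structure: per-document topic proportions and per-topic word distributions estimated jointly from the collection.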
    • Text as an Instance of a Language System – Contrastive comparison of non-canonical grammatical constructions between English and German

      Descriptions of natural language grammars tend to focus on the canonical constructions of a language, yet actual usage also displays constructions that are marked in various ways and thus deviate from the canonical form: the so-called non-canonical constructions. This subproject aims to validate the hypothesis that the grammar of a particular language constitutes a system of constructions centered on a set of canonical constructions, complemented by a set of peripheral non-canonical constructions. Since non-canonical constructions are rare compared to their canonical counterparts, a broad range of corpora must be collected to build the empirical foundation for the intended studies. On these corpora, studies of canonical and non-canonical constructions in German and English (for instance inversion, extraposition, and cleft sentences in English and their German equivalents) and comparative analyses between the two languages are to be performed using patterns over automatically identifiable features such as parts of speech and parse trees.
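As a rough illustration of what "patterns over automatically identifiable features" can mean in practice, the sketch below matches a regular expression against a sentence's part-of-speech tag sequence. The Penn Treebank-style tags are assigned by hand here, and the it-cleft pattern is a deliberately simplified stand-in, not one of the project's actual queries.

```python
import re

def matches_tag_pattern(tagged_sentence, pattern):
    """Test a regular expression against the space-joined POS-tag sequence."""
    tags = " ".join(tag for _, tag in tagged_sentence)
    return re.search(pattern, tags) is not None

# Hand-tagged it-cleft "It was the butler who left." (tags assigned by hand)
cleft = [("It", "PRP"), ("was", "VBD"), ("the", "DT"),
         ("butler", "NN"), ("who", "WP"), ("left", "VBD")]
# Hand-tagged canonical counterpart "The butler left."
canonical = [("The", "DT"), ("butler", "NN"), ("left", "VBD")]

# Simplified it-cleft pattern: pronoun + finite verb + noun (phrase) + wh-pronoun
IT_CLEFT = r"\bPRP VBD (DT )?NN[PS]? WP\b"
```

In a real study the tags would come from an automatic tagger, and patterns over full parses would be needed to capture constructions (such as extraposition) that a flat tag sequence cannot distinguish reliably.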

    • Text as a Process – Linguistic properties of collaboratively constructed Web 2.0 text

      Web 2.0 enables novel ways of collaboratively creating textual content. The ease of publication facilitates phenomena such as multiple authorship and the editing and reuse of text snippets, and it merges the roles of author and reader, making this the norm rather than the exception. Wikipedia is a unique corpus for linguistic research because of its size, its full revision history, its discussion pages, and its large number of (mostly anonymous) authors. As part of this subproject, Wikipedia will be linguistically enriched and its pragmatic properties will be studied using state-of-the-art language technology. This includes modeling article revisions and the dialogue acts on the discussion pages. The analysis will also cover features for automatic classification with machine learning techniques, the resulting insights into text quality and its development over time, and a cross-lingual and cross-domain comparison.
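As a toy sketch of what feature-based classification of article revisions might look like, the example below extracts a few invented features from a revision (size delta, wiki-link delta, edit comment) and trains a plain perceptron on made-up labels. None of the features, labels, or data come from the project; they only illustrate the general machine-learning setup.

```python
def revision_features(old_text, new_text, comment):
    """Toy feature vector for one revision (all features are invented examples)."""
    return [
        float(len(new_text) - len(old_text)),                # size delta
        float(new_text.count("[[") - old_text.count("[[")),  # wiki-link delta
        1.0 if "revert" in comment.lower() else 0.0,         # revert mentioned
        1.0,                                                 # bias term
    ]

def train_perceptron(examples, labels, epochs=10):
    """Plain perceptron; labels are +1 (quality-improving) / -1 (vandalism-like)."""
    w = [0.0] * len(examples[0])
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified -> additive update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

# Invented training data: improving edits add prose/links, bad edits blank text
X = [revision_features("Old.", "Old. More prose with a [[link]].", "expand section"),
     revision_features("A long stable paragraph here.", "gone", "blanked"),
     revision_features("Stub.", "Stub. Added [[references]].", "add refs"),
     revision_features("Some content with [[links]].", "x", "rm")]
y = [1, -1, 1, -1]
w = train_perceptron(X, y)
```

A realistic system would use far richer features (linguistic annotations, editor metadata, revision comments) and a stronger learner; the perceptron merely keeps the sketch self-contained.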

  • DARIAH-DE: Building Research Infrastructures for eHumanities

    DARIAH-DE is the German contribution to the EU ESFRI project DARIAH-EU. The mission of DARIAH is to enhance and support digitally enabled research across the arts and humanities. To this end, DARIAH aims to develop and maintain an infrastructure in support of research practices based on information and communication technology, so-called virtual research environments. The Ubiquitous Knowledge Processing Lab leads the work package Services for Digital Humanities, which contributes to the mission of DARIAH by providing illustrative prototypes and demonstrators that are specified in collaboration with researchers in the humanities and that build upon the general infrastructure and best practices developed by DARIAH.

  • CLARIN-D: Implementation of a web-based annotation platform for linguistic annotations

    We are developing a web-based annotation tool that runs in the browser without any installation effort and supports annotations on several linguistic layers within the same user interface. In addition, we are implementing an interface to crowdsourcing platforms so that simple annotation tasks can be scaled to a large number of annotators. The annotation platform will be connected to the CLARIN-D infrastructure in order to interoperate with the processing pipelines in WebLicht. Development of the tool is supported by a concurrent curation project that defines best practices for linguistic annotation on several linguistic layers and for different groups of annotators.
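One common way to support several annotation layers over the same text is stand-off annotation, where labeled character spans reference the text by offset rather than being embedded in it. The minimal data model below is a hypothetical illustration of that idea; the class and field names are invented and are not the actual data model of the tool described above.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Span:
    """One stand-off annotation: a labeled character span on a named layer."""
    layer: str   # e.g. "pos", "ner", "coref"
    begin: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str

@dataclass
class AnnotatedDocument:
    text: str
    spans: list = field(default_factory=list)

    def annotate(self, layer, begin, end, label):
        assert 0 <= begin < end <= len(self.text), "span outside document"
        self.spans.append(Span(layer, begin, end, label))

    def layer(self, name):
        """All annotations on one layer, in reading order."""
        return sorted((s for s in self.spans if s.layer == name),
                      key=lambda s: s.begin)

doc = AnnotatedDocument("Goethe lived in Weimar.")
doc.annotate("ner", 0, 6, "PER")    # "Goethe"
doc.annotate("ner", 16, 22, "LOC")  # "Weimar"
doc.annotate("pos", 0, 6, "NNP")
```

Because the layers never touch the text itself, annotators can work on different layers of the same document independently, which is exactly what a multi-layer user interface or a crowdsourcing backend needs.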

Selected Publications

Partners

  • LOEWE Research Center for Digital Humanities:
  • DARIAH-DE:
    • Berlin-Brandenburgische Akademie der Wissenschaften
    • DAASI International GmbH
    • Deutsches Archäologisches Institut
    • Technische Universität Darmstadt – Interdisziplinäre Arbeitsgruppe Digital Humanities (Philosophie / Ubiquitous Knowledge Processing / Computerphilologie)
    • Universität Paderborn – Musikwissenschaftliches Seminar Detmold/Paderborn
    • Georg-August-Universität Göttingen – Göttingen Center for Digital Humanities
    • Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
    • Universität Köln – Historisch-Kulturwissenschaftliche Informationsverarbeitung
    • Institut für Europäische Geschichte Mainz
    • Forschungszentrum Jülich GmbH – Jülich Supercomputing Centre
    • Karlsruher Institut für Technologie
    • Otto-Friedrich-Universität Bamberg – Fakultät für Wirtschaftsinformatik und Angewandte Informatik
    • Max-Planck-Gesellschaft – Max Planck Digital Library
    • Max-Planck-Gesellschaft – Rechenzentrum Garching
    • Salomon Ludwig Steinheim-Institut für deutsch-jüdische Geschichte
    • Niedersächsische Staats- und Universitätsbibliothek Göttingen
    • Universität Trier – Kompetenzzentrum für elektronische Erschließungs- und Publikationsverfahren in den Geisteswissenschaften
    • Julius-Maximilians-Universität Würzburg – Institut für deutsche Philologie – Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte

People
