Text as a Product

Using machine learning for linguistic analysis

Motivation

Structural linguistics and recent machine learning methods share the notion of text generation from mental representations. Linguistics views the genesis of a text as the encoding of a system of meaning into text (cf. e.g. Halliday and Hasan, 1976, p. 4), while topic models constitute a family of probabilistic generative machine learning approaches that model the generation of documents from a distribution over topics. This project combines both worlds: we examine whether, beyond the structural analogy, there are also analogies between parts of the system of meaning and machine-learned topics.

Goals

For our analysis, we annotate a text corpus with lexical cohesion relations and automatically acquire topics. We then use the topics to predict lexical cohesion: topic membership of lexical items and significance scores between lexical items inform an automatic system for lexical chain annotation. Besides aiming at a state-of-the-art system for lexical chain identification, we analyse the semiotic interpretability of stochastic methods.

Methods

This project examines the correspondence between linguistic concepts and automatically extracted topics. Specifically, we use topics from Latent Dirichlet Allocation (LDA; Blei & Lafferty, 2009) to model lexical cohesion. To this end, we annotate text with cohesion relations and use topics and lexical co-occurrence statistics as features to assess the cohesion of a text and to compute lexical chains.
Unlike syntactically inspired projects, we focus on semantic aspects of texts. Further, we use topic representations to quantify the experiential function of documents.
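
To illustrate how topic membership could inform lexical chains, the following is a minimal sketch rather than the project's actual system. It assumes each lexical item has already been assigned a single most probable topic id (here supplied as a hand-written dictionary named topic_of, which is hypothetical) and simply groups items that share a topic into one chain. The function name chains_from_topics and the min_length parameter are likewise illustrative assumptions.

```python
from collections import defaultdict


def chains_from_topics(tokens, topic_of, min_length=2):
    """Group lexical items into chains by shared topic id.

    tokens:     lexical items in document order.
    topic_of:   hypothetical mapping from a lexical item to its most
                probable topic id (e.g. obtained from a trained topic model).
    min_length: chains with fewer members are discarded.
    """
    chains = defaultdict(list)
    for position, token in enumerate(tokens):
        topic = topic_of.get(token)
        if topic is not None:  # skip items without a topic assignment
            chains[topic].append((position, token))
    # Keep only chains that link at least min_length items.
    return {t: members for t, members in chains.items()
            if len(members) >= min_length}


# Toy input: topic ids are invented for illustration only.
tokens = ["bank", "loan", "river", "interest", "water", "credit"]
topic_of = {"bank": 0, "loan": 0, "interest": 0, "credit": 0,
            "river": 1, "water": 1}
chains = chains_from_topics(tokens, topic_of)
```

In this toy example, the items sharing topic 0 form one chain and those sharing topic 1 form another. A fuller system would presumably also draw on co-occurrence significance scores, which this sketch leaves out.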

Figure: sample lexical chains

References

Cohesion in English
M.A.K. Halliday and R. Hasan
In: English Language Series, Longman, London, 1976.

Topic Models
D. Blei and J. Lafferty
In: A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. Taylor and Francis, 2009.

Project Publications

Supervised All-Words Lexical Substitution using Delexicalized Features
György Szarvas and Chris Biemann and Iryna Gurevych
In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), June 2013.

Exploring Cities in Crime: Significant Concordance and Co-occurrence in Quantitative Literary Analysis
Janneke Rauscher, Leonard Swiezinski, Martin Riedl, Chris Biemann
In: Proceedings of the Workshop on Computational Linguistics for Literature, June 2013.

Three Knowledge-Free Methods for Automatic Lexical Chain Extraction
Steffen Remus and Chris Biemann
In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), 2013.

Text: Now in 2D! A Framework for Lexical Expansion with Contextual Similarity
Chris Biemann, Martin Riedl
In: Journal of Language Modelling, vol. 1, no. 1, 2013.

Text Segmentation with Topic Models
Martin Riedl and Chris Biemann
In: Journal for Language Technology and Computational Linguistics (JLCL), vol. 27, no. 1, pp. 47-70, August 2012.

Funding

The LOEWE Research Center "Digital Humanities" is funded by the Hessian excellence program "Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz" (LOEWE).
