Using machine learning for linguistic analysis
Structural linguistics and recent machine learning methods share the notion that text is generated from mental representations. Linguistics views the genesis of a text as the encoding of a system of meaning into text (cf. e.g. Halliday and Hasan, 1976, p. 4), while topic models are a family of probabilistic generative machine learning approaches that model the generation of documents from a distribution over topics. This project combines both worlds: we examine whether, beyond this structural analogy, there are also analogies between parts of the system of meaning and machine-learned topics.
For our analysis, we annotate a text corpus with lexical cohesion relations and automatically acquire topics. We then use the topics to predict lexical cohesion: the topic membership of lexical items and significance scores between lexical items inform an automatic system for lexical chain annotation. Besides aiming at a state-of-the-art system for lexical chain identification, we analyse the semiotic interpretability of stochastic methods.
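As a rough illustration of what a significance score between lexical items can look like, the following sketch computes pointwise mutual information (PMI) over sentence-level co-occurrences. PMI is an assumption chosen for brevity; the text above does not say which significance measure the project uses. The function name `pmi_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """Return PMI (log base 2) for every word pair that co-occurs in a sentence.

    Assumption: PMI stands in for the unspecified significance measure.
    Each sentence is a list of tokens; counts are per sentence, not per token.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        types = sorted(set(sentence))  # count each word once per sentence
        word_counts.update(types)
        pair_counts.update(combinations(types, 2))  # pairs in sorted order

    n = len(sentences)
    return {
        (a, b): math.log2(joint * n / (word_counts[a] * word_counts[b]))
        for (a, b), joint in pair_counts.items()
    }


# Hypothetical toy corpus, already tokenized and lowercased.
toy_corpus = [
    ["bank", "money", "loan"],
    ["bank", "money", "interest"],
    ["river", "water", "boat"],
    ["river", "water", "bank"],
]
scores = pmi_scores(toy_corpus)
```

Pairs that never co-occur receive no entry, so a chaining system built on such scores would have to treat missing pairs as unrelated or apply smoothing.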
This project examines the correspondence between linguistic concepts and automatically extracted topic models. Specifically, we use LDA topics (Blei and Lafferty, 2009) to model lexical cohesion. To this end, we annotate text with cohesion relations and use topics and lexical co-occurrence statistics as features to assess the cohesion of a text and to compute lexical chains.
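To make the idea of topic membership as a chaining feature concrete, here is a minimal sketch that groups tokens sharing a topic assignment into one candidate chain. It assumes each token already carries a topic id, for example from LDA inference; the ids below are written by hand, and `chains_from_topics` and `min_length` are hypothetical names. It is a simplified illustration, not the project's actual system.

```python
from collections import defaultdict


def chains_from_topics(tokens, topic_ids, min_length=2):
    """Group tokens with the same topic id into candidate lexical chains.

    tokens     : list of word tokens of one document
    topic_ids  : one topic id per token, or None for tokens to ignore
                 (e.g. function words); assumed to come from a topic model
    min_length : chains with fewer items are discarded
    """
    if len(tokens) != len(topic_ids):
        raise ValueError("tokens and topic_ids must have the same length")

    chains = defaultdict(list)
    for position, (token, topic) in enumerate(zip(tokens, topic_ids)):
        if topic is None:
            continue
        chains[topic].append((position, token))  # keep position for later linking

    return {topic: items for topic, items in chains.items() if len(items) >= min_length}


# Hand-assigned, hypothetical topic ids: 0 = finance, 1 = nature.
tokens = ["bank", "raised", "interest", "rates", "river", "loan"]
topic_ids = [0, None, 0, 0, 1, 0]
chains = chains_from_topics(tokens, topic_ids)
```

In this toy input the single token assigned to topic 1 does not form a chain, since a chain needs at least `min_length` items.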
Unlike syntactically inspired projects, we focus on semantic aspects of texts. Further, we use topic representations to quantify the experiential function of documents.
Figure: sample lexical chains
Cohesion in English
M.A.K. Halliday and R. Hasan
In: English Language Series, Longman, London, 1976
Topic Models
D. Blei and J. Lafferty
In: A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. Taylor and Francis, 2009.
| Supervised All-Words Lexical Substitution using Delexicalized Features |
| György Szarvas and Chris Biemann and Iryna Gurevych In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), June 2013. |
| Exploring Cities in Crime: Significant Concordance and Co-occurrence in Quantitative Literary Analysis |
| Janneke Rauscher, Leonard Swiezinski, Martin Riedl, Chris Biemann In: Proceedings of the Workshop on Computational Linguistics for Literature, June 2013. |
| Three Knowledge-Free Methods for Automatic Lexical Chain Extraction |
| Steffen Remus and Chris Biemann In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), 2013. |
| Text: Now in 2D! A Framework for Lexical Expansion with Contextual Similarity |
| Chris Biemann, Martin Riedl In: Journal of Language Modelling, vol. 1, no. 1, 2013. |
| Text Segmentation with Topic Models |
| Martin Riedl and Chris Biemann In: Journal for Language Technology and Computational Linguistics (JLCL), vol. 27, no. 1, p. 47--70, August 2012. |
The LOEWE Research Center "Digital Humanities" is funded by the Hessian excellence program "Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz" (LOEWE).