Semantic Information Retrieval (SIR)
Integrating semantic relatedness into information retrieval to overcome the problem of term mismatch between queries and documents.
Motivation
A frequent problem in information retrieval (IR) is the gap between the vocabulary used to formulate the user's information need (topic) and the vocabulary used in the documents of the collection being queried. An example of this problem is the domain of electronic career guidance, where an IR system helps young people decide which profession to choose by automatically computing a ranked list of professions according to the user's interests. The IR system compares a short essay written by the user with descriptions of professions written by domain experts. Typically, people seeking career advice describe their professional preferences in different words than those used in the professionally prepared descriptions. Lexical semantic knowledge and soft matching, i.e. matching semantically related terms, should therefore be especially beneficial to such a system.
Goals
Improve the performance of IR on domain-specific document collections:
- increase recall (by closing the vocabulary gap)
- increase precision (especially for the first 10 ranks)
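The two goals above correspond to standard evaluation measures. The following is a minimal sketch of how recall and precision at the first 10 ranks are commonly computed. The function names and the toy data are illustrative assumptions and are not taken from the project's software.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        return 0.0
    top = ranked[:k]
    hits = sum(1 for item in top if item in relevant)
    return hits / k


def recall(ranked, relevant):
    """Fraction of all relevant items that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in set(ranked) if item in relevant)
    return hits / len(relevant)


# Toy example (hypothetical data): a ranking of professions for one user,
# and the set of professions judged relevant for that user.
ranked_professions = ["baker", "cook", "programmer", "florist"]
relevant_professions = {"baker", "programmer", "confectioner"}

p_at_2 = precision_at_k(ranked_professions, relevant_professions, k=2)
r = recall(ranked_professions, relevant_professions)
```

In this toy case, one of the top two professions is relevant, and two of the three relevant professions are retrieved at all. Closing the vocabulary gap aims at retrieving items such as "confectioner" that exact term matching would miss.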
Methods
- Integrating semantic relatedness into IR models
- Combining linguistic knowledge sources, e.g. German wordnet, and Web 2.0 knowledge sources, e.g. Wikipedia, to achieve broad coverage
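As an illustration of the first method, the sketch below shows one simple way semantic relatedness could be used for soft matching in a retrieval score. It is not the project's actual model: the relatedness table, the threshold, and all function names are hypothetical, and a real system would derive relatedness values from resources such as a wordnet or Wikipedia.

```python
from collections import Counter

# Hypothetical relatedness scores in [0, 1] for a few term pairs.
RELATEDNESS = {
    frozenset(("baking", "baker")): 0.9,
    frozenset(("cake", "pastry")): 0.8,
    frozenset(("computer", "software")): 0.7,
}


def relatedness(a, b):
    """Return 1.0 for identical terms, else the table value (0.0 if unknown)."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def soft_match_score(query_terms, doc_terms, threshold=0.5):
    """Sum relatedness-weighted term frequencies over sufficiently related pairs."""
    doc_counts = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        for term, count in doc_counts.items():
            r = relatedness(q, term)
            if r >= threshold:
                score += r * count
    return score


def rank(query_terms, documents):
    """Rank documents (name -> list of terms) by descending soft-match score."""
    scored = [(name, soft_match_score(query_terms, terms))
              for name, terms in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy example: the essay shares no exact term with either description.
essay = ["baking", "cake"]
descriptions = {
    "programmer": ["software", "computer"],
    "baker": ["baker", "pastry", "bread"],
}
ranking = rank(essay, descriptions)
```

With exact term matching, both descriptions would score zero for this essay. With soft matching, the "baker" description is ranked first because "baking" and "cake" are related to "baker" and "pastry".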
System Architecture

Publications
| Semantically Enhanced Term Frequency |
| Christof Müller and Iryna Gurevych: In: Proceedings of the 32nd European Conference on Information Retrieval Research, p. (to appear), March 2010. |
| Wisdom of Crowds versus Wisdom of Linguists - Measuring the Semantic Relatedness of Words |
| Torsten Zesch and Iryna Gurevych: In: Journal of Natural Language Engineering. to appear, vol. 16, 2010. |
| Approximate Matching for Evaluating Keyphrase Extraction |
| Torsten Zesch and Iryna Gurevych: In: Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing (electronic proceedings), p. 484--489, September 2009. |
| A Study on the Semantic Relatedness of Query and Document Terms in Information Retrieval |
| Christof Müller and Iryna Gurevych: In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, p. 1338--1347, August 2009. |
| Semantic relations in a bilingual corpus of different registers |
| Oliver Čulo and Kerstin Kunz and Torsten Zesch: In: Deutsche Gesellschaft für Sprachwissenschaft (DGfS) Workshop on Corpus, Colligation, Register Variation, March 2009. |
| Extracting Professional Preferences of Users from Natural Language Essays |
| Cigdem Toprak and Christof Müller and Iryna Gurevych: In: Wolfgang Hoeppner: Tagungsband des GSCL Symposiums "Sprachtechnologie und eHumanities", p. 103-110, Abteilung für Informatik und Angewandte Kognitionswissenschaft Fakultät für Ingenieurwissenschaften Universität Duisburg-Essen, February 2009. ISSN 1863-8554. |
| Using Wikipedia and Wiktionary in Domain-Specific Information Retrieval |
| Christof Müller and Iryna Gurevych: In: Carol Peters and Danilo Giampiccolo and Nicola Ferro and Vivien Petras and Julio Gonzalo and Anselmo Peñas and Thomas Deselaers and Thomas Mandl and Gareth Jones and Mikko Kurimo: Evaluating Systems for Multilingual and Multimodal Information Access -- 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers, Lecture Notes in Computer Science, vol. 5706, p. 219-226, Springer-Verlag GmbH, 2009. |
| Das World Wide Web als computerlinguistische Ressource |
| Iryna Gurevych: In: Ralf Klabunde and Kai-Uwe Carstensen and Christian Ebert and Cornelia Endriss and Hagen Langer and Susanne Jekat: Computerlinguistik und Sprachtechnologie - Eine Einführung, p. (to appear), Springer Verlag, January 2009. |
| Putting the "Wisdom-of-Crowds" to Use in NLP: Collaboratively Constructed Semantic Resources on the Web |
| Iryna Gurevych: In: NSF sponsored symposium “Semantic Knowledge Discovery, Organization and Use”, November 2008. http://nlp.cs.nyu.edu/sk-symposium/. |
| Graph-Theoretic Analysis of Collaborative Knowledge Bases in Natural Language Processing |
| Konstantina Garoufi and Torsten Zesch and Iryna Gurevych: In: Proceedings of the Poster Session of the 7th International Semantic Web Conference, October 2008. |
| Representational Interoperability of Linguistic and Collaborative Knowledge Bases |
| Konstantina Garoufi and Torsten Zesch and Iryna Gurevych: In: Proceedings of the KONVENS Workshop on Lexical-Semantic and Ontological Resources -- Maintenance, Representation, and Standards, October 2008. |
| Using Tag Semantic Network for Keyphrase Extraction in Blogs |
| Lizhen Qu and Christof Müller and Iryna Gurevych: In: ACM 17th Conference on Information and Knowledge Management, p. 1381-1382, October 2008. |
| Using Similarity Measures for Context-Aware User Interfaces |
| Melanie Hartmann and Torsten Zesch and Max Mühlhäuser and Iryna Gurevych: In: Proceedings of the 2nd IEEE International Conference on Semantic Computing, p. 190-197, August 2008. |
| Using Wikipedia and Wiktionary in Domain-Specific Information Retrieval |
| Christof Müller and Iryna Gurevych: In: Francesca Borri and Alessandro Nardi and Carol Peters: Working Notes for the CLEF 2008 Workshop, September 2008. |
| Using Wiktionary for Computing Semantic Relatedness |
| Torsten Zesch and Christof Müller and Iryna Gurevych: In: Proceedings of the 23rd AAAI Conference on Artificial Intelligence, p. 861-867, July 2008. |
| Closing the Vocabulary Gap for Computing Text Similarity and Information Retrieval |
| Christof Müller and Iryna Gurevych and Max Mühlhäuser: In: International Journal of Semantic Computing, vol. 2, no. 2, p. 253-272, June 2008. |
| Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary |
| Torsten Zesch and Christof Müller and Iryna Gurevych: In: Proceedings of the 6th International Conference on Language Resources and Evaluation, May 2008. |
| Flexible UIMA Components for Information Retrieval Research |
| Christof Müller and Torsten Zesch and Mark-Christoph Müller and Delphine Bernhard and Kateryna Ignatova and Iryna Gurevych and Max Mühlhäuser: In: Proceedings of the LREC 2008 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP', p. 24-27, May 2008. |
| What to be? - Electronic Career Guidance Based on Semantic Relatedness |
| Iryna Gurevych, Christof Müller, Torsten Zesch: In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, p. 1032--1039, Association for Computational Linguistics, June 2007. http://www.aclweb.org/anthology/P/P07/P07-1130. |
| Cross-lingual Distributional Profiles of Concepts for Measuring Semantic Distance |
| Saif Mohammad and Iryna Gurevych and Graeme Hirst and Torsten Zesch: In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), p. 571--580, June 2007. http://www.aclweb.org/anthology/D/D07/D07-1060. |
| Darmstadt Knowledge Processing Repository Based on UIMA |
| Iryna Gurevych, Max Mühlhäuser, Christof Müller, Jürgen Steimle, Markus Weimer, Torsten Zesch: In: Proceedings of the First Workshop on Unstructured Information Management Architecture at Biannual Conference of the Society for Computational Linguistics and Language Technology, April 2007. |
| Teaching "Unstructured Information Management: Theory and Applications" to Computational Linguistics Students |
| Iryna Gurevych, Christof Müller, Torsten Zesch: In: Proceedings of the First Workshop on Unstructured Information Management Architecture at Biannual Conference of the Society for Computational Linguistics and Language Technology, April 2007. |
| Integrating Semantic Knowledge into Text Similarity and Information Retrieval |
| Christof Müller, Iryna Gurevych, Max Mühlhäuser: In: Proceedings of the First IEEE International Conference on Semantic Computing (ICSC), p. 257-264, 2007. |
| Analysis of the Wikipedia Category Graph for NLP Applications |
| Torsten Zesch and Iryna Gurevych: In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), p. 1--8, April 2007. |
| Comparing Wikipedia and German Wordnet by Evaluating Semantic Relatedness on Multiple Datasets |
| Torsten Zesch and Iryna Gurevych and Max Mühlhäuser: In: Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), p. 205--208, April 2007. |
| Analyzing and Accessing Wikipedia as a Lexical Semantic Resource |
| Torsten Zesch and Iryna Gurevych and Max Mühlhäuser: In: Data Structures for Linguistic Resources and Applications, p. 197--205, Gunter Narr, Tübingen, April 2007. |
| Automatically creating datasets for measures of semantic relatedness |
| Torsten Zesch and Iryna Gurevych: In: COLING/ACL 2006 Workshop on Linguistic Distances, p. 16--24, July 2006. |
| Exploring the Potential of Semantic Relatedness in Information Retrieval |
| Christof Müller, Iryna Gurevych: In: LWA 2006 Lernen - Wissensentdeckung - Adaptivität, 9.-11.10.2006 in Hildesheim, vol. Hildesheimer Informatikberichte, p. 126-131, Universität Hildesheim, October 2006. |
Software
- Darmstadt Knowledge Processing Repository: UIMA components for NLP, IR, and semantic relatedness measures.
- Wikipedia API & Wiktionary API: Programmatic access to locally stored Wikipedia and Wiktionary data.
- Dextract: Software for semantic relatedness experiments.
Data
Teaching
In 2006, the SIR project team offered a seminar on Unstructured Information Management at the University of Tübingen.
Partners
The Division of Computational Linguistics at the University of Tübingen is a co-applicant of the SIR project. Its research focuses on the further development of the GermaNet ontology using the BERUFEnet corpus.
In cooperation with the German Federal Agency for Employment (Bundesagentur für Arbeit), we employ semantic information retrieval algorithms to realize electronic career guidance. Using a natural language essay of the person seeking advice, relevant professions are found based on their natural language descriptions.
Funding
This project is funded by the Deutsche Forschungsgemeinschaft (German Research Foundation).
People
- Dr. Iryna Gurevych, Principal Investigator
- Prof. Dr. Max Mühlhäuser, Principal Investigator
- Christof Müller, Project Coordinator
- Torsten Zesch, Doctoral Researcher








