Unstructured Information Management (Project, SS 2010)

This page is outdated. Please visit the page for winter 2010/11.

Contents

While a significant amount of knowledge is already available in structured form, in databases or as part of the semantic web, most knowledge is still recorded in unstructured form as natural language artifacts such as text documents and audio or video recordings. The Unstructured Information Management Architecture (UIMA) framework, originally developed by IBM, offers a platform for imposing structure on unstructured data and thus facilitates the extraction of knowledge from unstructured sources.

This project focuses on knowledge in the domain of software development. Documentation helps developers understand a particular piece of code, but it is often badly maintained or missing. Fortunately, a lot of knowledge is encoded in the names of methods and variables and in the way methods call each other. This project will analyse software artifacts such as WSDL files, source code and documentation in order to find implementations of specific functionality using natural language queries.

  • Extract text from software artifacts
  • Index the extracted text and search on it
  • Define a set of search queries and manually compile a list of relevant results for each, to use as a basis for evaluation
  • Use various techniques, from simple dictionaries to semantic resources, to improve results
  • Visualize results
  • Evaluate performance
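
The first two steps above (extracting text from software artifacts, then indexing and searching it) can be sketched roughly as follows. This is a minimal illustration in plain Java, not the project's actual implementation and not based on UIMA or DKPro: the class `IdentifierIndexSketch`, its methods, and the sample artifact names are all hypothetical. It assumes that identifiers follow camelCase or underscore conventions and that a query matches an artifact only if every query term occurs in it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not part of UIMA or DKPro.
class IdentifierIndexSketch {

    // term -> names of the artifacts whose identifiers contain that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits an identifier such as "parseXMLDocument" or "max_retry_count"
    // into lowercase natural-language terms.
    static List<String> splitIdentifier(String identifier) {
        String[] parts = identifier.split(
                "[^A-Za-z0-9]+"                  // underscores, dots, other separators
                + "|(?<=[a-z0-9])(?=[A-Z])"      // lower-to-upper transition: get|User
                + "|(?<=[A-Z])(?=[A-Z][a-z])");  // end of an acronym: XML|Document
        List<String> terms = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Indexes one identifier under the name of the artifact it was found in.
    void add(String artifactName, String identifier) {
        for (String term : splitIdentifier(identifier)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(artifactName);
        }
    }

    // Returns the artifacts that contain every term of the query (boolean AND).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> hits = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        IdentifierIndexSketch index = new IdentifierIndexSketch();
        index.add("UserService.getUserName", "getUserName");
        index.add("UserService.deleteUser", "deleteUser");
        index.add("XmlTools.parseXMLDocument", "parseXMLDocument");
        System.out.println(index.search("user name"));
        System.out.println(index.search("xml document"));
    }
}
```

A real system would add ranking, stemming and dictionary or semantic expansion of query terms; the strict AND matching here is only the simplest baseline to compare such improvements against.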

The Darmstadt Knowledge Processing Software Repository (DKPro) provided by UKP offers a set of ready-to-use Java libraries for analysis and indexing. The project will be implemented on top of the Apache Unstructured Information Management Architecture (UIMA) framework.

Registration

If you plan to participate in this course, please register. (Registration closed)

Time and Location

Introductory session: Thursday, April 15, 2010, 16:15, room S1 03 / 300a (Altes Hauptgebäude). This will be partly a hands-on session using the PCs available in the room (no private laptops).

Regular status meetings are (tentatively) planned for Thursdays from 16:45 to 18:00. Actual times may vary depending on the number of participants.

Material

The course management system is used as the primary communication platform for the project and also contains any related material.

Objectives

  • Understand and apply methods of information retrieval (IR)
  • Apply natural language processing to software artifacts in order to search for implementations of specific functionality
  • Comparatively evaluate different approaches
  • Use UIMA to implement complex natural language processing systems
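
As an illustration of how different approaches might be compared, the following sketch computes precision, recall and F1 for a set of retrieved results against a manually compiled set of relevant results. These are standard information retrieval measures, but the course page does not state which measures the project uses, so their choice here is an assumption. The class `RetrievalEvaluationSketch` and the sample result names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of set-based retrieval evaluation.
class RetrievalEvaluationSketch {

    // Number of retrieved results that are also in the relevant set.
    static int truePositives(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Fraction of retrieved results that are relevant; 0 if nothing was retrieved.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / retrieved.size();
    }

    // Fraction of relevant results that were retrieved; 0 if nothing is relevant.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / relevant.size();
    }

    // Harmonic mean of precision and recall; 0 if both are 0.
    static double f1(Set<String> retrieved, Set<String> relevant) {
        double p = precision(retrieved, relevant);
        double r = recall(retrieved, relevant);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Manually compiled relevant results for one query (made-up names).
        Set<String> relevant = new HashSet<>(Arrays.asList("a", "b", "e"));
        // Results returned by two hypothetical approaches for the same query.
        Set<String> baseline = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> expanded = new HashSet<>(Arrays.asList("a", "b", "e", "f"));

        System.out.printf("baseline: P=%.2f R=%.2f F1=%.2f%n",
                precision(baseline, relevant), recall(baseline, relevant), f1(baseline, relevant));
        System.out.printf("expanded: P=%.2f R=%.2f F1=%.2f%n",
                precision(expanded, relevant), recall(expanded, relevant), f1(expanded, relevant));
    }
}
```

In practice the scores would be averaged over all evaluation queries, and ranked result lists would call for rank-aware measures instead of these set-based ones.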

Prerequisites

  • Knowledge of Java programming
  • Principles of algorithms and data structures

Lecturers

Literature

  • B. Caprile and P. Tonella. Nomen Est Omen: Analyzing the Language of Function Identifiers. Proceedings of the Sixth Working Conference on Reverse Engineering, p. 112. IEEE Computer Society, Washington, DC, USA (1999)
  • D. Ferrucci and A. Lally. Accelerating corporate research in the development, application and deployment of human language technologies. SEALTS '03: Proceedings of the HLT-NAACL 2003 workshop on Software engineering and architecture of language technology systems, pp. 67-74. Association for Computational Linguistics, Morristown, NJ, USA (2003)
  • Apache UIMA Homepage (http://incubator.apache.org/uima/)
  • Darmstadt Knowledge Processing Software Repository (http://www.ukp.tu-darmstadt.de/projects/dkpro/)