
This Apache-licensed project provides a software solution for automatic text expansion using contextualized distributional similarity. Main contributors are
Please read the wiki of the project at
https://sourceforge.net/p/jobimtext/wiki/Home/ for a detailed project description.
The project is hosted on SourceForge at
https://sourceforge.net/p/jobimtext and is available under the Apache 2.0 License. Please contact us if you plan to contribute to or use the software.
A demonstrator is available for download from the SourceForge repository.
Distributional Semantics is about building a fully unsupervised framework for computational semantics. It addresses traditional computational semantics problems such as lexical ambiguity and variability, word sense disambiguation and lexical substitutability, paraphrasing, frame induction and parsing, and textual entailment. Our methodology avoids rule-based systems and hand-labeled data for supervised learning. The goal is to build a semantic analyzer that can adapt itself to new domains and languages after unsupervised learning from large corpora of raw text. At the same time, the output of distributional semantics is a contextual thesaurus that represents sense clusters and the properties characterizing each cluster. Finally, a major goal of the Distributional Semantics framework is to map induced linguistic knowledge to existing knowledge bases, such as semantic web data and databases, allowing entity linking and disambiguation with respect to pre-conceptualized domain models and enabling a new range of applications.
Distributional semantics is based on well-established linguistic theories and on a radical machine learning approach. It has its roots in de Saussure’s structural linguistics hypothesis and in the semiotic principles distinguishing expressions from meaning and reference. Structural semantics claims that meaning can be fully defined by semantic oppositions and relations between words, in particular syntagmatic and paradigmatic relations. Paradigmatic relations are established in absentia and represent substitutability between words that preserves meaning, whereas syntagmatic relations are mostly syntactic relations that can be identified by a syntactic parser. The distributional hypothesis, formulated by Zellig S. Harris, claims that paradigmatic relations can be detected by mining distributional properties of syntagmatic relations, allowing us to acquire paradigmatic relations in a fully unsupervised way.
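As a rough illustration of the distributional hypothesis, the Python sketch below scores two words by the number of syntagmatic context features they share. The feature labels (for example `subj_of:sleep`), the toy observations, and the function name `shared_feature_similarity` are assumptions made for this example; it is not the JoBimText implementation, which is described in the project wiki.

```python
from collections import defaultdict

# Toy (word, context feature) observations, standing in for the
# syntagmatic relations a syntactic parser might extract from text.
# All words and feature labels here are invented for illustration.
observations = [
    ("cat", "subj_of:sleep"), ("cat", "obj_of:feed"), ("cat", "mod:furry"),
    ("dog", "subj_of:sleep"), ("dog", "obj_of:feed"), ("dog", "subj_of:bark"),
    ("car", "obj_of:drive"), ("car", "mod:fast"),
    ("truck", "obj_of:drive"),
]


def shared_feature_similarity(pairs):
    """Return {(word_a, word_b): number of shared context features}."""
    features = defaultdict(set)
    for word, feature in pairs:
        features[word].add(feature)

    similarities = {}
    words = sorted(features)
    for i, word_a in enumerate(words):
        for word_b in words[i + 1:]:
            overlap = len(features[word_a] & features[word_b])
            if overlap > 0:
                # Words that occur in the same contexts are candidates
                # for a paradigmatic (substitutability) relation.
                similarities[(word_a, word_b)] = overlap
    return similarities


similarities = shared_feature_similarity(observations)
for (word_a, word_b), score in sorted(similarities.items()):
    print(word_a, word_b, score)
```

In this toy data, "cat" and "dog" share more contexts than any other pair, so they come out as the strongest substitution candidates. A raw overlap count is the simplest possible score; a real system would typically weight features by their significance before comparing words.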
On the other hand, unsupervised learning and complex systems are part of the Distributional Semantics framework. We target algorithms that can be parallelized and executed on large computer clusters for scalability. To this end, we build local models of semantic relations rather than global models, which allows the computation to be parallelized and executed using search engine technology such as inverted indices and MapReduce.
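The following single-process Python sketch hints at why local models parallelize well: an inverted index maps each context feature to the words it occurs with, a map step handles every posting list independently, and a reduce step sums the emitted word pairs. The function names and toy data are hypothetical, and plain Python loops stand in for a real MapReduce cluster; this is an assumption-laden illustration, not the project's actual pipeline.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented (word, context feature) observations for illustration only.
observations = [
    ("cat", "subj_of:sleep"), ("cat", "obj_of:feed"),
    ("dog", "subj_of:sleep"), ("dog", "obj_of:feed"),
    ("car", "obj_of:drive"), ("truck", "obj_of:drive"),
]


def build_inverted_index(pairs):
    """Inverted index: context feature -> set of words seen with it."""
    index = defaultdict(set)
    for word, feature in pairs:
        index[feature].add(word)
    return index


def map_feature_to_pairs(words):
    """Map step: emit every word pair that shares one feature.

    Each posting list is processed on its own, so this step could be
    distributed across machines without any shared state.
    """
    return list(combinations(sorted(words), 2))


def reduce_pair_counts(emitted_lists):
    """Reduce step: sum how often each word pair was emitted."""
    counts = Counter()
    for emitted in emitted_lists:
        counts.update(emitted)
    return counts


index = build_inverted_index(observations)
emitted_lists = [map_feature_to_pairs(words) for words in index.values()]
pair_counts = reduce_pair_counts(emitted_lists)
print(pair_counts.most_common())
```

Because the map step only ever looks at one feature's posting list, no global model of all words has to be held in memory at once, which is the property that makes the computation easy to spread over a cluster.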
Some relevant references for this project, ordered by topic.
de Saussure, F. (1916). Cours de linguistique générale. Librairie Payot & Cie, Paris.
Harris, Z. (1954). Distributional Structure. Word 10 (2/3)
Miller, G. A., Charles, W. G. (1991): Contextual Correlates of Semantic Similarity. Language and Cognitive Processes 6 (1): 1-28
Biemann, C. (2011): Structure Discovery in Natural Language. In G. Hirst, E. Hovy and M. Johnson (Series Eds.): Theory and Applications of Natural Language Processing, Springer Heidelberg Dordrecht London New York
Gliozzo, A., Strapparava, C. (2009): Semantic Domains in Computational Linguistics. Springer. ISBN: 978-3-540-68156-4
Lin, D. (1998). Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, volume 2 of ACL ’98, pages 768–774, Stroudsburg, PA, USA. Association for Computational Linguistics.
Bär, D., Biemann, C., Gurevych, I., and Zesch, T. (2012). UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of the 6th International Workshop on Semantic Evaluation, pages 435–440.
Biemann, C. (2010): Co-occurrence Cluster Features for Lexical Substitutions in Context. Proceedings of the 5th Workshop on TextGraphs in conjunction with ACL 2010, Uppsala, Sweden
Biemann, C. (2006): Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Proceedings of the HLT-NAACL-06 Workshop on Textgraphs-06, New York, USA
Widdows, D. and Dorow, B. (2002): A graph model for unsupervised lexical acquisition. In Proceedings of the 19th international conference on Computational linguistics - Volume 1 (COLING '02), Vol. 1.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13 (2): 260–269. doi:10.1109/TIT.1967.1054010
Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika 57 (1): 97–109. doi:10.1093/biomet/57.1.97
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.