Finding Semantically Related Words in Large Corpora

by Pavel Smrž, Pavel Rychlý. Slightly modified version of the paper published in the Proceedings of TSD 2001, Pilsen, Czech Republic. June 2001, 9 pages.

FIMU-RS-2001-02. Available as PostScript, PDF.


The paper deals with the linguistic problem of fully automatic grouping of semantically related words. We discuss measures of semantic relatedness between basic word forms and describe the treatment of collocations. Next, we present a procedure for hierarchical clustering of a very large number of semantically related words and give examples of the resulting partitioning of the data in the form of a dendrogram. Finally, we show a form of output presentation that facilitates inspection of the resulting word clusters.

DESAM - Approaches to Desambiguation

by Karel Pala, Pavel Rychlý, Pavel Smrž, December 1997, 12 pages.

FIMU-RS-97-09. Available as PostScript, PDF.


This paper deals with DESAM, a Czech disambiguated corpus. It is a tagged corpus that was disambiguated manually and can be used in various applications. We discuss the structure of the corpus, the tools used to manage it, its linguistic applications, and the possible use of machine learning techniques that rely on the disambiguated data. Possible ways of developing procedures for fully automatic disambiguation are also considered.

