Statistical Processing of Very Large Texts

Introduction

Natural language (such as Czech or English) is without doubt one of the things we use most in our lives. Nevertheless, we know rather little about it. For centuries, linguists have tried to capture the regularities of language with rules, patterns, and grammars; despite all this effort, we are still far from a full understanding of language.

Modern computer technologies make it possible to study language in novel ways that complement traditional linguistic methods in interesting respects. The main idea is that the computer "learns" language in a way analogous to a small child: by looking for regularities in the utterances of the people around it. A computer can exploit very large amounts of text, and the search for regularities can be implemented, for instance, by machine learning algorithms. The aim is for the computer, given a large amount of data, to infer the meaning and usage of most words and expressions on its own, without a human hard-coding them.
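
To make the idea concrete, the following Python sketch counts which words occur near each other in a tiny invented corpus and then compares two words by the similarity of their contexts. The toy sentences, the window size, and all names in the code are assumptions made purely for illustration; they do not describe any particular system.

    # A minimal sketch of the distributional idea: a word's usage is
    # approximated by the contexts it appears in. The toy corpus and window
    # size are invented for illustration only.
    from collections import Counter, defaultdict
    from math import sqrt

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "the cat chased the dog",
    ]
    WINDOW = 2  # how many neighbouring tokens count as "context"

    # For every word, count which other words occur near it.
    contexts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
                if i != j:
                    contexts[word][tokens[j]] += 1

    def cosine(a, b):
        # Cosine similarity of two sparse co-occurrence vectors.
        dot = sum(a[k] * b[k] for k in a if k in b)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Words used in similar contexts ("cat" and "dog") come out as more
    # similar than words used differently ("cat" and "on").
    print(cosine(contexts["cat"], contexts["dog"]))
    print(cosine(contexts["cat"], contexts["on"]))

Real systems replace the toy corpus with billions of tokens and the raw counts with weighted or learned vector representations, but the underlying principle, inferring usage from observed contexts, is the same.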

Statistical methods, and the creation of training data for them, account for a substantial share of today's state-of-the-art research in computational linguistics.

Corpora

A corpus is a collection of text data in electronic form. As a significant source of linguistic data, corpora make it possible to investigate many frequency-related phenomena in language, and they are nowadays indispensable in NLP. In addition to corpora of general texts, corpora for specific purposes are also built, such as annotated, domain-specific, spoken, or error corpora.
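
As a minimal example of such frequency-based investigation, the sketch below builds a word frequency list from a corpus stored as a plain UTF-8 text file. The file name corpus.txt and the simple regular-expression tokenization are assumptions for the example only; real corpora are usually tokenized and annotated with dedicated tools.

    # A small sketch: a frequency list, one of the most basic corpus-based
    # analyses. Assumes a hypothetical plain-text corpus file "corpus.txt".
    import re
    from collections import Counter

    def frequency_list(path, top_n=20):
        counts = Counter()
        with open(path, encoding="utf-8") as corpus:
            for line in corpus:  # stream the file, do not load it all at once
                counts.update(re.findall(r"\w+", line.lower()))
        return counts.most_common(top_n)

    if __name__ == "__main__":
        for word, freq in frequency_list("corpus.txt"):
            print(f"{freq:10d}  {word}")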

Corpora are used for the investigation and development of natural language grammars. They are also helpful when developing a grammar checker, selecting entries for a dictionary, or serving as a data source for automatic text categorization based on machine learning. Parallel corpora consist of the same texts in several languages; they are used especially in word sense disambiguation and machine translation.
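
One of these uses, automatic text categorization, can be sketched with a small naive Bayes classifier estimated from labelled corpus examples. The categories and training sentences below are invented for illustration and are not real corpus data; a practical classifier would be trained on far larger annotated collections.

    # A toy sketch of corpus-driven text categorization: naive Bayes with
    # add-one smoothing, estimated from a handful of invented examples.
    from collections import Counter, defaultdict
    from math import log

    training = [
        ("sports", "the team won the match after extra time"),
        ("sports", "the striker scored two goals in the final"),
        ("finance", "the bank raised interest rates again"),
        ("finance", "shares fell after the quarterly report"),
    ]

    word_counts = defaultdict(Counter)  # per-category word frequencies
    doc_counts = Counter()              # per-category document counts
    vocab = set()
    for label, text in training:
        doc_counts[label] += 1
        for token in text.split():
            word_counts[label][token] += 1
            vocab.add(token)

    def classify(text):
        # Pick the category with the highest log posterior probability.
        scores = {}
        for label in doc_counts:
            total = sum(word_counts[label].values())
            score = log(doc_counts[label] / sum(doc_counts.values()))
            for token in text.split():
                # Add-one smoothing so unseen words do not zero out the score.
                score += log((word_counts[label][token] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    print(classify("the goalkeeper saved the match"))  # expected: sports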

The NLP Laboratory has produced a complete set of tools for creating and managing corpora, the Corpus Builder. Corpora stored in this system can contain billions of word tokens. Also under development is the CPA (Corpus Pattern Analysis) method, which aims to obtain information about the various meanings of individual words by semi-automatically matching patterns against corpus data. This method offers an efficient procedure for compiling dictionaries based on real data.
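
The following sketch illustrates only the general idea behind such pattern matching, not the Corpus Builder or the actual CPA implementation: shallow, hand-written patterns are matched against concordance lines to separate two senses of a verb. All patterns and example lines are invented for the demonstration.

    # An illustrative sketch only: matching shallow patterns against
    # concordance lines to separate word senses. Patterns and lines invented.
    import re

    # Hypothetical shallow patterns for two senses of the verb "fire".
    patterns = {
        "fire = dismiss": re.compile(r"\bfired?\b .*\b(employee|worker|manager)s?\b"),
        "fire = shoot":   re.compile(r"\bfired?\b .*\b(gun|shot|missile)s?\b"),
    }

    concordance = [
        "the company fired three managers last week",
        "the soldier fired a missile at the target",
        "she was fired from her job as a worker",
    ]

    for line in concordance:
        matched = [sense for sense, pattern in patterns.items() if pattern.search(line)]
        print(line, "->", matched or ["unmatched"])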


Information about the NLP Laboratory

Contacts:

  • Pavel Rychlý: pary(atsign)fi(dot)muni(dot)cz
  • Aleš Horák: hales(atsign)fi(dot)muni(dot)cz
  • Karel Pala: pala(atsign)fi(dot)muni(dot)cz

Laboratory Members

Further Information: