Corpus Managers and their effective implementation

Pavel Rychlý

Abstract:

The thesis deals with corpus managers -- software tools for text processing. A corpus is understood here as a huge collection of texts in electronic form. It serves as a source of empirical language data, i.e. words, their meanings and the contexts in which they occur. Corpora can be employed in many fields of linguistics (morphology, syntax, semantics, stylistics, sociolinguistics etc.), and corpus managers are the primary tools for exploring them.

In this work we describe and explain which services a corpus manager should and can offer. We describe the individual features from the users' viewpoint, together with the respective implementation problems. For the key operations of a corpus manager we present algorithms and data structures that guarantee fast performance with minimal requirements on main memory and disk space.

Our results are already being used to build a new, faster and more efficient corpus manager.