Syntactic and Semantic Analysis and Knowledge Representation

Introduction

Computers are very fast and powerful machines; however, they process texts written by humans in an entirely mindless way, treating them merely as sequences of meaningless symbols. The main goal of language analysis is to obtain a suitable representation of text structure and thus make it possible to process texts based on their content. This is necessary in various applications, such as spelling and grammar checkers, intelligent search engines, text summarization, or dialogue systems.

Natural language text can be analyzed on various levels, depending on the actual application setting. With regard to automatic processing of language data, the following analysis levels can be distinguished:

Morphological Analysis

Morphological analysis gives a basic insight into natural language by studying how to distinguish and generate grammatical forms of words arising through inflection (i.e. declension and conjugation). This involves considering a set of tags describing the grammatical categories of the word form concerned, most notably its base form (lemma) and paradigm. Automatic analysis of word forms in free text can be used, for instance, in grammar checker development, and can aid corpus tagging or semi-automatic dictionary compilation.
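
As a rough illustration of what a morphological analyzer returns, the sketch below maps a word form to its possible lemmas and grammatical tags using a tiny hand-written lexicon. The English entries, the tag notation, and the names `Analysis`, `LEXICON`, `analyze`, and `lemmas` are all invented for this example; they do not reflect the tag set or interface of any real analyzer.

```python
from typing import List, NamedTuple, Set


class Analysis(NamedTuple):
    lemma: str  # base form of the word
    tags: str   # grammatical categories, in an ad hoc notation


# Hypothetical toy lexicon: word form -> every analysis it admits.
LEXICON = {
    "mice": [Analysis("mouse", "noun,plural")],
    "saw": [Analysis("see", "verb,past"), Analysis("saw", "noun,singular")],
    "went": [Analysis("go", "verb,past")],
}


def analyze(word_form: str) -> List[Analysis]:
    """Return all known analyses of a word form (empty list if unknown)."""
    return LEXICON.get(word_form.lower(), [])


def lemmas(word_form: str) -> Set[str]:
    """Return the set of candidate lemmas for a word form."""
    return {a.lemma for a in analyze(word_form)}


if __name__ == "__main__":
    for form in ("mice", "saw", "xyz"):
        print(form, analyze(form))
```

The form "saw" receives two analyses, which shows why morphological analysis alone is often ambiguous and why later processing levels are needed to pick the right reading.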

The NLP laboratory has produced a general morphological analyzer for Czech, ajka, which covers a vocabulary of over 6 million word forms. It has further served as a base for a similar analyzer for Slovak, the fispell grammar checker, the czaccent converter of ASCII text to text with diacritics, and an interactive interface for the Jabber instant messaging protocol.

Syntactic Analysis

The goal of syntactic analysis is to determine whether the input text string is a sentence in the given (natural) language. If it is, the result of the analysis contains a description of the syntactic structure of the sentence, for example in the form of a derivation tree. Such formalizations are aimed at making computers "understand" relationships between words (and indirectly between the corresponding people, things, and actions). Syntactic analysis can be utilized, for instance, when developing a punctuation corrector or dialogue systems with a natural language interface, or as a building block in a machine translation system. Czech is a language exhibiting rich inflection and free word order and thus requires more grammar rules than most other languages. Accordingly, it is one of the languages that are very hard to analyze.
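
The two outcomes described above (rejecting a string, or returning a derivation tree) can be sketched with a small CKY recognizer over a toy grammar. This is only an illustration under stated assumptions: the grammar, the English vocabulary, and the `parse` function are hypothetical, the grammar is assumed to be in Chomsky normal form, and only one tree per category is kept. It is not how any particular analyzer is implemented.

```python
# Hypothetical toy grammar in Chomsky normal form.
# BINARY maps (left child, right child) -> parent category.
BINARY = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
# LEXICAL maps a word to its (single) category.
LEXICAL = {"the": "Det", "a": "Det", "dog": "N", "cat": "N", "sees": "V"}


def parse(words):
    """CKY parsing: return one derivation tree rooted in S, or None."""
    n = len(words)
    if n == 0:
        return None
    # chart[(i, j)] maps a category to one tree spanning words[i:j].
    chart = {}
    for i, word in enumerate(words):
        cat = LEXICAL.get(word)
        chart[(i, i + 1)] = {cat: (cat, word)} if cat else {}
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # every split point of the span
                for lcat, ltree in chart[(i, k)].items():
                    for rcat, rtree in chart[(k, j)].items():
                        parent = BINARY.get((lcat, rcat))
                        if parent and parent not in cell:
                            cell[parent] = (parent, ltree, rtree)
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")


if __name__ == "__main__":
    print(parse("the dog sees a cat".split()))
    print(parse("dog the sees".split()))
```

A fixed-order grammar like this one suits English; a free word order language would need many more rules to cover the permutations, which is one way to see why Czech grammars grow large.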

The NLP laboratory is developing the synt syntactic analyzer. According to tests performed on large corpora, synt reaches a recall of 92% and a precision of 84%. For educational purposes, we also have a simple syntactic analyzer, Zuzana, which is capable of visualizing several types of derivation trees.
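
For readers unfamiliar with the two measures, the sketch below computes precision and recall under the usual set-based definitions, here over labelled spans of a parse. How the reported figures were actually counted is not stated in this text, so the span representation, the sample data, and the function names are assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted, recall = correct / gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Invented example: (category, start, end) spans for one sentence.
    gold = {("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5), ("S", 0, 5)}
    predicted = {("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5),
                 ("NP", 2, 4), ("PP", 3, 5)}
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
```

In this invented example three of the five predicted spans are correct (precision 0.60) and three of the four gold spans are found (recall 0.75). Applying `f1` to a precision of 0.84 and a recall of 0.92 gives roughly 0.88; that value is derived here and is not a figure reported by the laboratory.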

Semantic Analysis

Semantic and pragmatic analysis make up the most complex phase of language processing, as they build on the results of all the above-mentioned disciplines. Based on knowledge about the structure of words and sentences, the meaning of words, phrases, sentences and texts is determined, and subsequently also their purpose and consequences. From the computational point of view, no adequate general solutions have been proposed for this area. There are many open theoretical problems, and in practice, great problems are caused by errors on lower processing levels. The ultimate touchstone on this level is machine translation, which has not yet been implemented for Czech with satisfactory results.

One of the long-term projects of the NLP laboratory is the use of Transparent Intensional Logic (TIL) as a semantic representation of knowledge and subsequently as a transfer language in automatic machine translation. At the current stage, it is realistic to process knowledge in a simpler form: considerably less complex tasks have been addressed, such as machine translation for a restricted domain (e.g. official documents and weather reports), or semi-automatic machine translation between closely related languages. The resources exploited in these applications are corpora, semantic nets, and electronic dictionaries.

Knowledge Representation

Not all information needed for processing texts is encoded in the structure of language. In order to understand the content of texts properly, it is often necessary to possess certain knowledge about the world, either general (e.g. that birds can fly, or that a key is required to open a locked door), or very specific expert knowledge that the reader is expected to be familiar with (e.g. in a mathematical journal, that an even number greater than 2 cannot be prime). It seems that the greatest challenge in this field is not to gather the knowledge, but to represent and structure it in a suitable way, to search it efficiently, and to use it to infer further knowledge. These goals essentially correspond to the task of constructing artificial intelligence, which is without doubt one of the biggest and most interesting topics of modern science.
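
To make "using knowledge to infer further knowledge" concrete, here is a minimal forward-chaining sketch built on the two examples from the paragraph above. The representation (facts as predicate/entity pairs, rules with a list of premises and one conclusion) and all names are assumptions chosen for brevity. Real knowledge bases need far richer formalisms, for instance to handle exceptions such as birds that cannot fly.

```python
# A rule says: if every premise predicate holds for an entity,
# then the conclusion predicate holds for that entity as well.
RULES = [
    (("bird",), "can_fly"),
    (("even", "greater_than_2"), "not_prime"),
]


def infer(facts):
    """Apply RULES repeatedly until no new fact can be derived."""
    known = set(facts)  # facts are (predicate, entity) pairs
    changed = True
    while changed:
        changed = False
        entities = {entity for _, entity in known}
        for premises, conclusion in RULES:
            for entity in entities:
                holds = all((p, entity) in known for p in premises)
                if holds and (conclusion, entity) not in known:
                    known.add((conclusion, entity))
                    changed = True
    return known


if __name__ == "__main__":
    facts = {("bird", "tweety"), ("even", "10"), ("greater_than_2", "10")}
    for fact in sorted(infer(facts) - facts):
        print(fact)
```

Running the example derives that the hypothetical entity `tweety` can fly and that `10` is not prime, neither of which was stated explicitly in the input facts.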

In the field of representation of meaning and knowledge, we should mention the notable contribution of NLP laboratory members to the EuroWordNet and Balkanet projects, which were aimed at building a multilingual WordNet-like semantic net. Further, the laboratory has developed the DEB (Dictionary Editor and Browser) platform, which makes it possible to efficiently browse and search the WordNet semantic net and also to edit it in a comfortable way. Given the success of this platform, its large-scale use within the WordNet Grid project has been considered.
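
The kind of query a WordNet-like semantic net supports can be sketched with a toy hypernym ("is a kind of") hierarchy. The data and function names below are invented for illustration, and the sketch is deliberately simplified: a real WordNet links sets of synonyms (synsets) rather than single words and allows several hypernyms per node. It does not show the DEB platform's interface.

```python
# Hypothetical toy net: each word points to its single hypernym.
HYPERNYM = {
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return the hypernyms of a word, from most to least specific."""
    chain = []
    seen = {word}
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word in seen:  # guard against accidental cycles in the data
            break
        seen.add(word)
        chain.append(word)
    return chain


def lowest_common_hypernym(a, b):
    """Return the most specific node shared by both words, or None."""
    ancestors_b = set([b] + hypernym_chain(b))
    for node in [a] + hypernym_chain(a):
        if node in ancestors_b:
            return node
    return None


if __name__ == "__main__":
    print(hypernym_chain("dog"))
    print(lowest_common_hypernym("dog", "cat"))
```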