Czech corpora and ILP

The problem of assigning (tagging) the correct grammatical category to each word is nontrivial and very time-consuming: the Czech language has about 5 000 000 word forms. The DESAM corpus, which has been manually disambiguated, contains (as of December 1997) more than 1 000 000 word forms (about 130 000 different word forms and 1665 different tags, i.e. grammatical categories).
Our approach exploits the tagged DESAM corpus. The goal of the project is to assist in the disambiguation process; we do not aim at fully automatic disambiguation. Using ILP, we want to resolve only a part of the ambiguities, although we hope it will be the majority.
Given an annotated corpus, our task is to find rules for

end of sentence (when a dot is a full-stop)
subject part and predicate part of a sentence
noun phrases
disambiguation on the level of a lemma
disambiguation on the level of morphological categories

We will focus on a subpart of the task to show whether the methods developed within ILP are applicable.
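
To make the learning setting more concrete, the following is a minimal sketch of how positive and negative examples for the first task (deciding when a dot is a full stop) might be extracted from an annotated corpus before being passed to an ILP learner. The token representation, the tag names, the `ends_sentence` flag and the function `dot_examples` are assumptions made for illustration only; they are not the DESAM annotation scheme or the method used in the project.

```python
from typing import List, Optional, Tuple

# Hypothetical token representation: (word form, tag, ends_sentence).
# The tags and the flag are illustrative, not the DESAM annotation scheme.
Token = Tuple[str, str, bool]
Context = Tuple[Optional[str], ...]


def dot_examples(tokens: List[Token], window: int = 1):
    """Split every '.' token into positive (full stop) and negative examples.

    Each example is the tuple of neighbouring word forms, `window` to the
    left and `window` to the right, padded with None at corpus boundaries.
    """
    positive: List[Context] = []
    negative: List[Context] = []
    for i, (word, _tag, ends_sentence) in enumerate(tokens):
        if word != ".":
            continue
        left = [tokens[j][0] if j >= 0 else None
                for j in range(i - window, i)]
        right = [tokens[j][0] if j < len(tokens) else None
                 for j in range(i + 1, i + 1 + window)]
        context = tuple(left + right)
        (positive if ends_sentence else negative).append(context)
    return positive, negative


# "Prof. Novák spí." : the first dot marks an abbreviation, the second a full stop.
sample = [
    ("Prof", "N", False), (".", "PUNCT", False), ("Novák", "N", False),
    ("spí", "V", False), (".", "PUNCT", True),
]
pos, neg = dot_examples(sample)
```

An ILP system would then search for rules that cover the positive contexts and exclude the negative ones; in practice the contexts would also carry tags or lemmas rather than word forms alone.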

Lubos Popelinsky
Fri Jun 5 11:42:41 MET DST 1998