ILP team at FI MU
The ILP group at FI MU currently focuses on two application areas: natural
language processing and knowledge discovery in geographic data.
Natural language processing
DESAM, a corpus of Czech newspaper texts, was built at the Natural Language
Processing Laboratory, Faculty of Informatics, Masaryk University. It
contains more than 1 000 000 word positions, about 130 000 different word
forms, about 65 000 of them occurring more than once, and 1665 different
A semi-automatic disambiguator of Czech noun groups, DIS, was developed.
The ratio of unambiguous word forms increased from 50.6% to 58.5% after
processing by DIS. The number of tags per ambiguous token decreased from
3.3 to 2.7.
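The two figures reported for DIS can be computed as in the following minimal sketch. It assumes each token is represented by the set of tags still admissible for it. The function name, the tag labels and the toy data are hypothetical and are not taken from DIS or DESAM.

```python
from typing import List, Set, Tuple


def ambiguity_stats(tokens: List[Set[str]]) -> Tuple[float, float]:
    """Return (share of unambiguous tokens, mean tags per ambiguous token)."""
    if not tokens:
        raise ValueError("empty corpus")
    # A token is ambiguous when more than one tag is still possible for it.
    ambiguous = [tags for tags in tokens if len(tags) > 1]
    unambiguous_share = (len(tokens) - len(ambiguous)) / len(tokens)
    mean_tags = (
        sum(len(tags) for tags in ambiguous) / len(ambiguous) if ambiguous else 0.0
    )
    return unambiguous_share, mean_tags


# Hypothetical toy corpus before and after a disambiguation pass.
before = [{"N1", "N4"}, {"V"}, {"A1", "A4", "A5"}, {"N2"}]
after = [{"N1"}, {"V"}, {"A1", "A4"}, {"N2"}]

stats_before = ambiguity_stats(before)
stats_after = ambiguity_stats(after)
```

A successful pass raises the first value and lowers the second, which is the direction of change reported above.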
A method for automatically finding compound verb groups in a Czech
sentence has been introduced. The method yields a definite clause grammar
rule, called a verb rule, that contains information about the components of
the verb group and their tags.
A lemma disambiguator for Czech was developed [12, 13] employing Progol. A
method for disambiguation was introduced that combines ILP and
instance-based learning. The algorithm reached an accuracy greater than 90%,
leaving fewer than 15% of words ambiguous. Lemma disambiguation of unknown
words has also been described. Progol was also tested in tag disambiguation
of Czech nouns. The first results for tag disambiguation reach average accuracy
The GRIND system [3, 4] was implemented. It can learn a sequence of
context-dependent parse actions from a set of syntactically annotated
sentences. In the first step, GRIND constructs a sequence of `deepening
operators'. Then, in the second learning phase, constraints on the
application of these operators are induced by means of ILP: so-called
`forbidding predicates' are learned.
Automatic tagging of compound verb groups
Finding all parts of a compound verb group in a Czech sentence and tagging the
group as a whole is necessary groundwork for any subsequent (semantic)
analysis. From the annotated corpus DESAM, 126 DCG rules were extracted which
cover all frequent verb groups in Czech [17, 18].
Using those rules we are able to recognise compound verb groups in unannotated
Czech texts with an accuracy of 93%.
Part-of-Speech Tagging by Means of Shallow Parsing, ILP and Active Learning
A part-of-speech tagger for Czech is described that employs a shallow
parser for Czech, manually coded rules and inductive logic programming.
The active learning method used reduced the number of training examples
to label and shortened the learning time without decreasing recall or
accuracy. Compared with the previous work, both recall and
accuracy increased and the number of training examples to label decreased.
The method was tested on ambiguities that are frequent in Czech. The
accuracy reached was higher than 96% with recall higher than 95%.
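How active learning can reduce the number of examples to label is illustrated by the sketch below. The text above does not say which selection strategy the tagger uses, so the sketch swaps in plain uncertainty sampling, and a toy one-dimensional threshold learner stands in for the ILP learner. All names and data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Labelled = List[Tuple[int, bool]]


def fit_threshold(labelled: Labelled) -> float:
    """Toy learner: midpoint between the largest negative and smallest positive."""
    largest_negative = max(x for x, y in labelled if not y)
    smallest_positive = min(x for x, y in labelled if y)
    return (largest_negative + smallest_positive) / 2


def active_learn(
    pool: Sequence[int],
    oracle: Callable[[int], bool],
    seed: Labelled,
    rounds: int,
) -> Tuple[float, int]:
    """Run pool-based active learning; return (threshold, number of labels used)."""
    labelled = list(seed)
    seen = {x for x, _ in seed}
    unlabelled = [x for x in pool if x not in seen]
    for _ in range(rounds):
        if not unlabelled:
            break
        threshold = fit_threshold(labelled)
        # Uncertainty sampling: ask for the label of the example that lies
        # closest to the current decision boundary.
        query = min(unlabelled, key=lambda x: abs(x - threshold))
        unlabelled.remove(query)
        labelled.append((query, oracle(query)))
    return fit_threshold(labelled), len(labelled)


# Hypothetical task: the oracle (a human annotator) labels x as positive
# when x >= 62; the seed must contain one negative and one positive example.
pool = range(101)
seed = [(0, False), (100, True)]
threshold, n_labelled = active_learn(pool, lambda x: x >= 62, seed, rounds=7)
```

On this toy data, seven queries locate the boundary between 61 and 62, so 9 of the 101 examples are labelled in total instead of the whole pool.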
Knowledge discovery in geographic data
- GWiM, the system for mining in geographic data
An interpreter of an inductive query language for knowledge discovery in
geographic data was implemented employing the WiM [7, 8, 15] system. Three
kinds of inductive queries were implemented. Two of them, which ask for
characteristic and discriminant rules, are adaptations of GeoMiner (Han et
al., SIGMOD'97) rules. The dependency rules add a new quality to the
inductive query language.
- New inductive query language
An extension of GWiM has been developed.
Neighbourhood graphs are used to describe spatial relations. The inductive
query language is fully integrated with the PostgreSQL database system.
C4.5, RT4 and Progol are used for computation of inductive queries.
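A neighbourhood graph can be sketched as follows. The text does not state which neighbourhood relation GWiM uses, so this sketch assumes a simple Euclidean distance threshold between point objects. The object names, coordinates and the `max_distance` parameter are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import Dict, Set, Tuple

Point = Tuple[float, float]


def neighbourhood_graph(
    objects: Dict[str, Point], max_distance: float
) -> Dict[str, Set[str]]:
    """Connect every pair of objects that lie within max_distance of each other."""
    graph: Dict[str, Set[str]] = {name: set() for name in objects}
    for (a, pa), (b, pb) in combinations(objects.items(), 2):
        if dist(pa, pb) <= max_distance:
            # The neighbour relation is symmetric, so store both directions.
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical geographic objects with planar coordinates.
sites = {"school": (0.0, 0.0), "river": (3.0, 4.0), "forest": (10.0, 0.0)}
graph = neighbourhood_graph(sites, max_distance=6.0)
```

Each edge of such a graph could then be handed to a learner as a ground fact about a pair of neighbouring objects.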
The ILP system WiM [7, 8, 15] has been designed and implemented at FI MU,
Brno and CTU, Prague over the last few years, supported by ESPRIT ILP. WiM
extends Markus by shifting bias, generating negative examples and employing
oracles. An important feature of WiM is its ability to learn a logic program
from a small set of examples. If necessary, it poses a query to the user. WiM
uses a specific strategy for choosing this query, the aim of which is to
decrease the number of negative examples as much as possible. Within this
project, several versions of WiM dedicated to specialised applications were
developed, e.g. for object-oriented analysis and design and for knowledge
discovery in geographic data.
Courses relevant to ILP taught by members of the group
- ILP - one semester course (3 hours per week)
- KDD - one semester course (3 hours per week)
- KDD project (one semester project)
- ESPRIT METAL - combination of statistical methods with machine
learning, multistrategy learning
- Natural Language Processing Laboratory (with applications supporting
education of people with limited sight) (Ministry of Education, CZ) -
automatic recognition of noun phrases, synthesis of verb rules,
syntax analysis by means of machine learning, automatic
tagging of compound verb groups;
- ILP - WiM system [7, 15], applications of WiM in software engineering
and KDD;
- Expressivity of ophthalmology diseases in descendent populations of a
rural region (IGA MZ CR 4377-3 Ministry of Health, CZ) - collaboration
with Health of Child Research Institute in Brno.
Faculty of Informatics, Masaryk University
CZ - 602 00 Brno
- Kuba P.: Knowledge discovery in spatial data. Master thesis FI MU
Brno, 2000 (in Czech).
- Kuba P., Popelínský L.: Automatic classification of spatial data. 7th
Conference on GIS GIS...2000, Ostrava 2000 (in Czech).
- Nepil M.: Automatic construction of natural language grammar. Master
Thesis, FI MU 2000 (in Czech).
- Nepil M.: Learning Parse Actions from Annotated Sentences (submitted
- Pala K., Rychlý P., Smrz P.: DESAM - annotated corpus for Czech.
In Plásil F., Jeffery K.G. (eds.): Proceedings of SOFSEM'97, Milovy,
Czech Republic. LNCS 1338, Springer-Verlag 1997. (A modified version of
this paper is available as a technical report of FI MU.)
- Pavelek T., Popelínský L.: Towards lemma disambiguation: Similarity
classes. In Proc. of Summer School on Information Systems, Ruprechtov
1999 (in Czech)
- Flener P., Popelínský L., Stepánková O.: ILP and Automatic Programming:
Towards three approaches. Proc. of 4th Workshop on Inductive Logic
Programming (ILP'94), Bad Honnef, Germany, 1994.
- Popelínský L.: Towards Program Synthesis From A Small Example Set.
Proceedings of 21st Czech-Slovak conference on Computer Science
SOFSEM'94, pp. 91-96, Czech Society for Comp. Sci., Brno 1993. (See also
Proceedings of 10th WLP'94, Zuerich 1994, Switzerland.)
- Popelínský L.: Knowledge Discovery in Spatial Data by Means of ILP.
In: Zytkow J.M., Quafafou M. (Eds.): Principles of Data Mining and
Knowledge Discovery. Proc. of 2nd European Symposium PKDD'98, Nantes
France 1998. LNCS 1510, Springer-Verlag 1998.
- Popelínský L.: Inductive inference to support object-oriented analysis
and design. In: Proc. of 3rd Conf on Knowledge-Based Software
Engineering, Smolenice 1998, IOS Press.
- Popelínský L.: Approaches to Spatial Data Mining. In Proceedings of
GIS... Ostrava'99 Conference, ISSN 1211-4855, 1999.
- Popelínský L., Pavelek T., Ptácník T.: Towards disambiguation in Czech
corpora. In Proc. of LLL Workshop Bled, 1999
- Popelínský L., Pavelek T.: Mining lemma disambiguation rules from
Czech corpora. In Rauch J., Zytkow J.M. (Eds.): Principles and Practice
of Knowledge Discovery in Databases. Proc. of 3rd European Conference
PKDD'99, Prague Czech Republic 1999. LNCS 1704, Springer-Verlag 1999.
- Popelínský L.: Towards practical inductive logic programming. PhD
thesis FEL CTU Prague 2000.
- Smrz P., Zácková E.: New Tools for Disambiguation of Czech Texts. In
Sojka P., Matousek V., Pala K., Kopecek I.: Text, Speech, Dialogue.
Proceedings of the 1st Workshop on Text, Speech, Dialogue - TSD'98,
Brno, Czech Republic, Sept. 1998.
- Zácková E., Pala K.: Corpus-Based Rules for Czech Verb Discontinuous
Constituents. Proceedings of TSD'99, Springer Verlag 1999, LNAI 1692,
pp. 325-328. (extended and modified version of this paper is available
as technical report FI MU)
- Zácková E., Popelínský L., Nepil M.: Automatic
Tagging of Compound Verb Groups in Czech Corpora. In Proceedings of
TSD 2000, LNAI 1902, Springer Verlag 2000, pp. 115-120.
- Zácková E., Popelínský L., Nepil M.:
Tagging of Compound Verb Groups in Czech. In Proceedings of
CoNLL and LLL 2000, Lisbon, Portugal, Sept. 2000.
- Nepil M., Popelínský L., Zácková E.:
Part-of-Speech Tagging by Means of Shallow Parsing, ILP and Active Learning.
In Proceedings of 3rd Workshop on Learning Language in Logic (LLL), Strasbourg, 2001.