

ILP team at FI MU


- - - - -
Current activities
The ILP group at FI MU currently focuses on two application areas: natural language processing and knowledge discovery in geographic data.

Natural language processing

Groundwork

DESAM [5], a corpus of Czech newspaper texts, was built at the Natural Language Processing Laboratory, Faculty of Informatics, Masaryk University. It contains more than 1 000 000 word positions, about 130 000 different word forms, about 65 000 of them occurring more than once, and 1665 different tags. DIS, a semi-automatic disambiguator of Czech noun groups, was developed [16]. The ratio of unambiguous word forms increased from 50.6% to 58.5% after processing by DIS, and the number of tags per ambiguous token decreased from 3.3 to 2.7. In [17] a method for automatically finding compound verb groups in a Czech sentence is introduced. The method results in a definite clause grammar rule, called a verb rule, that contains information about the components of the verb group and their tags.
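The two ambiguity figures quoted above can be computed from a tagged corpus roughly as in the following Python sketch. The function name `ambiguity_stats`, the toy tokens and the tag names are assumptions made for illustration; they are not taken from DESAM or DIS, and the exact definitions used in [16] may differ.

```python
from typing import Dict, List, Set, Tuple


def ambiguity_stats(tokens: List[Tuple[str, Set[str]]]) -> Dict[str, float]:
    """Return the share of unambiguous tokens and the mean number of
    candidate tags per ambiguous token.

    Each token is a (word_form, candidate_tags) pair. A token is treated
    as unambiguous when it has exactly one candidate tag (an assumption).
    """
    if not tokens:
        raise ValueError("empty corpus")
    ambiguous = [tags for _, tags in tokens if len(tags) > 1]
    unambiguous_ratio = (len(tokens) - len(ambiguous)) / len(tokens)
    tags_per_ambiguous = (
        sum(len(tags) for tags in ambiguous) / len(ambiguous) if ambiguous else 0.0
    )
    return {
        "unambiguous_ratio": unambiguous_ratio,
        "tags_per_ambiguous_token": tags_per_ambiguous,
    }


# Hypothetical toy data; the tag names are invented placeholders.
toy_corpus = [
    ("stav", {"N1", "V3"}),
    ("je", {"V1", "P1", "P2"}),
    ("doma", {"D1"}),
    ("dnes", {"D1"}),
]
stats = ambiguity_stats(toy_corpus)
```

Running the same function on the corpus before and after disambiguation would give the kind of before/after comparison reported for DIS.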
Lemma disambiguator for Czech
was developed [12, 13] employing Progol. A disambiguation method was introduced that combines ILP and instance-based learning. The algorithm reached accuracy greater than 90%, leaving less than 15% of words ambiguous. Lemma disambiguation of unknown words was described in [6]. Progol was also tested in tag disambiguation of Czech nouns [12]. The first results for tag disambiguation reach an average accuracy of 91.5%.
GRIND system [3, 4]
was implemented; it is capable of learning a sequence of context-dependent parse actions from a set of syntactically annotated sentences. In the first step, GRIND constructs a sequence of `deepening operators'. Then, in the second learning phase, constraints on the application of these operators are induced by means of ILP: so-called `forbidding predicates' are learned.
Automatic tagging of compound verb groups
Finding all parts of a compound verb group in a Czech sentence and tagging the group as a whole is necessary groundwork for any subsequent (semantic) analysis. From the annotated corpus DESAM, 126 DCG rules were extracted which cover all frequent verb groups in Czech [17, 18]. Using those rules we are able to recognise compound verb groups in unannotated Czech texts with an accuracy of 93%.
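As a rough illustration of how a verb rule might be applied to a tagged sentence, the Python sketch below matches the components of a rule in order while allowing other words in between, since compound verb groups can be discontinuous. The rule representation, the function `match_verb_rule` and the tag names (AUX, PPART and so on) are hypothetical simplifications; the actual rules in [17, 18] are DCG rules over DESAM tags.

```python
from typing import List, Optional, Sequence, Tuple

Token = Tuple[str, str]      # (word, tag)
Component = Tuple[str, str]  # (required word or "*" for any word, required tag)


def match_verb_rule(sentence: Sequence[Token],
                    rule: Sequence[Component]) -> Optional[List[int]]:
    """Return the positions of the tokens that match the rule components
    in order, allowing gaps between them, or None if the rule does not apply."""
    positions: List[int] = []
    start = 0
    for word, tag in rule:
        for i in range(start, len(sentence)):
            tok_word, tok_tag = sentence[i]
            if tok_tag == tag and word in ("*", tok_word):
                positions.append(i)
                start = i + 1
                break
        else:
            # No remaining token matches this component.
            return None
    return positions


# "pak jsem to koupil" ("then I bought it"); the tags are invented placeholders.
sentence = [("pak", "ADV"), ("jsem", "AUX"), ("to", "PRON"), ("koupil", "PPART")]
past_tense_rule = [("jsem", "AUX"), ("*", "PPART")]
group = match_verb_rule(sentence, past_tense_rule)
```

Here `group` holds the positions of the auxiliary and the participle, which could then be tagged as one verb group.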
Part-of-Speech Tagging by Means of Shallow Parsing, ILP and Active Learning [19]
A part-of-speech tagger for Czech is described that employs the DIS shallow parser for Czech, manually coded rules and inductive logic programming. The active learning method used reduced both the number of training examples to label and the learning time, without a decrease in recall or accuracy. Compared with previous work, both recall and accuracy increased and the number of training examples to label decreased. The method was tested on ambiguities that are frequent in Czech. The accuracy reached was higher than 96%, with recall higher than 95%.
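The text above does not say which selection strategy the active learner uses. The Python sketch below therefore shows a generic stand-in, pool-based uncertainty sampling, only to illustrate why active learning reduces the number of examples that must be labelled. The counting learner replaces the ILP learner purely for brevity, and all names (`train`, `confidence`, `active_learn`, the toy pool) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def train(labelled: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Toy learner: count the labels seen for each context feature."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for context, label in labelled:
        model[context][label] += 1
    return model


def confidence(model: Dict[str, Counter], context: str) -> float:
    """Share of the majority label for this context (0.0 if never seen)."""
    counts = model.get(context)
    if not counts:
        return 0.0
    return max(counts.values()) / sum(counts.values())


def active_learn(pool: Sequence[str],
                 oracle: Callable[[str], str],
                 budget: int) -> List[Tuple[str, str]]:
    """Ask the oracle to label only the examples the model is least sure about."""
    remaining = list(pool)
    labelled: List[Tuple[str, str]] = []
    for _ in range(min(budget, len(remaining))):
        model = train(labelled)
        query = min(remaining, key=lambda c: confidence(model, c))
        remaining.remove(query)
        labelled.append((query, oracle(query)))
    return labelled


# Hypothetical pool of context features and an oracle standing in for the annotator.
gold = {"a": "noun", "b": "verb", "c": "noun"}
labelled = active_learn(["a", "a", "a", "b", "c"], gold.__getitem__, budget=3)
```

With a budget of three queries the learner labels one example of each distinct context instead of three copies of the first one, which is the effect that saves annotation work.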

Knowledge discovery in geographic data

GWiM, the system for mining in geographic data
An interpreter of an inductive query language for knowledge discovery in geographic data [13] was implemented employing the WiM system [7, 8, 15]. Three kinds of inductive queries were implemented. Two of them, which ask for characteristic and discriminant rules, are adaptations of GeoMiner (Han et al., SIGMOD'97) rules. The third kind, dependency rules, adds a new quality to the inductive query language.
New inductive query language
An extension of GWiM has been developed [1]. Neighbourhood graphs [11] are used to describe spatial relations. The inductive query language is fully integrated with the PostgreSQL database system. C4.5, RT4 and Progol are used for the computation of inductive queries.
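A neighbourhood graph has spatial objects as nodes and an edge between two objects whenever a chosen neighbourhood relation holds between them. The Python sketch below assumes point objects and a distance threshold as the relation; the object names, coordinates and the function `neighbourhood_graph` are invented for illustration and do not come from GWiM.

```python
from itertools import combinations
from math import dist
from typing import Dict, Set, Tuple

Point = Tuple[float, float]


def neighbourhood_graph(objects: Dict[str, Point],
                        max_distance: float) -> Dict[str, Set[str]]:
    """Build an undirected graph as an adjacency map: two objects are
    neighbours when their Euclidean distance is at most max_distance."""
    graph: Dict[str, Set[str]] = {name: set() for name in objects}
    for a, b in combinations(objects, 2):
        if dist(objects[a], objects[b]) <= max_distance:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical objects with planar coordinates.
objects = {"school": (0.0, 0.0), "river": (3.0, 4.0), "forest": (10.0, 10.0)}
graph = neighbourhood_graph(objects, max_distance=5.0)
```

Other relations, such as topological or directional ones, could be plugged in by replacing the distance test with a different predicate.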
- - - - -
Previous ILP research
The ILP system WiM [7, 8, 15] has been designed and implemented at FI MU, Brno and CTU, Prague over the last few years, supported by ESPRIT ILP. WiM extends Markus by shifting bias, generating negative examples and employing oracles. An important feature of WiM is its ability to learn a logic program from a small set of examples. If necessary, it poses a query to the user. WiM chooses this query with a specific strategy that aims to reduce the number of negative examples as much as possible. Within this project, several versions of WiM dedicated to specialised applications were developed, e.g. object-oriented analysis and design [10] and knowledge discovery in geographic data [9].
- - - - -
Members
- - - - -
Courses

Courses relevant to ILP taught by members of the group

  • ILP - one semester course (3 hours per week)
  • KDD - one semester course (3 hours per week)
  • KDD project (one semester project)
- - - - -
Participation in conferences
- - - - -
Participation in projects
  • ESPRIT METAL - combination of statistical methods with machine learning, multistrategy learning
  • Natural Language Processing Laboratory (with applications supporting education of people with limited sight) (Ministry of Education, CZ) - automatic recognition of noun phrases [16], synthesis of verb rules [17], syntax analysis by means of machine learning [4], automatic tagging of compound verb groups [18];
  • ILP - WiM system [7, 15], applications of WiM in software engineering [10] and KDD [9];
  • Expressivity of ophthalmology diseases in descendent populations of a rural region (IGA MZ CR 4377-3 Ministry of Health, CZ) - collaboration with Health of Child Research Institute in Brno.
- - - - -
Contact address
Lubos Popelínský, popel@informatics.muni.cz
Faculty of Informatics, Masaryk University
Botanická 68a
CZ - 602 00 Brno
Czech Republic
- - - - -
References
  1. Kuba P.: Knowledge discovery in spatial data. Master thesis FI MU Brno, 2000 (in Czech).
  2. Kuba P., Popelínský L.: Automatic classification of spatial data. 7th Conference on GIS GIS...2000, Ostrava 2000 (in Czech).
  3. Nepil M.: Automatic construction of natural language grammar. Master Thesis, FI MU 2000 (in Czech).
  4. Nepil M.: Learning Parse Actions from Annotated Sentences (submitted to TSD'00)
  5. Pala K., Rychlý P., Smrz P.: DESAM - annotated corpus for Czech. In Plásil F., Jeffery K.G. (eds.): Proceedings of SOFSEM'97, Milovy, Czech Republic. LNCS 1338, Springer-Verlag 1997. (modified version of this paper is available as technical report FI MU)
  6. Pavelek T., Popelínský L.: Towards lemma disambiguation: Similarity classes. In Proc. of Summer School on Information Systems, Ruprechtov 1999 (in Czech)
  7. Flener P., Popelínský L., Stepánková O.: ILP and Automatic Programming: Towards three approaches. Proc. of 4th Workshop on Inductive Logic Programming (ILP'94), Bad Honnef, Germany, 1994.
  8. Popelínský L.: Towards Program Synthesis From A Small Example Set. Proceedings of 21st Czech-Slovak conference on Computer Science SOFSEM'94, pp.91-96 Czech Society for Comp. Sci. Brno 1993. (See also Proceedings of 10th WLP'94, Zuerich 1994, Switzerland.)
  9. Popelínský L.: Knowledge Discovery in Spatial Data by Means of ILP. In: Zytkow J.M., Quafafou M.(Eds.): Principles of Data Mining and Knowledge Discovery. Proc. of 2nd European Symposium PKDD'98, Nantes France 1998. LNCS 1510, Springer-Verlag 1998.
  10. Popelínský L.: Inductive inference to support object-oriented analysis and design. In: Proc. of 3rd Conf on Knowledge-Based Software Engineering, Smolenice 1998, IOS Press.
  11. Popelínský L.: Approaches to Spatial Data Mining. In Proceedings of GIS... Ostrava'99 Conference, ISSN 1211-4855, 1999.
  12. Popelínský L., Pavelek T., Ptácník T.: Towards disambiguation in Czech corpora. In Proc. of LLL Workshop Bled, 1999
  13. Popelínský L., Pavelek T.: Mining lemma disambiguation rules from Czech corpora. In Rauch J., Zytkow J.M. (Eds.): Principles and Practice of Knowledge Discovery in Databases. Proc. of 3rd European Conference PKDD'99, Prague Czech Republic 1999. LNCS 1704, Springer-Verlag 1999.
  14. Popelínský L.: Towards practical inductive logic programming. PhD thesis FEL CTU Prague 2000.
  15. Smrz P., Zácková E.: New Tools for Disambiguation of Czech Texts. In Sojka P., Matousek V., Pala K., Kopecek I.: Text, Speech, Dialogue. Proceedings of the 1st Workshop on Text, Speech, Dialogue - TSD'98, Brno, Czech Republic, Sept. 1998.
  16. Zácková E., Pala K.: Corpus-Based Rules for Czech Verb Discontinuous Constituents. Proceedings of TSD'99, Springer Verlag 1999, LNAI 1692, pp. 325-328. (extended and modified version of this paper is available as technical report FI MU)
  17. Zácková E., Popelínský L., Nepil M.: Automatic Tagging of Compound Verb Groups in Czech Corpora. In Proceedings of TSD 2000, LNAI 1902, Springer Verlag 2000, pp. 115-120.
  18. Zácková E., Popelínský L., Nepil M.: Recognition and Tagging of Compound Verb Groups in Czech. In Proceedings of CoNLL and LLL 2000, Lisbon, Portugal, Sept. 2000.
  19. Nepil M., Popelínský L., Zácková E.: Part-of-Speech Tagging by Means of Shallow Parsing, ILP and Active Learning. In Proceedings of 3rd Workshop on Learning Language in Logic (LLL), Strasbourg, 2001.
- - - - -