This demonstration introduces a
project to construct a large corpus
of written Polish for NLP
applications, financed by the State
Committee for Scientific Research (a
Polish government body; grant number
7 T11C 043 20).
We briefly present the following
characteristics of the corpus:
- aims (primarily NLP applications,
  but also with lexical, theoretical
  linguistic, language teaching and
  sociolinguistic applications in
  mind);
- intended size and make-up of the
  corpus; 
- the original system of
  morphosyntactic annotation; 
- the system of structural and
  meta-data annotation; 
- XML (XCES) standards adopted; 
- original tools for linguistic
  annotation of the corpus: 
  - morphological analyser;
  - statistical tagger;
- intended ways of making the corpus
  publicly available. 

We also point out similarities and
differences between this project and
comparable corpus initiatives for
other languages, and justify the
current project by the lack of
publicly available and/or
linguistically annotated corpora for
Polish.

Various aspects of this project will
be presented in more detail in
related demonstrations (depending on
the TSD organisers' decision to
accept them).

