PV030 -- Textual Information Systems (spring 2012)
- 17. 5. 2012 10AM Final exam (70 min in C511)
- 10. 5. 2012 (3+ hours in C511)
Dictionary implementations.
Syntactical methods of compression.
Models of TIS: boolean, vector and probability.
IR evaluation (TREC et al.).
Document similarity, gensim.
Automatic structuring of texts.
Signature methods.
Compression with neural networks.
slides
Q&A session.
- 3. 5. 2012 (3+ hours in C511)
Introduction to compression.
Statistical methods of compression.
Shannon-Fano, Huffman and arithmetic coding.
Dictionary methods of compression.
Adaptive dictionary compression methods.
FGK (Dvorak)
LZ77 (Dvorak),
LZ78 (Dvorak),
PPM (Dvorak)
Methods with dictionary restructuring.
Dictionary implementations.
Syntactical methods of compression.
slides
Homework: try out the document similarities computed by
gensim
on documents in the projects
DML-CZ and EuDML.
- 26. 4. 2012 (3 hours in C511)
Basics of corpus linguistics as an example of textual information system.
Indexing with natural language processing and its implementation.
Coding theory: basic notions. Entropy, redundancy.
Universal encoding of natural numbers.
slides
Readings: Information Retrieval
(cont.): topics from
Part
4 (Index construction).
Part
5 (Index compression).
Part
6 (Scoring, term weighting and vector space model).
Part
7 (Computing scores in a complete search system).
Part
8 (Evaluation in information retrieval).
Homework: a) have a look at
Touchgraph as
a possible TIS interface; b) try WordNet (module add wn) on
aisa or elsewhere.
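As an illustration of the coding-theory topics of this lecture (entropy, universal encoding of natural numbers), here is a minimal Python sketch. It assumes the Elias gamma code as the example of a universal code; the lecture may have covered other codes as well. The function names (entropy, elias_gamma_encode, elias_gamma_decode) are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def entropy(text):
    """Shannon entropy (bits per symbol) of the empirical symbol distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def elias_gamma_encode(n):
    """Elias gamma codeword of a natural number n >= 1, as a string of bits."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1 only")
    binary = bin(n)[2:]                      # binary form, starts with '1'
    return "0" * (len(binary) - 1) + binary  # unary length prefix + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of gamma codewords (assumed well-formed)."""
    numbers = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the leading zeros
            zeros += 1
            i += 1
        numbers.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return numbers
```

For example, elias_gamma_encode(5) gives "00101", and decoding the concatenated codewords of 1, 5 and 9 returns [1, 5, 9]. The code needs no prior bound on n, which is what makes it usable for index compression.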
- 19. 4. 2012 (4 hours in C511)
Brainstorming on anatomy of Google: Google paper on
WWW7 conference,
Jeff Dean's video lecture,
Google File System,
Google executive,
PageRank Calculator
About Google in Czech,
Google Gives Search a Refresh, slides
- 12. 4. 2012 (2.5 hours in C511)
Mathematical Information Retrieval and Metric space approaches as
further examples of IR systems:
EuDML,
MIaS/WebMIaS
Reading for next week:
first paper about Google
- 5. 4. 2012 (4 hours in C511)
Information
Retrieval (cont.):
Part
3 (Dictionaries and tolerant retrieval).
Homework: questions about Google from the slides, plus
read this article.
- 29. 3. 2012 (4 hours in C511)
Midterm exam (approx. 1 hour, from 10 a.m., C511).
Introduction to Information
Retrieval. Slides (Manning):
Part 1 Boolean retrieval.
Part 2 The term vocabulary and postings lists.
- 22. 3. 2012 (4 hours, 12-14 exercises in B311)
Proximity search. Search classification: six-dimensional space of search
problems.
slides
Exercises (12 to 14) in B311: Visualisation of search engines (Pojer).
Sketch Engine motivation examples.
Index methods: preliminaries. Implementation of indexes.
Automatic indexing, thesauri construction.
The midterm exam will take place next week; prepare for it and ask
questions on the discussion forum!
- 15. 3. 2012 (4 hours in C511)
Regular expressions, search for infinitely many
patterns. Right-to-left search methods (variants of Boyer-Moore,
Commentz-Walter, Buczilowski). Two-way automata with jumps:
a generalization of exact search algorithms.
Hierarchy of search engines.
slides
animations
of algorithms Boyer-Moore, KMP (Buehler)
taxonomy
of search automaton constructions
reformulation of CW algo by L. Riedel
(in Czech)
Homework:
Let P = {tis, ti, iti} be a set of patterns.
1) Create an NFA for searching P, without epsilon transitions.
2) Create a DFA equivalent to the NFA from 1).
3) Minimize the DFA created in 2).
4) Compare searching with the automaton from 3) with AC.
5) You may experiment with finite automata and JFLAP
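For the comparison with AC in step 4, the following minimal Python sketch of an Aho-Corasick search automaton may serve as a reference point. It is only an illustration, not the construction from the slides: the names build_automaton and search are hypothetical, and the automaton is stored as a trie with failure links rather than as a full DFA transition table.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure and output tables of an Aho-Corasick automaton."""
    goto = [{}]      # goto[state][char] -> next state (trie edges)
    fail = [0]       # failure link of each state
    out = [set()]    # patterns reported on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first traversal: failure links of shallower states are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]   # inherit patterns ending as suffixes
    return goto, fail, out


def search(text, patterns):
    """Return sorted (start position, pattern) pairs of all occurrences."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return sorted(hits)
```

With the homework patterns, search("titis", ["tis", "ti", "iti"]) returns [(0, 'ti'), (1, 'iti'), (2, 'ti'), (2, 'tis')]. The minimized DFA from step 3 should report the same occurrences; the difference lies in how transitions are represented (explicit transitions for every symbol versus failure links).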
- 8. 3. 2012 (4 hours in C511)
Exact search for one pattern
(Shift-Or, Karp-Rabin, MP, KMP) and for several patterns (AC).
Regular expressions, exact search for infinitely many patterns.
slides
animation
of
Aho-Corasick algorithm, and
implementation
in C#.
Animations:
String
matching algorithms (with animations, Lecroq),
Interactive Pattern Matching Animation (Goodrich),
animation
of the KMP algorithm (Buehler)
Exercises in C511.
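To make one of the single-pattern algorithms of this lecture concrete, here is a minimal Python sketch of Shift-Or. It is an illustration under stated assumptions, not the version from the slides: the name shift_or is hypothetical, the pattern is assumed to be non-empty, and Python's unbounded integers stand in for the machine word.

```python
def shift_or(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)            # assumed m >= 1
    all_ones = (1 << m) - 1
    # masks[c] has bit j cleared exactly when pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << j)
    # Bit j of state is 0 iff pattern[0..j] matches the text ending here.
    state = all_ones
    hits = []
    for i, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:   # the whole pattern matched
            hits.append(i - m + 1)
    return hits
```

For example, shift_or("titis", "ti") returns [0, 2]. Each text symbol costs one shift and one OR, so the search is linear in the text length as long as the pattern fits into a machine word.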
- 1. 3. 2012 neither lecture nor seminar
(we will have 4 hours in the next 3 weeks)
- 23. 2. 2012 in C511.
Introduction, basic notions, classification of search problems.
slides
Watson,
paper
about Watson, (local copy)
To a disciple who dreaded making mistakes, the Master said: "Those who
make no mistakes err most of all - they attempt nothing new." Anthony de
Mello: O cestě.

sojka at fi dot muni dot cz