
The course is based on the textbook by Manning, Raghavan and Schütze, Introduction to Information Retrieval, which is taught at Stanford, Munich and other places. In the course you will, among other things, learn how it is possible for Google to respond to 10,000+ questions per second from different places on the globe within milliseconds. Numerous rich and detailed materials are available on Coursera. Several copies of the textbook are available in the library at FI. This year again, parts on machine (deep) learning will be added, together with topics such as image or XML retrieval. Students are encouraged to try active/flipped learning approaches wherever possible.

- An invited lecture by Vladimír Kadlec (Seznam) will take place as part of the course; others are being negotiated. All planned lectures below are *subject to change*.
- The IS MU discussion group for PV211 is here for discussions. You are encouraged to post your questions there!
- Course trailer (in Czech).
- 12. 3.: Exam dates posted in the IS.

- 21. 2. 2018 12:00 D3: Introduction to IR, Boolean Retrieval.

Boolean retrieval slides 1, IIR chapter 1

Exercises 1 (IS)
- 28. 2. 2018 12:00 D3: Dictionary and postings storage (Indexing). Tolerant Retrieval.

Readings: ternary trees, Soundex demo. Explore Google datacenters (YouTube video).

Term vocabulary and postings lists slides 2, IIR chapter 2

Dictionaries and tolerant retrieval slides 3, IIR chapter 3

Exercises 2 (IS)
- 7. 3. 2018 12:00 D3: Index construction, MapReduce, Compression.

Readings: Index construction slides 4, IIR chapter 4

Compression slides 5, IIR chapter 5

Exercises 3 (IS)
- 14. 3. 2018 12:00 D3: Vector Space Model, IR system architecture.

Readings: Scoring, term weighting, the vector space model slides 6, Vector space model (slides Arguello), IIR chapter 6

Scoring slides 7, IIR chapter 7

slides Google architecture (Ed Austin), slides Google infrastructure (Jeff Dean), Jeff Dean (YouTube video), Google Anatomy paper from 1998, Google File System, About Google [searches], Jak funguje Google (YouTube video).

Complete search system Challenges in Building Google... (slides by Jeff Dean from Stanford CS276 course in 2015).

Exercises 4 (IS)
- 21. 3. 2018 12:00 D3: Evaluation, Relevance feedback and Query expansion.

Readings: Evaluation and result summaries slides 8, IIR chapter 8.

Query expansion slides 9, IIR chapter 9.

Exercises 5 (IS)

MIDTERM test #1
- 28. 3. 2018 12:00 D3: Classification, SVM.

Readings: Text Classification and Naive Bayes slides 13, IIR chapter 13.

Vector Space Classification slides 14, IIR chapter 14.

Support Vector Machines slides 15a, Learning to Rank slides 15b, (IIR chapter 15).

Exercises 6 (IS)
- 4. 4. 2018 12:00 D3: Seznam.cz Fulltext Architecture by Vladimír Kadlec (LinkedIn). Video (630 MiB, MP4), slides (500 KiB, PDF).

Abstract: The talk covers all the basic building blocks of a web search engine: crawling, indexing, query reformulation, and relevance. It explains the inner parts of the user interface, such as the autocompleter, query corrector, and suggested searches, and presents real statistics from Seznam's traffic. As a bonus: image/video search.

Vladimír has worked as a senior researcher at Seznam.cz since 2011 and is currently the head of the whole research team there. He earned his doctoral degree from FI MU in 2008. All of his research has been related to natural language processing or information retrieval. At Seznam.cz he designs and improves algorithms for the fulltext search engine. His team works on various machine learning tasks such as fulltext search, text and web page analysis, recommendation systems, and image recognition. Vladimír loves (almost) all sports, from snowboarding to cycling.

Exercises 7 (IS)
- 11. 4. 2018 12:00 D3: Clustering, machine learning.

Readings: Flat Clustering slides 16, IIR chapter 16.

(Hierarchical Clustering slides 17, IIR chapter 17).

Latent Dirichlet Allocation. Topic similarity by LDA: intro, LDA slides by Blei, LDA visual browser demo.

Exercises 8 (IS)
- 18. 4. 2018 12:00 D3: Web search.

Readings: Web search slides 19, IIR chapter 19.

Exercises 9 (IS)
- 25. 4. 2018 12:00 D3: Link analysis.

Readings: Link Analysis slides 21, IIR chapter 21, How Google finds a needle....

Exercises 10 (IS)

MIDTERM test #2
- 2. 5. 2018 12:00 D3: Crawling. Link Analysis. XML retrieval.

Crawling slides 20, IIR chapter 20, Sketch Engine

Link Analysis (HITS) slides 21 (cont.)

Exercises 11 (IS)
- 9. 5. 2018 12:00 D3: Latent Semantic Indexing, Semantic indexing. MathML retrieval.

Readings: Latent Semantic Indexing slides 18, IIR chapter 18, Gensim.

Semantic indexing in ScaleText, paper on ScaleText's design.

Readings: XML retrieval slides 10, IIR chapter 10, MathML retrieval by MIaS in EuDML: slides

Exercises 12 (IS)
- 16. 5. 2018 12:00 D3: No teaching (Dies Academicus), by decision of the rector and the dean.
- 23. 5. 2018 12:00 D3: Last lecture, surprise topic.

Questions and answers session.

- PV211 course pages from 2017, 2016, 2015, and 2014.
- Web pages of similar courses at Stanford and Munich.

- Google: first paper (Anatomy of Google presented at WWW7 conference), Google crash course (in Czech), execs

I will be glad if the course topics catch your interest and you decide
to gain deeper insight into them by solving [mini]projects.
Activities in this direction will be rewarded with a nontrivial number of
*premium* points towards successful grading.
The number of stars below is an estimate of project
difficulty, from a miniproject [(*), 10 points] to a big project [(*****), 30+ points].
I am open to assigning or extending a project as a Bachelor's, Master's, or Dissertation thesis;
just contact me.

- (*)+ Pointing out any (factual, typographical) errors in the course materials.
- (**)+ Preparation of hot-topic slides, solutions to exercises, production of a motivating Khan Academy-style video, or other course materials in LaTeX.
- (**)+ Presentation or teaching video on topics relevant to the course. Possible topics: Sketch Engine, search with linguistic attributes, random walks in texts, topic search and corpora, time-constrained search, topic modelling with gensim, LDA, Wolfram Alpha, specifics of searching structured data (chemical and mathematical formulae, linguistic trees, whether syntactic or dependency), etc.
- (***) Participation in an IR competition at Kaggle.com.
- (***) Participation in IR research on Math Information Retrieval, Gait Recognition, or the ScaleText project.
- (***)+ Evaluation of Math Information Retrieval in the MIaS system, possible as a Dean project under the supervision of Vít Novotný, Dávid Lupták, or Michal Růžička, or as a Bachelor's, Master's, or Dissertation thesis.

To a pupil who was in danger, the Master said, "Those who do not make mistakes are the most mistaken of all – they do not try anything new." Anthony de Mello