A List by Author: Jan Kasprzak

e-mail:
kas(a)fi.muni.cz
home page:
https://www.fi.muni.cz/~kas/

Access Rights in Enterprise Full-text Search

by Jan Kasprzak, Michal Brandejs, Matěj Čuhel, Tomáš Obšívač. A full version of the paper presented at the ICEIS 2010 conference. July 2010, 19 pages.

FIMU-RS-2010-08. Available as Postscript, PDF.

Abstract:

One of the toughest problems to solve when deploying an enterprise-wide full-text search system is to handle the access rights of the documents and intranet web pages correctly and effectively. Post-processing the results of a general-purpose full-text search engine (filtering out the documents inaccessible to the user who sent the query) can be an expensive operation, especially in large collections of documents. We discuss various approaches to this problem and propose a novel method which employs virtual tokens for encoding the access rights directly into the search index. We then evaluate this approach in an intranet system with several million documents and a complex set of access rights and access rules.
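
The idea of virtual tokens can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions, not the system described in the report: the class name TinyIndex, the "__acl__:" prefix, and the group-based access model are all hypothetical. Each document is indexed under its ordinary words plus one virtual token per group allowed to read it, so the access check becomes part of the index lookup instead of a post-processing step.

```python
import re
from collections import defaultdict

# Hypothetical prefix for virtual tokens. The tokenizer below only emits
# [a-z0-9]+ words, so document text can never produce a token with this prefix.
ACL_PREFIX = "__acl__:"


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index that stores access rights as virtual tokens."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids

    def add(self, doc_id, text, allowed_groups):
        # Ordinary content tokens.
        for word in tokenize(text):
            self.postings[word].add(doc_id)
        # Virtual tokens: one per group allowed to read the document.
        for group in allowed_groups:
            self.postings[ACL_PREFIX + group].add(doc_id)

    def search(self, query, user_groups):
        words = tokenize(query)
        if not words:
            return set()
        # Documents containing every query word.
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        # Documents readable by at least one of the user's groups.
        readable = set()
        for group in user_groups:
            readable |= self.postings.get(ACL_PREFIX + group, set())
        return hits & readable


index = TinyIndex()
index.add(1, "Budget report 2010", {"finance"})
index.add(2, "Public budget summary", {"finance", "everyone"})
print(index.search("budget", {"everyone"}))  # only document 2 is readable
```

In this sketch a user in the "finance" group would get both documents for the same query. A real deployment would also have to deal with rights that change after indexing, which the sketch ignores.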

Distributed System for Discovering Similar Documents

by Jan Kasprzak, Michal Brandejs, Miroslav Křipac, Pavel Šmerk. A full version of the paper presented at the ICEIS 2008 conference (www.iceis.org). July 2008, 14 pages.

FIMU-RS-2008-04. Available as Postscript, PDF.

Abstract:

One of the drawbacks of e-learning methods such as Web-based submission and evaluation of students' papers and essays is that it has become easier for students to plagiarize the work of other people. In this paper we present a computer-based system for discovering similar documents, which has been in use at Masaryk University in Brno since August 2006, and which will also be used in the forthcoming Czech national archive of graduate theses. We also focus on practical aspects of this system: achieving near real-time response to newly imported documents, and computational feasibility of handling large sets of documents on commodity hardware. We also show the possibilities and problems with parallelization of this system for running on a distributed cluster of computers.