X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=paper.tex;h=8e042ef485eb1340d00c25f8bb16151a02b28cf0;hb=HEAD;hp=e098f4ae4ceb6da19f51cf9b6876d7e38a20b561;hpb=8bd472fc89fa7f354933fcc568d8ad378c019c39;p=pan12-paper.git

diff --git a/paper.tex b/paper.tex
index e098f4a..8e042ef 100755
--- a/paper.tex
+++ b/paper.tex
@@ -4,11 +4,15 @@
 \usepackage[utf8]{inputenc}
 \usepackage{times}
 \usepackage{graphicx}
+\usepackage{algorithm}
+\usepackage{algorithmic}
+\usepackage{amssymb}
+\usepackage{multirow}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{document}
 
-\title{Your Title}
+\title{Three way search engine queries with multi-feature document comparison for plagiarism detection}
 %%% Please do not remove the subtitle.
 \subtitle{Notebook for PAN at CLEF 2012}
 
@@ -19,7 +23,16 @@
 \maketitle
 
 \begin{abstract}
-Briefly describe the main ideas of your approach.
+In this paper, we describe our approach to the PAN 2012 plagiarism detection competition.
+Our candidate retrieval system is based on extracting three different types of
+web queries and narrowing their execution by skipping certain passages of an input document.
+
+Our detailed comparison system detects common features of an input
+document pair, computes valid intervals from them, and then merges
+some detections in the postprocessing phase. We also discuss
+the relevance of the current PAN 2012 settings to real-world
+plagiarism detection systems.
+
 \end{abstract}
 
 
@@ -28,9 +41,33 @@ Briefly describe the main ideas of your approach.
 
 %The notebooks shall contain a full write-up of your approach, including all details necessary to reproduce your results.
 
-Due to the increasing ease of plagirism the plagiarism detection has nowdays become a need for many instutisions. Especially for universities where modern learning methods include e-learning and a vast document sources are online available.
-
-
+Due to the increasing ease of plagiarism, plagiarism detection has nowadays become a necessity for many institutions.
+This is especially true for universities, where modern learning methods include e-learning and vast document sources are available online.
+%In the Information System of Masaryk University~\cite{ismu} there is also an antiplagiarism tool which is based upon the same principles as are shown in this paper.
+The core methods for automatic plagiarism detection, which also work in practice on extensive collections of documents,
+are based on document similarities. In order to compute a similarity
+we need to possess both the original and the plagiarized document.
+%The most straightforward method is to use an online search engine in order to enrich
+%the document base with potentially plagiarized documents and evaluate the amount of plagiarism by detailed document comparison.
+%In this paper we introduce a method which has been used in the PAN 2012 competition\footnote{\url{http://pan.webis.de/}}
+%in plagiarism detection.
+
+In the first section we introduce our methods for candidate document retrieval from online sources,
+which took part in the PAN 2012 competition\footnote{\url{http://pan.webis.de/}} in plagiarism detection.
+The task was to retrieve a set of candidate source documents that may have served as an original for plagiarism.
+During the competition, there were several measures of performance, such as: i) Number of queries submitted,
+ii) Number of web pages downloaded, iii) Precision and recall of web pages downloaded regarding the actual sources,
+iv) Number of queries until the first actual source is found, v) Number of downloads until the first actual source is downloaded.
+Nevertheless, no overall performance measure was set, so we mainly focus on minimizing the query workload.
+%In the PAN 2012 candidate document retrieval test corpus, there were 32 text documents, all of which contained at least one plagiarism case.
+%The documents were approximately 30 KB in size; the smallest was 18 KB and the largest was 44 KB.
+
+In the second section we describe our approach to detailed document comparison.
+We highlight the differences between this approach and the one we used for the PAN 2010
+competition. We then provide the outline of the algorithm, and describe
+its steps in detail. We briefly mention the approaches we have explored,
+but did not use in the final submission. Finally, we discuss the performance
+of our system (both in terms of the plagdet score, and in terms of CPU time).
 
 \include{simon-searchengine}
 
@@ -38,7 +75,19 @@ Due to the increasing ease of plagirism the plagiarism detection has nowdays bec
 
 \section{Conclusions}
 
-Tady napsat zaver
+We present methods for candidate document retrieval which lead to
+the discovery of a decent amount of plagiarism while minimizing the number of queries used.
+The proposed methods are applicable in general to any type of text input with no a priori information about the input document.
+In the PAN 2012 competition, the proposed methods detected a competitive amount of plagiarism while using
+only a small fraction of the queries compared to the others.
+
+We also present a novel approach for detailed (pair-wise) document
+comparison, where we allow the common features of different types
+to be evaluated together into valid intervals, even though the particular
+types of common features can vary to a great extent in their length
+and importance, and do not provide a natural ordering.
+The presented approach achieved the second-highest plagdet score
+in the PAN 2012 competition.
 
 \bibliographystyle{splncs03}
 \begin{raggedright}