yenya: applied comments from Simon

[pan12-paper.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index e098f4ae4ceb6da19f51cf9b6876d7e38a20b561..8e042ef485eb1340d00c25f8bb16151a02b28cf0 100755 (executable)
--- a/paper.tex
+++ b/paper.tex
@@ -4,11 +4,15 @@
  \usepackage[utf8]{inputenc}
  \usepackage{times}
  \usepackage{graphicx}
+\usepackage{algorithm}
+\usepackage{algorithmic}
+\usepackage{amssymb}
+\usepackage{multirow}
  
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \begin{document}
  
-\title{Your Title}
+\title{Three way search engine queries with multi-feature document comparison for plagiarism detection}
  %%% Please do not remove the subtitle.
  \subtitle{Notebook for PAN at CLEF 2012}
  
@@ -19,7 +23,16 @@
  \maketitle
  
  \begin{abstract}
-Briefly describe the main ideas of your approach.
+In this paper, we describe our approach to the PAN 2012 plagiarism detection competition.
+Our candidate retrieval system extracts three different types of web queries
+and reduces the number of queries executed by skipping certain passages of the input document.
+
+Our detailed comparison system detects common features of an input
+document pair, computes valid intervals from them, and then merges
+some detections in the postprocessing phase. We also discuss
+the relevance of the current PAN 2012 settings to real-world
+plagiarism detection systems.
+
  \end{abstract}
  
  
@@ -28,9 +41,33 @@ Briefly describe the main ideas of your approach.
  %The notebooks shall contain a full write-up of your approach, including all details necessary to reproduce your results.
  
  
-Due to the increasing ease of plagirism the plagiarism detection has nowdays become a need for many instutisions. Especially for universities where modern learning methods include e-learning and a vast document sources are online available.  
-
-
+Due to the increasing ease of plagiarism, plagiarism detection has become a necessity for many institutions,
+especially for universities, where modern learning methods include e-learning and vast document sources are available online.
+%In the Information System of Masaryk University~\cite{ismu} there is also an antiplagiarism tool which is based upon the same principles as are shown in this paper.
+The core methods for automatic plagiarism detection, which also work in practice on extensive collections of documents,
+are based on document similarities. In order to compute a similarity,
+we need to possess both the original and the plagiarized document.
+%The most straightforward method is to use an online search engine in order to enrich
+%document base with potential plagiarized documents and evaluate the amount of plagiarism by detailed document comparison. 
+%In this paper we introduce a method which has been used in PAN 2012 competition\footnote{\url{http://pan.webis.de/}}
+%in plagiarism detection.
+
+In the first section we introduce the methods for candidate document retrieval from online sources that took part in
+the PAN 2012 competition\footnote{\url{http://pan.webis.de/}} in plagiarism detection.
+The task was to retrieve a set of candidate source documents that may have served as originals for plagiarism.
+During the competition, there were several measures of performance, such as: i) the number of queries submitted,
+ii) the number of web pages downloaded, iii) the precision and recall of the downloaded web pages with respect to the actual sources,
+iv) the number of queries until the first actual source is found, v) the number of downloads until the first actual source is downloaded.
+Nevertheless, no overall performance measure was set, so we mainly focus on minimizing the query workload.
+%In the PAN 2012 candidate document retrieval test corpus, there were 32 text documents all contained at least one plagiarism case.
+%The documents were approximately 30 KB of size, the smallest were 18 KB and the largest were 44 KB.
+
+In the second section we describe our approach to detailed document comparison.
+We highlight the differences between this approach and the one we used for the PAN 2010
+competition. We then provide an outline of the algorithm and describe
+its steps in detail. We briefly mention the approaches we explored
+but did not use in the final submission. Finally, we discuss the performance
+of our system (both in terms of the plagdet score and in terms of CPU time).
  
  
  \include{simon-searchengine}
@@ -38,7 +75,19 @@ Due to the increasing ease of plagirism the plagiarism detection has nowdays bec
  
  \section{Conclusions}
  
-Write the conclusion here
+We present methods for candidate document retrieval which lead to the
+discovery of a decent amount of plagiarism while minimizing the number of queries used.
+The proposed methods are applicable in general to any type of text input with no a priori information about the input document.
+In the PAN 2012 competition the proposed methods succeeded with a competitive amount of plagiarism detected, using
+only a small fraction of the queries used by the others.
+ 
+We also present a novel approach for detailed (pair-wise) document
+comparison, where we allow the common features of different types
+to be evaluated together into valid intervals, even though the particular
+types of common features can vary to a great extent in their length
+and importance, and do not provide a natural ordering.
+The presented approach achieved the second-highest plagdet score
+in the PAN 2012 competition.
  
  \bibliographystyle{splncs03}
  \begin{raggedright}