pan13-paper/simon-source_retrieval.tex

\section{Source Retrieval}
Source retrieval is a subtask of the plagiarism detection process during
which only a relatively small subset of documents is retrieved from a
large corpus. These candidate documents are usually then compared in detail with the
suspicious document. In the PAN 2013 source retrieval subtask, the main goal was to
identify web pages that had been used as sources of plagiarism during the creation of the
test corpus.
The test corpus contained XX documents, each discussing one and only one theme.
The documents were created intentionally by
semiprofessional writers, so they feature nearly realistic plagiarism cases.
Such conditions resemble a realistic plagiarism detection scenario, such as that faced by
state-of-the-art commercial plagiarism detection systems or by the anti-plagiarism service developed and
used at Masaryk University. The main difference between a real-world corpus
of suspicious documents, for example a corpus created from theses stored in the Information System of Masaryk University,
and the corpus of suspicious documents used in the PAN 2013 competition is that in the PAN
corpus every document contains plagiarized passages. We can therefore deepen the search
in those parts of the document where no similar passage has yet been found. This is the main
idea behind improving the recall of detected plagiarism in a suspicious document.

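
The idea of deepening the search only where nothing has been found yet can be illustrated by a short Python sketch.
It is not the implementation used in the competition: the function name \texttt{uncovered\_intervals},
the use of half-open character offsets, and the example values are assumptions made purely for illustration.
Given the passages already detected in a suspicious document, the function returns the remaining gaps that
could still be searched.

\begin{verbatim}
```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open character span [start, end)


def uncovered_intervals(doc_length: int,
                        discovered: List[Interval]) -> List[Interval]:
    """Return the parts of [0, doc_length) not covered by any discovered span."""
    gaps: List[Interval] = []
    cursor = 0  # end of the region known to be covered so far
    for start, end in sorted(discovered):
        # Clamp both ends to the document boundaries.
        start = min(max(start, 0), doc_length)
        end = min(max(end, 0), doc_length)
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < doc_length:
        gaps.append((cursor, doc_length))
    return gaps


if __name__ == "__main__":
    # Hypothetical 100-character document with three detected passages.
    print(uncovered_intervals(100, [(10, 30), (20, 40), (70, 80)]))
```
\end{verbatim}

In this hypothetical example the gaps are the spans before the first passage, between the merged
first two passages and the third one, and after the third one.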

\begin{figure}
  \centering
  \includegraphics[width=1.00\textwidth]{img/source_retrieval_process.pdf}
  \caption{Source retrieval process.}
  \label{fig:source_retr_process}
\end{figure}

Online plagiarism detection can be viewed as a reverse engineering task in which
we need to find the original documents from which the plagiarized document was created.
During the creation process, the plagiarist locates original documents by means of a search engine.
The user decides which queries to submit to the search engine and which of the results on the result page to use.
In a real-world scenario the corpus is the whole Web, and the search engine can be a contemporary commercial search engine
that scales to the size of the Web. This methodology is based on the fact that we do not
possess enough resources to download and effectively process the whole corpus.
In the case of the PAN 2013 competition, the corpus
of source documents is the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.
As the document retrieval tool for the competition we used the ChatNoir~\cite{chatnoir} search engine, which indexes the English
subset of ClueWeb.
The reverse engineering decision process consists of creating suitable queries on the basis of the suspicious document
and of deciding what to actually download from the search results and what to report as a plagiarism case.

These first two stages are shown in Figure~\ref{fig:source_retr_process} as Querying and Selecting. The selected results
from the search engine are immediately textually aligned with the suspicious document (see Section~\ref{text_alignment} for more details).
This is the last decision phase: deciding what to report.
If any continuous passage of reused text is detected, the result document is reported,
the continuous passages in the suspicious document are marked as ``discovered'', and those parts
are not processed any further.

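
The interplay of querying, selecting, and reporting described above can be sketched in Python as follows.
This is only an illustrative outline under stated assumptions, not the software used in the competition:
the names \texttt{retrieve\_sources}, \texttt{search}, and \texttt{align} are hypothetical, the search engine and the
text alignment are passed in as plain callables, and the sketch interprets ``no further processing'' as skipping
any query whose interval lies entirely inside an already discovered passage. Remembering already aligned
documents is a further assumption added to avoid repeated work.

\begin{verbatim}
```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Interval = Tuple[int, int]  # half-open span in the suspicious document


def covered(interval: Interval, discovered: List[Interval]) -> bool:
    """True if the interval lies entirely inside one discovered passage."""
    return any(d[0] <= interval[0] and interval[1] <= d[1] for d in discovered)


def retrieve_sources(
    queries: Iterable[Tuple[str, Optional[Interval]]],
    search: Callable[[str], List[str]],      # query text -> selected result ids
    align: Callable[[str], List[Interval]],  # result id -> reused passages
) -> Tuple[List[str], List[Interval]]:
    discovered: List[Interval] = []  # passages marked as "discovered"
    reported: List[str] = []         # result documents reported as sources
    seen: Set[str] = set()           # assumption: align each result only once
    for text, interval in queries:
        # Skip queries that represent an already discovered part.
        if interval is not None and covered(interval, discovered):
            continue
        for doc_id in search(text):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            passages = align(doc_id)
            if passages:  # some continuous reused passage was detected
                reported.append(doc_id)
                discovered.extend(passages)
    return reported, discovered
```
\end{verbatim}

With this structure, a query that covers only text already explained by a reported source is never sent to the
search engine, which keeps the number of queries low.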
\subsection{Querying}
Querying means using the search engine effectively in order to retrieve as many relevant
documents as possible with the minimum number of queries. We consider a retrieved document relevant
if it shares some textual characteristics with the suspicious document.

We used three different types of queries\footnote{We used a similar three-way methodology in the PAN 2012
Candidate Document Retrieval subtask. This time, however, we completely replaced the header-based queries
with paragraph-based queries, since the header-based queries did not pay off in the overall process.}:
i) keyword-based queries, ii) intrinsic-plagiarism-based queries, and iii) paragraph-based queries. Three main properties distinguish the types of query: i) positional; ii) phrasal; iii) deterministic.
Positional queries carry extra information about the textual interval of the suspicious document that the query represents.
A phrasal query aims to retrieve documents containing the same small piece of text; such queries are usually created from closely coupled words.
Deterministic queries for a given suspicious document are always the same, no matter how many times we run the software.
Nondeterministic queries, by contrast, may differ between two runs of the software.

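
The three properties can be made concrete with a small Python data structure.
The sketch below is hypothetical: the class \texttt{Query} and the two constructors
\texttt{first\_words\_query} and \texttt{sampled\_words\_query} are invented for illustration and do not
correspond to any of the three query types used in the competition. They only show what it means for a query
to be positional, phrasal, or deterministic.

\begin{verbatim}
```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Query:
    text: str
    interval: Optional[Tuple[int, int]] = None  # set only for positional queries
    phrasal: bool = False
    deterministic: bool = True

    @property
    def positional(self) -> bool:
        return self.interval is not None


def first_words_query(paragraph: str, start: int, n_words: int = 5) -> Query:
    """Hypothetical positional, phrasal, deterministic query:
    the first n_words adjacent words of a paragraph starting at offset start."""
    words = paragraph.split()[:n_words]
    return Query(" ".join(words), (start, start + len(paragraph)),
                 phrasal=True, deterministic=True)


def sampled_words_query(text: str, n_words: int = 5,
                        rng: Optional[random.Random] = None) -> Query:
    """Hypothetical non-positional, non-phrasal, nondeterministic query:
    a random sample of distinct words from the whole text."""
    rng = rng or random.Random()
    vocabulary = sorted(set(text.lower().split()))
    picked = rng.sample(vocabulary, min(n_words, len(vocabulary)))
    return Query(" ".join(picked), None, phrasal=False, deterministic=False)
```
\end{verbatim}

Calling \texttt{first\_words\_query} twice on the same paragraph yields equal queries, whereas two calls to
\texttt{sampled\_words\_query} without a fixed random generator may yield different ones.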
\subsubsection{Keyword-Based Queries}

\subsubsection{Intrinsic-Plagiarism-Based Queries}
\subsubsection{Paragraph-Based Queries}
\subsection{Search Control}


\subsection{Result Selection}
\subsection{Snippet Control}