
\section{Text Alignment}\label{text_alignment}

\subsection{Overview}

Our approach to the text alignment subtask of PAN 2013 uses the same
basic principles as our previous work in this area, described
in \cite{Suchomel2012}, which in turn builds on our work for previous
PAN campaigns \cite{Kasprzak2010}, \cite{Kasprzak2009a}:

We detect {\it common features} between source and suspicious documents,
where the features we currently use are word $n$-grams and stop-word $m$-grams
\cite{stamatatos2011plagiarism}. From those common features (each of which
can occur multiple times in both the source and the suspicious document), we form
{\it valid intervals}\footnote{%
We describe the algorithm for computing valid intervals in \cite{Kasprzak2009a},
and a similar approach is also used in \cite{stamatatos2011plagiarism}.}
of characters
from the source and suspicious documents, where the interval in both
of these documents is covered ``densely enough'' by the common features.
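
To make the idea concrete, the following is a minimal sketch in Python. It is
not the algorithm of \cite{Kasprzak2009a}; it only illustrates the principle
with a simplified greedy chaining. All names and parameters
(\texttt{word\_ngrams}, \texttt{common\_features}, \texttt{valid\_intervals},
$n = 4$, \texttt{max\_gap}, \texttt{min\_features}) are assumptions made for
this illustration, and ``densely enough'' is approximated here by a maximum
character gap between neighbouring features.

```python
import re
from collections import defaultdict


def word_ngrams(text, n=4):
    """Yield (feature, start_offset, end_offset) for every word n-gram."""
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)]
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        feature = " ".join(w for w, _, _ in window)
        yield feature, window[0][1], window[-1][2]


def common_features(susp, src, n=4):
    """Pair up every occurrence of an n-gram shared by both documents.

    Returns (susp_start, susp_end, src_start, src_end) tuples,
    sorted by their position in the suspicious document.
    """
    index = defaultdict(list)
    for feature, start, end in word_ngrams(src, n):
        index[feature].append((start, end))
    pairs = []
    for feature, start, end in word_ngrams(susp, n):
        for src_start, src_end in index.get(feature, ()):
            pairs.append((start, end, src_start, src_end))
    return sorted(pairs)


def valid_intervals(pairs, max_gap=50, min_features=3):
    """Greedily chain neighbouring feature pairs into intervals.

    A pair extends the current interval when it lies within max_gap
    characters of it in BOTH documents (assumed density criterion).
    Intervals with fewer than min_features features are discarded.
    """
    intervals = []
    current = None  # [susp_lo, susp_hi, src_lo, src_hi, feature_count]
    for s, e, ss, se in pairs:
        if (current is not None
                and s - current[1] <= max_gap
                and ss - current[3] <= max_gap
                and current[2] - se <= max_gap):
            current[1] = max(current[1], e)
            current[2] = min(current[2], ss)
            current[3] = max(current[3], se)
            current[4] += 1
        else:
            if current is not None and current[4] >= min_features:
                intervals.append(tuple(current[:4]))
            current = [s, e, ss, se, 1]
    if current is not None and current[4] >= min_features:
        intervals.append(tuple(current[:4]))
    return intervals
```

A single greedy chain such as this one cannot separate interleaved matches
from several source passages, which is one reason why it should be read as
an illustration only.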

We then postprocess the valid intervals, removing overlapping detections
and merging detections that are close enough to each other.
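
Both postprocessing steps can be sketched as follows. This is an illustrative
sketch under stated assumptions rather than our actual implementation: a
detection is assumed to be a tuple of character offsets
(\texttt{susp\_start}, \texttt{susp\_end}, \texttt{src\_start},
\texttt{src\_end}), the threshold \texttt{max\_distance} is a hypothetical
value, and overlaps are resolved by simply keeping the longer detection.

```python
def merge_close(detections, max_distance=100):
    """Merge detections of one document pair that are close in both documents.

    max_distance is an assumed threshold in characters.
    """
    merged = []
    for det in sorted(detections):
        if merged:
            last = merged[-1]
            susp_gap = det[0] - last[1]
            # Distance in the source, whichever side the detection lies on.
            src_gap = max(det[2] - last[3], last[2] - det[3])
            if susp_gap <= max_distance and src_gap <= max_distance:
                merged[-1] = (last[0], max(last[1], det[1]),
                              min(last[2], det[2]), max(last[3], det[3]))
                continue
        merged.append(det)
    return merged


def drop_overlaps(detections):
    """Among detections overlapping in the suspicious document, keep the longer.

    Detections are visited from longest to shortest; one is kept only
    if it does not overlap any detection that has already been kept.
    """
    kept = []
    by_length = sorted(detections, key=lambda d: d[1] - d[0], reverse=True)
    for det in by_length:
        if all(det[1] <= k[0] or det[0] >= k[1] for k in kept):
            kept.append(det)
    return sorted(kept)
```

For example, \texttt{merge\_close} joins two detections that are 50 characters
apart in the suspicious document and 60 characters apart in the source, while
a detection several hundred characters away is left untouched.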

In the next sections, we summarize the modifications we made for PAN 2013,
including approaches we tried but did not use. On the training corpus,
our software from PAN 2012 achieved a plagdet score of TODO, which we
consider the baseline for further improvements.

\subsection{Alternative features}

TODO \cite{torrejondetailed}

\subsection{Global postprocessing}

For PAN 2013, the algorithm had access to all of the source and suspicious
documents. Because of this, we have rewritten our software to handle
all of the documents at once, so that we can perform cross-document
optimizations and postprocessing, similar to what we did for PAN 2010.
This required refactoring most of the code. We are able to handle
most of the computation in parallel in per-CPU threads, with little
synchronization needed. The parallelization was used mainly
during development, where it provided a significant performance boost.
The official performance numbers are from a single-threaded run, though.
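
The structure of such a parallel run can be sketched as below. This is a
hypothetical Python illustration, not our software: \texttt{align\_pair} is a
trivial stand-in for the per-pair detection, and \texttt{align\_all} is an
assumed driver. The point it shows is that each document pair is processed
independently and the results are collected by the caller, so the workers
share no mutable state and need no explicit locking.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def align_pair(pair):
    """Hypothetical stand-in for the detection on one document pair.

    Here it merely returns the sorted words shared by both texts.
    """
    susp_text, src_text = pair
    shared = set(susp_text.split()) & set(src_text.split())
    return sorted(shared)


def align_all(pairs, workers=None):
    """Process all document pairs, one task per pair.

    Executor.map() returns results in input order, so the output
    does not depend on the number of workers.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(align_pair, pairs))
```

Calling \texttt{align\_all(pairs, workers=1)} gives the same results as a
multi-threaded call, which mirrors the single-threaded setting of the official
runs. Note that in CPython, threads do not run CPU-bound Python code in
parallel; a process pool would be the closer analogue of per-CPU threads
there.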

For PAN 2010, we used the following postprocessing heuristics:
If there are overlapping detections inside a suspicious document,
keep the longer one, provided that it is long enough. For overlapping
detections up to 600 characters,
TODO