Title update, ext. abstract

diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex

index e284fe19581f6f382a5e8e3c738e6adc569595e9..1cf67e7b18eebf354293184af9bf3ca49e8bd51c 100755 (executable)
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -1,51 +1,130 @@
  \section{Text Alignment}~\label{text_alignment}\r
-\r
-\subsection{Overview}\r
-\r
+%\subsection{Overview}\r
  Our approach at the text alignment subtask of PAN 2013 uses the same\r
  basic principles as our previous work in this area, described\r
  in \cite{suchomel_kas_12}, which in turn builds on our work for previous\r
-PAN campaigns,, \cite{Kasprzak2010}, \cite{Kasprzak2009a}:\r
+PAN campaigns \cite{Kasprzak2010}, \cite{Kasprzak2009a}:\r
  \r
  We detect {\it common features} between source and suspicious documents,\r
-where features we currently use are word $n$-grams, and stop-word $m$-grams\r
+where the features we currently use are word $n$-grams, and stop-word $m$-grams\r
  \cite{stamatatos2011plagiarism}. From those common features (each of which\r
  can occur multiple times in both source and suspicious document), we form\r
  {\it valid intervals}\footnote{%\r
-We describe the algorithm for computing valid intervals in \cite{Kasprzak2009a},\r
-and a similar approach is also used in \cite{stamatatos2011plagiarism}.}\r
+See \cite{Kasprzak2009a} for the algorithm for computing valid intervals;\r
+a similar approach is also used in \cite{stamatatos2011plagiarism}.}\r
  of characters\r
  from the source and suspicious documents, where the interval in both\r
  of these documents is covered ``densely enough'' by the common features.\r
  \r
-We then postprocess the valid intervals, removing overlapping detections,\r
-and merging detections which are close enough to each other.\r
+We then postprocess the valid intervals, removing the overlapping detections,\r
+and merging the detections which are close enough to each other.\r
+\r
+For the training corpus,\r
+our unmodified software from PAN 2012 gave the following results\footnote{%\r
+See \cite{potthastframework} for the definition of {\it plagdet} and the rationale behind this type of scoring.}:\r
+\r
+\def\plagdet#1#2#3#4{\par{\r
+$\textit{plagdet}=#1, \textit{recall}=#2, \textit{precision}=#3, \textit{granularity}=#4$}\hfill\par}\r
+\r
+\plagdet{0.7235}{0.6306}{0.8484}{1.0000}\r
+\r
+We take the above as the baseline for further improvements.\r
+In the following sections, we summarize the modifications we made for PAN 2013.\r
+\r
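+For reference, and assuming the definition given in \cite{potthastframework},\r
+{\it plagdet} combines the three other measures as\r
+\[\r
+\textit{plagdet} = \frac{F_1}{\log_2(1+\textit{granularity})}, \qquad\r
+F_1 = \frac{2\cdot\textit{precision}\cdot\textit{recall}}{\textit{precision}+\textit{recall}}.\r
+\]\r
+With $\textit{granularity}=1$ the denominator equals $1$, so the baseline score\r
+is simply $F_1 = 2\cdot 0.8484\cdot 0.6306/(0.8484+0.6306) \approx 0.7235$.\r
+\r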
+\subsection{Alternative Features}\r
+\label{altfeatures}\r
+\r
+In PAN 2012, we used word 5-grams and stop-word 8-grams.\r
+This year we have experimented with different word $n$-grams, and also\r
+with contextual $n$-grams as described in \cite{torrejondetailed}.\r
+Modifying the algorithm to use contextual $n$-grams, created as word\r
+5-grams with the middle word removed (i.e., the two words before and the two\r
+words after the removed word), yielded a better score:\r
+\r
+\plagdet{0.7421}{0.6721}{0.8282}{1.0000}\r
+\r
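+As an illustration in our own notation (not taken from \cite{torrejondetailed}):\r
+for five consecutive words $w_1 w_2 w_3 w_4 w_5$, the plain word 5-gram is\r
+$(w_1,w_2,w_3,w_4,w_5)$, while the contextual feature built from it is\r
+$(w_1,w_2,w_4,w_5)$, so two passages that differ only in $w_3$ still share it.\r
+\r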
+We then tested plain word 4-grams, and to our surprise,\r
+they gave an even better score than contextual $n$-grams:\r
  \r
-In the next sections, we summarize the modifications we did for PAN 2013,\r
-including approaches tried but not used. For the training corpus,\r
-our software from PAN 2012 gave the plagdet score of TODO, which we\r
-consider the baseline for further improvements.\r
+\plagdet{0.7447}{0.7556}{0.7340}{1.0000}\r
  \r
-\subsection{Alternative features}\r
+It should be noted that these two quite similar approaches (both use\r
+features formed from four words) reach a similar plagdet score,\r
+yet differ markedly in precision and recall. Looking at the individual\r
+parts of the training corpus, plain word 4-grams were better (in terms\r
+of plagdet score) on all of them except the 02-no-obfuscation\r
+part.\r
  \r
-TODO \cite{torrejondetailed}\r
+In our final submission, we used word 4-grams and stop-word 8-grams.\r
  \r
-\subsection{Global postprocessing}\r
+\subsection{Global Postprocessing}\r
  \r
  For PAN 2013, the algorithm had access to all of the source and suspicious\r
-documents. Because of this, we have rewritten our software to handle\r
-all of the documents at once, in order to be able to do cross-document\r
+documents at once. It was not limited to a single document pair, as it was\r
+in 2012. We have rewritten our software to handle\r
+all of the documents in one run, in order to be able to do cross-document\r
  optimizations and postprocessing, similar to what we did for PAN 2010.\r
-This required refactorization of most of the code. We are able to handle\r
-most of the computation in parallel in per-CPU threads, with little\r
-synchronization needed. The parallelization was used especially\r
-for development, where it has provided a significant performance boost.\r
-The official performance numbers are from single-threaded run, though.\r
+%This required refactorization of most of the code. We are able to handle\r
+%most of the computation in parallel in per-CPU threads, with little\r
+%synchronization needed. The parallelization was used especially\r
+%for development, where it has provided a significant performance boost.\r
+%The official performance numbers are from single-threaded run, though.\r
  \r
  For PAN 2010, we have used the following postprocessing heuristics:\r
  If there are overlapping detections inside a suspicious document,\r
  keep the longer one, provided that it is long enough. For overlapping\r
-detections up to 600 characters, \r
-TODO\r
+detections up to 600 characters, drop them both. We have implemented\r
+this heuristic, but found that it led to a lower score than\r
+running without it. Further experiments with global postprocessing\r
+of overlaps led to a new heuristic: if both overlapping detections\r
+are at most 250 characters long, we drop them both unconditionally, but\r
+if at least one of them is longer, we keep both detections. This is probably\r
+a result of plagdet being skewed too much towards recall (because the\r
+percentage of plagiarized cases in the corpus is much higher than in the\r
+real world), so it is favourable to keep a detection even though\r
+the evidence for it is rather weak.\r
+\r
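+Stated compactly, and writing $|d|$ for the length of a detection $d$ in\r
+characters (our notation for this summary only), two overlapping detections\r
+$d_1$ and $d_2$ are both dropped if $\max(|d_1|,|d_2|)\le 250$, and both\r
+kept otherwise.\r
+\r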
+The global postprocessing improved the score even more:\r
+\r
+\plagdet{0.7469}{0.7558}{0.7382}{1.0000}\r
+\r
+\subsection{Evaluation Results and Future Work}\r
+\r
+       The evaluation on the competition corpus had the following results:\r
+\r
+\plagdet{0.7448}{0.7659}{0.7251}{1.0003}\r
+\r
+This is quite similar to what we have seen on the training corpus;\r
+only the granularity differing from 1.0000 is a bit surprising.\r
+%, given\r
+%the aggressive joining of neighbouring detections we perform.\r
+Compared to the other participants, our algorithm performs\r
+especially well for human-created plagiarism (the 05-summary-obfuscation\r
+sub-corpus), which is where we want to focus for our production\r
+systems\footnote{Our production systems include the Czech National Archive\r
+of Graduate Theses,\\ \url{http://theses.cz}}.\r
+\r
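+As a check on the competition scores, assuming the definition\r
+$\textit{plagdet}=F_1/\log_2(1+\textit{granularity})$ from\r
+\cite{potthastframework}: precision $0.7251$ and recall $0.7659$ give\r
+$F_1\approx 0.7449$, and $\log_2(1+1.0003)\approx 1.0002$, hence\r
+$\textit{plagdet}\approx 0.7448$. The granularity above $1$ thus lowers\r
+the score by less than $0.0002$.\r
+\r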
+%      After the final evaluation, we did further experiments\r
+%with feature types, and discovered that using stop-word 8-grams,\r
+%word 4-grams, {\it and} contextual $n$-grams as described in\r
+%Section \ref{altfeatures} performs even better (on a training corpus):\r
+%\r
+%\plagdet{0.7522}{0.7897}{0.7181}{1.0000}\r
+\r
+We plan to experiment further with combining more than two types\r
+of features, be it continuous $n$-grams or contextual features.\r
+This should allow us to tone down the aggressive heuristics for joining\r
+neighbouring detections, which should lead to higher precision,\r
+hopefully without sacrificing recall.\r
+\r
+       As for the computational performance, it should be noted that\r
+our software is prototyped in a scripting language (Perl), so it is not\r
+the fastest possible implementation of the algorithm used. The code\r
+contains about 800 non-comment lines of code, including the parallelization\r
+of most parts and debugging/logging statements.\r
+\r
+       The system is mostly language-independent. The only language-dependent\r
+part of the code is the list of English stop-words for stop-word $n$-grams.\r
+We use no stemming or other kinds of language-dependent processing.\r
  \r
  \r