X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-paper%2Fyenya-text_alignment.tex;h=1cf67e7b18eebf354293184af9bf3ca49e8bd51c;hb=9e3bea6abbc34854e6fc92ba08c2200290e685cd;hp=e284fe19581f6f382a5e8e3c738e6adc569595e9;hpb=a864666d2318f3ba5af8f0d53d36981254386043;p=pan13-paper.git

diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index e284fe1..1cf67e7 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -1,51 +1,130 @@
 \section{Text Alignment}~\label{text_alignment}
-
-\subsection{Overview}
-
+%\subsection{Overview}
 Our approach at the text alignment subtask of PAN 2013 uses the same
 basic principles as our previous work in this area, described in
 \cite{suchomel_kas_12}, which in turn builds on our work for previous
-PAN campaigns,, \cite{Kasprzak2010}, \cite{Kasprzak2009a}:
+PAN campaigns \cite{Kasprzak2010}, \cite{Kasprzak2009a}:
 We detect {\it common features} between source and suspicious documents,
-where features we currently use are word $n$-grams, and stop-word $m$-grams
+where the features we currently use are word $n$-grams and stop-word $m$-grams
 \cite{stamatatos2011plagiarism}.
 From those common features (each of which can occur multiple times in both
 source and suspicious document), we form {\it valid intervals}\footnote{%
-We describe the algorithm for computing valid intervals in \cite{Kasprzak2009a},
-and a similar approach is also used in \cite{stamatatos2011plagiarism}.}
+See \cite{Kasprzak2009a} for the algorithm for computing valid intervals;
+a similar approach is also used in \cite{stamatatos2011plagiarism}.}
 of characters from the source and suspicious documents, where the interval
 in both of these documents is covered ``densely enough'' by the common
 features.
-We then postprocess the valid intervals, removing overlapping detections,
-and merging detections which are close enough to each other.
+We then postprocess the valid intervals, removing the overlapping detections
+and merging the detections which are close enough to each other.
+
+For the training corpus,
+our unmodified software from PAN 2012 gave the following results\footnote{%
+See \cite{potthastframework} for the definition of {\it plagdet} and the rationale behind this type of scoring.}:
+
+\def\plagdet#1#2#3#4{\par{
+$\textit{plagdet}=#1, \textit{recall}=#2, \textit{precision}=#3, \textit{granularity}=#4$}\hfill\par}
+
+\plagdet{0.7235}{0.6306}{0.8484}{1.0000}
+
+We take the above as the baseline for further improvements.
+In the following sections, we summarize the modifications we made for PAN 2013.
+
+\subsection{Alternative Features}
+\label{altfeatures}
+
+In PAN 2012, we used word 5-grams and stop-word 8-grams.
+This year, we experimented with different word $n$-grams, and also
+with contextual $n$-grams as described in \cite{torrejondetailed}.
+Modifying the algorithm to use contextual $n$-grams, created as word
+5-grams with the middle word removed (i.e.\ the context is two words before
+and two words after the removed word), yielded a better score:
+
+\plagdet{0.7421}{0.6721}{0.8282}{1.0000}
+
+We then ran tests with plain word 4-grams, and to our surprise,
+they gave an even better score than the contextual $n$-grams:
 
-In the next sections, we summarize the modifications we did for PAN 2013,
-including approaches tried but not used. For the training corpus,
-our software from PAN 2012 gave the plagdet score of TODO, which we
-consider the baseline for further improvements.
+\plagdet{0.7447}{0.7556}{0.7340}{1.0000}
 
-\subsection{Alternative features}
+It should be noted that although these two approaches are quite similar
+(both use features formed from four words) and have a similar plagdet
+score, their precision and recall values are completely different.
+Looking at the individual parts of the training corpus, plain word
+4-grams performed better (in terms of plagdet score) on all parts
+except the 02-no-obfuscation part.
 
-TODO \cite{torrejondetailed}
+In our final submission, we used word 4-grams and stop-word 8-grams
+(a short illustrative sketch of these feature types is given below).
 
-\subsection{Global postprocessing}
+\subsection{Global Postprocessing}
 
 For PAN 2013, the algorithm had access to all of the source and suspicious
-documents. Because of this, we have rewritten our software to handle
-all of the documents at once, in order to be able to do cross-document
+documents at once; it was not limited to a single document pair, as it was
+in 2012. We have therefore rewritten our software to handle
+all of the documents in one run, in order to be able to perform cross-document
 optimizations and postprocessing, similar to what we did for PAN 2010.
-This required refactorization of most of the code. We are able to handle
-most of the computation in parallel in per-CPU threads, with little
-synchronization needed. The parallelization was used especially
-for development, where it has provided a significant performance boost.
-The official performance numbers are from single-threaded run, though.
+%This required refactorization of most of the code. We are able to handle
+%most of the computation in parallel in per-CPU threads, with little
+%synchronization needed. The parallelization was used especially
+%for development, where it has provided a significant performance boost.
+%The official performance numbers are from single-threaded run, though.
 
 For PAN 2010, we have used the following postprocessing heuristics:
 If there are overlapping detections inside a suspicious document,
 keep the longer one, provided that it is long enough. For overlapping
-detections up to 600 characters,
-TODO
+detections up to 600 characters, drop them both. We have implemented
+this heuristic, but found that it led to a lower score than
+without this modification. Further experiments with global postprocessing
+of overlaps led to a new heuristic (sketched below): we unconditionally
+drop both overlapping detections if each of them is at most 250 characters
+long, but if at least one of them is longer, we keep both detections.
+This is probably a consequence of plagdet being skewed too much towards
+recall (because the percentage of plagiarized cases in the corpus is much
+higher than in the real world), so it is favourable to keep a detection
+even though the evidence for it is rather weak.
+
+The global postprocessing improved the score further:
+
+\plagdet{0.7469}{0.7558}{0.7382}{1.0000}
+
+\subsection{Evaluation Results and Future Work}
+
+The evaluation on the competition corpus gave the following results:
+
+\plagdet{0.7448}{0.7659}{0.7251}{1.0003}
+
+This is quite similar to what we have seen on the training corpus,
+the only surprise being the granularity value different from 1.0.
+%, given
+%the aggressive joining of neighbouring detections we perform.
+Compared to the other participants, our algorithm performs
+especially well on human-created plagiarism (the 05-summary-obfuscation
+sub-corpus), which is the area we want to focus on for our production
+systems\footnote{Our production systems include the Czech National Archive
+of Graduate Theses,\\ \url{http://theses.cz}.}.
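+
+As an illustration of the feature types described in
+Section~\ref{altfeatures}, the following minimal Python sketch shows how
+plain word 4-grams and contextual $n$-grams (word 5-grams with the middle
+word removed) can be computed from a tokenized document. This is purely
+illustrative: it is not our Perl implementation, the function names and
+the sample sentence are made up for this sketch, and details such as
+tokenization and the character offsets of the features are omitted.
+
+\begin{verbatim}
+def word_ngrams(tokens, n=4):
+    # plain word n-grams: every n consecutive tokens
+    return [tuple(tokens[i:i + n])
+            for i in range(len(tokens) - n + 1)]
+
+def contextual_ngrams(tokens, n=5):
+    # word n-grams with the middle word removed
+    # (two words of context on each side for n=5)
+    half = n // 2
+    return [tuple(tokens[i:i + half] + tokens[i + half + 1:i + n])
+            for i in range(len(tokens) - n + 1)]
+
+tokens = "the quick brown fox jumps over it".split()
+print(word_ngrams(tokens)[0])
+# ('the', 'quick', 'brown', 'fox')
+print(contextual_ngrams(tokens)[0])
+# ('the', 'quick', 'fox', 'jumps')
+\end{verbatim}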
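+
+Similarly, the overlap heuristic described in the Global Postprocessing
+subsection can be sketched as follows. Again, this is only an
+illustrative Python fragment under the assumption that detections are
+represented as (start, end) character intervals within one suspicious
+document; the names and the example values are made up for this sketch.
+
+\begin{verbatim}
+MAX_DROP_LEN = 250  # threshold discussed above
+
+def overlaps(a, b):
+    # a, b are (start, end) character intervals
+    return a[0] < b[1] and b[0] < a[1]
+
+def drop_short_overlaps(detections):
+    drop = set()
+    for i, a in enumerate(detections):
+        for j, b in enumerate(detections):
+            if i < j and overlaps(a, b) \
+               and a[1] - a[0] <= MAX_DROP_LEN \
+               and b[1] - b[0] <= MAX_DROP_LEN:
+                drop.update((i, j))
+    return [d for k, d in enumerate(detections)
+            if k not in drop]
+
+print(drop_short_overlaps([(0, 100), (50, 150), (200, 900)]))
+# [(200, 900)]: the two short overlapping detections
+# are dropped, the long one is kept
+\end{verbatim}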
+
+% After the final evaluation, we did further experiments
+%with feature types, and discovered that using stop-word 8-grams,
+%word 4-grams, {\it and} contextual $n$-grams as described in
+%Section \ref{altfeatures} performs even better (on a training corpus):
+%
+%\plagdet{0.7522}{0.7897}{0.7181}{1.0000}
 
+We plan to experiment further with combining more than two types
+of features, be it continuous $n$-grams or contextual features.
+This should allow us to tune down the aggressive heuristic for joining
+neighbouring detections, which should lead to higher precision,
+hopefully without sacrificing recall.
+
+As for the computational performance, it should be noted that
+our software is prototyped in a scripting language (Perl), so it is not
+the fastest possible implementation of the algorithm used. The code
+contains about 800 non-comment lines, including the parallelization
+of most parts and debugging/logging statements.
+
+The system is mostly language independent. The only language-dependent
+part of the code is the list of English stop-words for stop-word $n$-grams.
+We use no stemming or other kinds of language-dependent processing.
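+
+To illustrate the only language-dependent step, the following Python
+sketch shows one way the stop-word 8-grams can be formed: keep only the
+stop-words of the text, in their original order, and slide a window of
+eight words over the resulting sequence. The stop-word list shown here is
+just a small placeholder for the full English list used by our system,
+and the function name is made up for this sketch.
+
+\begin{verbatim}
+STOP_WORDS = {"the", "of", "and", "a", "in",
+              "to", "is", "was", "it", "for",
+              "with", "as", "on", "that", "by"}
+
+def stopword_ngrams(tokens, m=8):
+    # keep only stop-words, in order, then take
+    # every m consecutive ones; returns an empty
+    # list if fewer than m stop-words are present
+    sw = [t for t in tokens if t.lower() in STOP_WORDS]
+    return [tuple(sw[i:i + m])
+            for i in range(len(sw) - m + 1)]
+\end{verbatim}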