Full text of Yenya's chapter.

author Jan "Yenya" Kasprzak <kas@fi.muni.cz>

Fri, 31 May 2013 17:33:50 +0000 (19:33 +0200)

committer Jan "Yenya" Kasprzak <kas@fi.muni.cz>

Fri, 31 May 2013 17:33:50 +0000 (19:33 +0200)
diff --git a/pan13-paper/pan13-notebook.bib b/pan13-paper/pan13-notebook.bib

index 6279f197f6e976294655ff4c4ab9f293826eb5f8..4960cc121abe3bf65c9ba2ba3c212b557fc6abdd 100755 (executable)
--- a/pan13-paper/pan13-notebook.bib
+++ b/pan13-paper/pan13-notebook.bib
@@ -118,3 +118,15 @@
    pages={1--8},\r
    year={2012}\r
  }\r
+\r
+@INPROCEEDINGS{potthastframework,\r
+        TITLE              = {{An Evaluation Framework for Plagiarism Detection}\r
+},\r
+        AUTHOR             = {Martin Potthast and Benno Stein and Alberto Barr{\'o}n-Cede{\~n}o and Paolo Rosso},\r
+        BOOKTITLE          = {Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010)},\r
+        MONTH              = aug,\r
+        YEAR               = {2010},\r
+        ADDRESS            = {Beijing, China},\r
+        PUBLISHER          = {Association for Computational Linguistics},\r
+}\r
+\r
diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex

index e284fe19581f6f382a5e8e3c738e6adc569595e9..b57dc26b6ad21de9bc5f09cea38dcb3d796e2285 100755 (executable)
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -5,10 +5,10 @@
  Our approach at the text alignment subtask of PAN 2013 uses the same\r
  basic principles as our previous work in this area, described\r
  in \cite{suchomel_kas_12}, which in turn builds on our work for previous\r
-PAN campaigns,, \cite{Kasprzak2010}, \cite{Kasprzak2009a}:\r
+PAN campaigns \cite{Kasprzak2010}, \cite{Kasprzak2009a}:\r
  \r
  We detect {\it common features} between source and suspicious documents,\r
-where features we currently use are word $n$-grams, and stop-word $m$-grams\r
+where the features we currently use are word $n$-grams, and stop-word $m$-grams\r
  \cite{stamatatos2011plagiarism}. From those common features (each of which\r
  can occur multiple times in both source and suspicious document), we form\r
  {\it valid intervals}\footnote{%\r
@@ -18,23 +18,54 @@ of characters
  from the source and suspicious documents, where the interval in both\r
  of these documents is covered ``densely enough'' by the common features.\r
  \r
-We then postprocess the valid intervals, removing overlapping detections,\r
-and merging detections which are close enough to each other.\r
+We then postprocess the valid intervals, removing the overlapping detections,\r
+and merging the detections which are close enough to each other.\r
  \r
+For the training corpus,\r
+our unmodified software from PAN 2012 gave the following results\footnote{%\r
+See \cite{potthastframework} for the definition of {\it plagdet} and the rationale for this type of scoring.}:\r
+\r
+\def\plagdet#1#2#3#4{\par{\r
+$\textit{plagdet}=#1, \textit{recall}=#2, \textit{precision}=#3, \textit{granularity}=#4$}\hfill\par}\r
+\r
+\plagdet{0.7235}{0.6306}{0.8484}{1.0000}\r
+\r
+We take the above as the baseline for further improvements.\r
  In the next sections, we summarize the modifications we did for PAN 2013,\r
-including approaches tried but not used. For the training corpus,\r
-our software from PAN 2012 gave the plagdet score of TODO, which we\r
-consider the baseline for further improvements.\r
+including approaches tried but not used.\r
+\r
+\subsection{Alternative Features}\r
+\label{altfeatures}\r
+\r
+In PAN 2012, we used word 5-grams and stop-word 8-grams.\r
+This year we have experimented with different word $n$-grams, and also\r
+with contextual $n$-grams as described in \cite{torrejondetailed}.\r
+Modifying the algorithm to use contextual $n$-grams created as word\r
+5-grams with the middle word removed (i.e. two words before and two words\r
+after the removed word) yielded a better score:\r
  \r
-\subsection{Alternative features}\r
+\plagdet{0.7421}{0.6721}{0.8282}{1.0000}\r
  \r
-TODO \cite{torrejondetailed}\r
+We then tested plain word 4-grams, and to our surprise,\r
+they gave an even better score than contextual $n$-grams:\r
  \r
-\subsection{Global postprocessing}\r
+\plagdet{0.7447}{0.7556}{0.7340}{1.0000}\r
+\r
+It should be noted that these two quite similar approaches (both use\r
+features formed from four words), while having a similar plagdet score,\r
+differ markedly in their precision and recall values. Looking at the\r
+individual parts of the training corpus, plain word 4-grams were better\r
+on all of them (in terms of plagdet score), except the 02-no-obfuscation\r
+part.\r
+\r
+In our final submission, we used word 4-grams and stop-word 8-grams.\r
+\r
+\subsection{Global Postprocessing}\r
  \r
  For PAN 2013, the algorithm had access to all of the source and suspicious\r
-documents. Because of this, we have rewritten our software to handle\r
-all of the documents at once, in order to be able to do cross-document\r
+documents at once. It was not limited to a single document pair, as in\r
+2012. Because of this, we have rewritten our software to handle\r
+all of the documents in one run, in order to be able to do cross-document\r
  optimizations and postprocessing, similar to what we did for PAN 2010.\r
  This required refactorization of most of the code. We are able to handle\r
  most of the computation in parallel in per-CPU threads, with little\r
@@ -45,7 +76,55 @@ The official performance numbers are from single-threaded run, though.
  For PAN 2010, we have used the following postprocessing heuristics:\r
  If there are overlapping detections inside a suspicious document,\r
  keep the longer one, provided that it is long enough. For overlapping\r
-detections up to 600 characters, \r
-TODO\r
+detections up to 600 characters, drop them both. We have implemented\r
+this heuristic, but have found that it led to a lower score than\r
+without this modification. Further experiments with global postprocessing\r
+of overlaps led to a new heuristic: we unconditionally drop both overlapping\r
+detections if each of them is up to 250 characters long, but if at least one\r
+of them is longer, we keep both detections. This is probably a result of\r
+plagdet being skewed too much towards recall (because the percentage of\r
+plagiarized cases in the corpus is far higher than in the real world),\r
+so it is favourable to keep a detection even though the evidence\r
+for it is rather weak.\r
+\r
+The global postprocessing improved the score even more:\r
+\r
+\plagdet{0.7469}{0.7558}{0.7382}{1.0000}\r
+\r
+\subsection{Evaluation Results and Future Work}\r
+\r
+       The evaluation on the competition corpus had the following results:\r
+\r
+\plagdet{0.7448}{0.7659}{0.7251}{1.0003}\r
+\r
+This is quite similar to what we have seen on the training corpus;\r
+only the granularity different from 1.000 is a bit surprising, given\r
+the aggressive joining of neighbouring detections we perform.\r
+Compared to the other participants, our algorithm performs\r
+especially well for human-created plagiarism (the 05-summary-obfuscation\r
+sub-corpus), which is where we want to focus for our production\r
+systems\footnote{Our production systems include the Czech National Archive\r
+of Graduate Theses, \url{http://theses.cz}}.\r
+\r
+       After the final evaluation, we did further experiments\r
+with feature types, and discovered that using stop-word 8-grams,\r
+word 4-grams, {\it and} contextual $n$-grams as described in\r
+Section \ref{altfeatures} performs even better (on the training corpus):\r
+\r
+\plagdet{0.7522}{0.7897}{0.7181}{1.0000}\r
+\r
+We plan to experiment further with combining more than two types\r
+of features, be it continuous $n$-grams or contextual features.\r
+This should allow us to tone down the aggressive heuristics for joining\r
+neighbouring detections, which should lead to higher precision,\r
+hopefully without sacrificing recall.\r
+\r
+       As for the computational performance, it should be noted that\r
+our software is prototyped in a scripting language (Perl), so it is not\r
+the fastest possible implementation of the algorithm used. The code\r
+contains about 800 non-comment lines of code, including the parallelization\r
+of most parts and debugging/logging statements. The only language-dependent\r
+part of the code is the list of English stop-words for stop-word $n$-grams.\r
+We use no stemming or other kinds of language-dependent processing.\r
  \r
  \r
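
Note on the scores quoted in the patch above: the plagdet measure from the cited evaluation framework paper is the harmonic mean of precision and recall divided by log2(1 + granularity). The following is a minimal Python sketch for re-checking the reported figures. It is not part of this commit, and the function name `plagdet` is chosen here only for illustration.

```python
import math


def plagdet(recall: float, precision: float, granularity: float) -> float:
    """Combine recall, precision and granularity into a plagdet score.

    Sketch of the definition in the cited framework paper:
    F1(precision, recall) / log2(1 + granularity).
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Baseline figures quoted in the patch (PAN 2012 software, training corpus).
baseline = plagdet(recall=0.6306, precision=0.8484, granularity=1.0000)
print(f"{baseline:.4f}")
```

Because the published recall and precision are themselves rounded to four digits, a recomputed score can differ from the reported one in the last digit.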
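
Note on the overlap heuristic described in the "Global Postprocessing" hunk: the rule for PAN 2013 drops two overlapping detections only when both are at most 250 characters long, and keeps both otherwise. The sketch below illustrates that rule in Python under stated assumptions: detections are represented as (start, end) character offsets within one suspicious document, with the end offset exclusive, and overlaps are checked pairwise. The actual Perl implementation is not shown in this commit, so the data layout and the names `Detection`, `overlaps` and `drop_short_overlaps` are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: (start, end) character offsets, end exclusive.
Detection = Tuple[int, int]


def overlaps(a: Detection, b: Detection) -> bool:
    """Return True if the two half-open intervals share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def drop_short_overlaps(detections: List[Detection],
                        max_len: int = 250) -> List[Detection]:
    """Drop both members of every overlapping pair in which each detection
    is at most max_len characters long; keep everything else."""
    dropped = set()
    for i, a in enumerate(detections):
        for j in range(i + 1, len(detections)):
            b = detections[j]
            both_short = (a[1] - a[0]) <= max_len and (b[1] - b[0]) <= max_len
            if both_short and overlaps(a, b):
                dropped.update((i, j))
    return [d for k, d in enumerate(detections) if k not in dropped]


# Two short overlapping detections are dropped; the isolated one survives.
print(drop_short_overlaps([(0, 100), (50, 200), (1000, 1100)]))
# A short detection overlapping a longer one: both are kept.
print(drop_short_overlaps([(0, 100), (50, 400)]))
```

The pairwise loop is quadratic in the number of detections per document, which is acceptable for a sketch; sorting by start offset first would allow a linear sweep.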