diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index 1cf67e7..1b54ac7 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -34,8 +34,8 @@ In the next sections, we summarize the modifications we did for PAN 2013.
\subsection{Alternative Features}
\label{altfeatures}
-In PAN 2012, we have used word 5-grams and stop-word 8-grams.
-This year we have experimented with different word $n$-grams, and also
+In PAN 2012, we used word 5-grams and stop-word 8-grams.
+This year we experimented with different word $n$-grams, and also
with contextual $n$-grams as described in \cite{torrejondetailed}.
Modifying the algorithm to use contextual $n$-grams created as word
5-grams with the middle word removed (i.e. two words before and two words
@@ -43,7 +43,7 @@ after the context) yielded better score:
\plagdet{0.7421}{0.6721}{0.8282}{1.0000}
-We have then made tests with plain word 4-grams, and to our surprise,
+We then made tests with plain word 4-grams, and to our surprise,
it gave even better score than contextual $n$-grams:
\plagdet{0.7447}{0.7556}{0.7340}{1.0000}
@@ -55,7 +55,7 @@ training corpus parts, plain word 4-grams were better at all parts
of the corpus (in terms of plagdet score), except the 02-no-obfuscation
part.
-In our final submission, we have used word 4-grams and stop-word 8-grams.
+In our final submission, we used word 4-grams and stop-word 8-grams.
\subsection{Global Postprocessing}
@@ -70,17 +70,17 @@ optimizations and postprocessing, similar to what we did for PAN 2010.
%for development, where it has provided a significant performance boost.
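The contextual features discussed in the hunk above (word 5-grams with the middle word removed, i.e. two words of context on each side) can be sketched as follows. This is an illustration only: the paper's actual implementation is in Perl, and the function name and tuple representation are assumptions, not the authors' code.

```python
def contextual_ngrams(words):
    # Contextual n-grams: take each word 5-gram and drop its middle
    # word, keeping two words of context before and two after.
    # Returns a list of 4-tuples of words.
    return [
        (words[i], words[i + 1], words[i + 3], words[i + 4])
        for i in range(len(words) - 4)
    ]
```

For the token sequence `a b c d e f` this yields `(a, b, d, e)` and `(b, c, e, f)`.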
%The official performance numbers are from single-threaded run, though.
-For PAN 2010, we have used the following postprocessing heuristics:
+For PAN 2010, we used the following postprocessing heuristics:
If there are overlapping detections inside a suspicious document,
keep the longer one, provided that it is long enough. For overlapping
-detections up to 600 characters, drop them both. We have implemented
-this heuristics, but have found that it led to a lower score than
+detections up to 600 characters, drop them both. We implemented
+this heuristic, but found that it led to a lower score than
without this modification. Further experiments with global postprocessing
of overlaps led to a new heuristics: we unconditionally drop overlapping
detections with up to 250 characters both, but if at least one of them
is longer, we keep both detections. This is probably a result of
plagdet being skewed too much towards recall (because the percentage of
-plagiarized cases in the corpus is way too high compared to real world),
+plagiarized cases in the corpus is way too high compared to the real world),
so it is favourable to keep the detection even though the evidence for
it is rather low.
@@ -90,7 +90,7 @@ The global postprocessing improved the score even more:
\subsection{Evaluation Results and Future Work}
- The evaulation on the competition corpus had the following results:
+ The evaluation on the competition corpus had the following results:
\plagdet{0.7448}{0.7659}{0.7251}{1.0003}
@@ -113,9 +113,9 @@ of Graduate Theses,\\ \url{http://theses.cz}}.
We plan to experiment further with combining more than two types of
features, be it continuous $n$-grams or contextual features.
-This should allow us to tune down the aggresive heuristics for joining
+This should allow us to tune down the aggressive heuristics for joining
neighbouring detections, which should lead to higher precision,
-hopefully without sacrifying recall.
+hopefully without sacrificing recall.
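The revised overlap heuristic described in the postprocessing hunk (unconditionally drop a pair of overlapping detections only when both are at most 250 characters long; if at least one is longer, keep both) can be sketched as below. Python is used for illustration only; the detection representation as `(start, end)` character offsets and all names are assumptions, not the authors' Perl data model.

```python
def drop_short_overlaps(detections, limit=250):
    # Drop both members of every overlapping pair of detections when
    # both are at most `limit` characters long; if at least one member
    # of the pair is longer, keep both detections.
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def length(d):
        return d[1] - d[0]

    dropped = set()
    for i, a in enumerate(detections):
        for b in detections[i + 1:]:
            if overlaps(a, b) and length(a) <= limit and length(b) <= limit:
                dropped.add(a)
                dropped.add(b)
    return [d for d in detections if d not in dropped]
```

Here two short (200-character) overlapping detections are both discarded, while an overlapping pair in which one detection is long is kept intact, matching the "keep the detection even though the evidence for it is rather low" rationale in the text.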
As for the computational performance, it should be noted that our software is prototyped in a scripting language (Perl), so it is not