From 9e3bea6abbc34854e6fc92ba08c2200290e685cd Mon Sep 17 00:00:00 2001
From: "Jan \"Yenya\" Kasprzak"
Date: Sat, 1 Jun 2013 16:52:14 +0200
Subject: [PATCH] Title revision, ext. abstract

---
 pan13-paper/extended-abstract.txt    | 22 ++++++++++++++++++++++
 pan13-paper/pan13-notebook.tex       |  8 ++++----
 pan13-paper/yenya-text_alignment.tex |  2 +-
 3 files changed, 27 insertions(+), 5 deletions(-)
 create mode 100644 pan13-paper/extended-abstract.txt

diff --git a/pan13-paper/extended-abstract.txt b/pan13-paper/extended-abstract.txt
new file mode 100644
index 0000000..53290e5
--- /dev/null
+++ b/pan13-paper/extended-abstract.txt
@@ -0,0 +1,22 @@
+This paper describes our approaches for the Plagiarism Detection task
+of PAN 2013.
+
+We present a modified three-way search methodology for the source retrieval subtask.
+TODO Something more detailed.
+
+For the text alignment subtask, we use an approach similar to that of PAN 2012.
+We detect common features of various types between the suspicious and source
+documents. We have experimented with additional feature types. The best
+results came from combining sorted word 4-grams with unsorted stop-word
+8-grams. From the common features we compute valid intervals, which map
+passages of the suspicious document to passages of the source document
+such that these passages are covered ``densely enough'' with corresponding
+common features. For PAN 2013, we have modified the postprocessing phase:
+because the algorithm had access to the whole corpus of source and
+suspicious documents at once, we could process the documents in one
+batch and perform global postprocessing, handling overlapping
+detections not only between the given suspicious and source document,
+but also among all the detections from a given suspicious document.
+The modifications brought a significant improvement over PAN 2012
+on the training corpus, and the results on the competition corpus
+are similar enough to claim that these improvements are usable in general.
diff --git a/pan13-paper/pan13-notebook.tex b/pan13-paper/pan13-notebook.tex
index 1d13300..8adaa7f 100755
--- a/pan13-paper/pan13-notebook.tex
+++ b/pan13-paper/pan13-notebook.tex
@@ -7,7 +7,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{document}
-\title{Diverse Queries and Feature Type Selection for Pairwise Document Comparison}
+\title{Diverse Queries and Feature Type Selection for Plagiarism Discovery}
 %%% Please do not remove the subtitle.
 \subtitle{Notebook for PAN at CLEF 2013}
@@ -22,9 +22,9 @@ This paper describes approaches used for the Plagiarism Detection task in PAN 20
 on uncovering plagiarism, authorship, and social software misuse. We present modified three-way search methodology for Source Retrieval subtask and analyse snippet similarity performance. The results show, that presented approach is adaptable in real-world plagiarism situations.
-For the Detailed Comparison task, we discuss feature type selection,
-global postprocessing. We significantly improved the pairwise comparison
-results with even further optimizations possible.
+For the Detailed Comparison task, we discuss feature type selection and
+global postprocessing. The resulting performance is significantly better
+with the described modifications, and further improvement is still possible.
 \end{abstract}
diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index 1f4f5cf..1cf67e7 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -102,7 +102,7 @@ Compared to the other participants, our algorithm performs especially well
 for human-created plagiarism (the 05-summary-obfuscation sub-corpus), which is where we want to focus for our production systems\footnote{Our production systems include the Czech National Archive
-of Graduate Theses, \url{http://theses.cz}}.
+of Graduate Theses,\\ \url{http://theses.cz}}.
 % After the final evaluation, we did further experiments
 %with feature types, and discovered that using stop-word 8-grams,
-- 
2.43.0
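
Note on the extended abstract: the feature types it names (sorted word 4-grams and unsorted stop-word 8-grams) might be computed roughly as in the minimal Python sketch below. This is an illustration under stated assumptions, not the authors' implementation: the tokenizer, the tiny `STOP_WORDS` set, and all function names are hypothetical, and the computation of valid intervals from the common features is not shown.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = frozenset({"the", "of", "and", "a", "in", "to", "is", "it", "for", "on"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sorted_word_ngrams(words, n=4):
    """Yield (word_offset, feature) for every window of n words.

    The words inside each window are sorted, so reordering words within
    a window still produces the same feature.
    """
    for i in range(len(words) - n + 1):
        yield i, ("w",) + tuple(sorted(words[i:i + n]))


def stop_word_ngrams(words, n=8):
    """Yield (word_offset, feature) for every run of n consecutive stop words.

    Non-stop words are skipped; the stop words keep their original order.
    """
    stops = [(i, w) for i, w in enumerate(words) if w in STOP_WORDS]
    for k in range(len(stops) - n + 1):
        window = stops[k:k + n]
        yield window[0][0], ("s",) + tuple(w for _, w in window)


def features(text):
    """Map each feature of both types to the word offsets where it starts."""
    words = tokenize(text)
    index = defaultdict(list)
    for offset, feat in sorted_word_ngrams(words):
        index[feat].append(offset)
    for offset, feat in stop_word_ngrams(words):
        index[feat].append(offset)
    return index


def common_features(suspicious, source):
    """Return sorted (suspicious_offset, source_offset) pairs of shared features."""
    susp, src = features(suspicious), features(source)
    return sorted(
        (i, j)
        for feat in susp.keys() & src.keys()
        for i in susp[feat]
        for j in src[feat]
    )
```

For example, `common_features("the quick brown fox jumps", "a fox brown quick the end")` yields `[(0, 1)]`: the window at word 0 of the first text and the window at word 1 of the second contain the same four words in a different order, which the sorting makes comparable.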