Title update, extended abstract

author Jan "Yenya" Kasprzak <kas@fi.muni.cz>
Sat, 1 Jun 2013 14:52:14 +0000 (16:52 +0200)
committer Jan "Yenya" Kasprzak <kas@fi.muni.cz>
Sat, 1 Jun 2013 14:52:14 +0000 (16:52 +0200)
diff --git a/pan13-paper/extended-abstract.txt b/pan13-paper/extended-abstract.txt
new file mode 100644
index 0000000..53290e5
--- /dev/null
+++ b/pan13-paper/extended-abstract.txt
@@ -0,0 +1,22 @@
+This paper describes our approaches for the Plagiarism Detection task
+of PAN 2013.
+
+We present a modified three-way search methodology for the source retrieval subtask.
+TODO Something more detailed.
+
+For the text alignment subtask, we use an approach similar to that of PAN 2012.
+We detect common features of various types between the suspicious and source
+documents. We have experimented with additional feature types. The best
+results were achieved by combining sorted word 4-grams with unsorted stop-word
+8-grams. From the common features we compute valid intervals, which map
+passages from the suspicious document to the passages of the source document,
+such that these passages are covered ``densely enough'' with corresponding
+common features. For PAN 2013, we have modified the postprocessing phase:
+the fact that the algorithm had access to the whole corpus of source and
+suspicious documents at once allowed us to process the documents in one
+batch and to perform a global post-processing, handling the overlapping
+detections not only between the given suspicious and source document,
+but also between all the detections from a given suspicious document.
+The modifications brought a significant improvement compared to PAN 2012
+on the training corpus, and the results from the competition corpus
+are similar enough to claim that these improvements are usable in general.
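
Reviewer note: the feature types named in the abstract above (sorted word 4-grams and unsorted stop-word 8-grams) can be illustrated with a minimal sketch. This is not the authors' implementation. The tokenizer regex, the tiny stop-word list, and all function names below are assumptions made purely for illustration, and the computation of valid intervals and the global post-processing are not covered.

```python
import re

# Assumption: a tiny illustrative stop-word list; the list actually used
# by the authors is not given in this commit.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it",
              "for", "with", "that"}


def words(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sorted_word_ngrams(tokens, n=4):
    """Word n-grams with the words inside each n-gram sorted,
    so that local word reordering still yields the same feature."""
    return {tuple(sorted(tokens[i:i + n]))
            for i in range(len(tokens) - n + 1)}


def stop_word_ngrams(tokens, n=8):
    """N-grams over the stop words only, kept in their original (unsorted) order."""
    stops = [t for t in tokens if t in STOP_WORDS]
    return {tuple(stops[i:i + n]) for i in range(len(stops) - n + 1)}


def common_features(suspicious, source):
    """Features of both types shared by a suspicious and a source document."""
    s, t = words(suspicious), words(source)
    return {
        "sorted_word_4grams": sorted_word_ngrams(s) & sorted_word_ngrams(t),
        "stop_word_8grams": stop_word_ngrams(s) & stop_word_ngrams(t),
    }
```

For example, `common_features("the quick brown fox jumps", "brown quick the fox sleeps")` finds one shared sorted word 4-gram even though the word order differs, which is the point of sorting inside each n-gram. Stop-word n-grams are left unsorted here because the abstract describes them that way.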
diff --git a/pan13-paper/pan13-notebook.tex b/pan13-paper/pan13-notebook.tex
index 1d1330065b5e6b618dcf3cd86229f914d72d23c2..8adaa7fbe8b4bb50701a04f4498a2289c6fa0aa5 100755
--- a/pan13-paper/pan13-notebook.tex
+++ b/pan13-paper/pan13-notebook.tex
@@ -7,7 +7,7 @@
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \begin{document}
  
-\title{Diverse Queries and Feature Type Selection for Pairwise Document Comparison}
+\title{Diverse Queries and Feature Type Selection for Plagiarism Discovery}
  %%% Please do not remove the subtitle.
  \subtitle{Notebook for PAN at CLEF 2013}
  
@@ -22,9 +22,9 @@ This paper describes approaches used for the Plagiarism Detection task in PAN 20
  on uncovering plagiarism, authorship, and social software misuse.  
  We present a modified three-way search methodology for the Source Retrieval subtask and analyse snippet similarity performance.
  The results show that the presented approach is adaptable in real-world plagiarism situations.
-For the Detailed Comparison task, we discuss feature type selection,
-global postprocessing. We significantly improved the pairwise comparison
-results with even further optimizations possible.
+For the Detailed Comparison task, we discuss feature type selection and
+global postprocessing. Resulting performance is significantly better
+with the described modifications, and further improvement is still possible.
  \end{abstract}
  
  
diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index 1f4f5cf58a37c6e09a7a2d609cbcc340fd16c0ab..1cf67e7b18eebf354293184af9bf3ca49e8bd51c 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -102,7 +102,7 @@ Compared to the other participants, our algorithm performs
  especially well for human-created plagiarism (the 05-summary-obfuscation
  sub-corpus), which is where we want to focus for our production
  systems\footnote{Our production systems include the Czech National Archive
-of Graduate Theses, \url{http://theses.cz}}.
+of Graduate Theses,\\ \url{http://theses.cz}}.
  
  %      After the final evaluation, we did further experiments
  %with feature types, and discovered that using stop-word 8-grams,