- added keywords
- extended abstract filed as doc instead of txt
- minor edit of Simon's chapter
+++ /dev/null
-This paper describes our approaches for the Plagiarism Detection task of PAN 2013. We present a modified three-way search methodology for the source retrieval subtask. We introduce a new query type – paragraph-based queries. Their purpose is to check some parts of the suspicious text in more depth. The other two query types are: keywords-based queries, for retrieving documents concerning the same theme; and intrinsic-plagiarism-based queries, for retrieving sources containing text whose writing style was detected as differing from other parts of the suspicious document. Query execution was controlled by the query type and by preliminary similarities discovered during the searches. We discuss a 2-tuple snippet similarity measure for deciding whether to download a search result; it indicates how many neighbouring word pairs occur in both the snippet and the suspicious document. Our tests indicate an advantageous setting of the snippet similarity threshold. The results show that our approach had the second best ratio of recall to the number of used queries, which reflects query efficacy. Our approach achieved low precision, probably due to reporting many results which were not considered correct hits. Nonetheless, those results contained some textual similarity according to the text alignment subtask score, which we believe is still worthwhile to report.
-For the text alignment subtask, we use an approach similar to the one we used in PAN 2012. We detect common features of various types between the suspicious and source documents. We experimented with several feature types. The best results were achieved by the combination of sorted word 4-grams with unsorted stop-word 8-grams. From the common features we compute valid intervals, which map passages from the suspicious document to passages of the source document, such that these passages are covered "densely enough" with corresponding common features. For PAN 2013, we modified the post-processing phase: the fact that the algorithm had access to the whole corpus of source and suspicious documents at once allowed us to process the documents in one batch and to perform a global post-processing, handling the overlapping detections not only between the given suspicious and source document, but also between all the detections from a given suspicious document. The modifications brought a significant improvement compared to PAN 2012 on the training corpus, and the results from the competition corpus are similar enough to claim that these improvements are usable in general.
-
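As an illustration of the 2-tuple snippet similarity described in the abstract above, a minimal Python sketch could look like the following. This is an assumption-laden reconstruction, not the authors' actual implementation: the tokenisation rule and the set-based ratio are illustrative choices.

```python
import re

def snippet_similarity(snippet: str, suspicious_doc: str) -> float:
    """Share of the snippet's neighbouring word pairs (2-tuples)
    that also occur in the suspicious document.

    Illustrative sketch: tokenisation and scoring are assumptions.
    """
    def bigrams(text: str) -> set:
        words = re.findall(r"\w+", text.lower())
        return {(a, b) for a, b in zip(words, words[1:])}

    snippet_pairs = bigrams(snippet)
    if not snippet_pairs:
        return 0.0
    # Fraction of snippet word pairs that coexist in the suspicious document.
    return len(snippet_pairs & bigrams(suspicious_doc)) / len(snippet_pairs)
```

A download decision would then compare this score against the snippet similarity threshold mentioned in the abstract.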
--- /dev/null
+source retrieval
+querying
+search engine
+snippet
+url download
+plagiarism detection
+pairwise document comparison
+plagiarized passage detection
+common features
+valid intervals
+
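The common-feature extraction mentioned above (sorted word 4-grams combined with unsorted stop-word 8-grams) can be sketched as follows. The tokenisation and the stop-word list are illustrative assumptions, not the paper's exact definitions:

```python
import re

# Illustrative stop-word subset; the paper's actual list is not reproduced here.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "on"}

def sorted_word_ngrams(text: str, n: int = 4) -> set:
    """Word n-grams with the words sorted inside each n-gram,
    so that local word reordering still matches."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(sorted(words[i:i + n])) for i in range(len(words) - n + 1)}

def stop_word_ngrams(text: str, n: int = 8) -> set:
    """Unsorted n-grams over the stop words only, in document order."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w in STOP_WORDS]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def common_features(suspicious: str, source: str) -> set:
    """Features shared by both documents, combining the two feature types."""
    return (sorted_word_ngrams(suspicious) & sorted_word_ngrams(source)) | \
           (stop_word_ngrams(suspicious) & stop_word_ngrams(source))
```

The valid intervals described in the abstract would then be computed over the positions of these common features, keeping passages that are covered densely enough.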
suspicious document. In PAN 2013 source retrieval subtask the main goal was to
identify web pages which have been used as a source of plagiarism for test corpus creation.

-The test corpus contained 58 documents each discussing only one theme.
+The test corpus contained 58 documents each discussing one topic only.
Those documents were created intentionally by
semiprofessional writers, thus they featured nearly realistic plagiarism cases~\cite{plagCorpus}.
Resources were looked up in the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.