From: Jan "Yenya" Kasprzak
Date: Sat, 1 Jun 2013 21:13:12 +0000 (+0200)
Subject: Edits before submission
X-Git-Tag: odeslano-20130601-2314
X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?p=pan13-paper.git;a=commitdiff_plain;h=b2a29fbd0610261c2e1ba0738b181a9a98ed01ee

Edits before submission

- added keywords
- included the extended abstract as doc instead of txt
- minor edit to Simon's chapter
---

diff --git a/pan13-paper/extended-abstract.doc b/pan13-paper/extended-abstract.doc
new file mode 100644
index 0000000..7c31158
Binary files /dev/null and b/pan13-paper/extended-abstract.doc differ
diff --git a/pan13-paper/extended-abstract.txt b/pan13-paper/extended-abstract.txt
deleted file mode 100755
index 33b98d1..0000000
--- a/pan13-paper/extended-abstract.txt
+++ /dev/null
@@ -1,3 +0,0 @@
-This paper describes our approaches for the Plagiarism Detection task of PAN 2013. We present modified three-way search methodology for source retrieval subtask. We introduce new query type – the paragraph based queries. Their purpose is to check some parts of suspicious text in more depth. The other two types of queries are: the keywords based for retrieval of documents concerning the same theme; and the intrinsic plagiarism based for retrieval sources which contain text detected as different, in a manner of writing style, from other parts of the suspicious document. The query execution was controlled by its type and by preliminary similarities discovered during the searches. We discuss 2-tuples snippet similarity measurement for decision making over search result download, which indicates how many neighbouring word pairs coexist in the snippet and in the suspicious document. Our tests indicate advantages setting of snippet similarity threshold. The results show that our approach had the second best ratio of recall to the number of used queries, which tells about the query efficacy. Our approached achieved low precision probably due to reporting many results which were not considered as correct hits. Nonetheless those results contained some textual similarity according to text alignment subtask score, which we believe is still worthwhile to report.
-For the text alignment subtask, we use the similar approach as in PAN 2012.We detect common features of various types between the suspicious and source documents. We experimented with more types of features. The best results had the combination of sorted word 4-grams with unsorted stop-word 8-grams. From the common features we compute valid intervals, which map passages from the suspicious document to the passages of the source document, such that these passages are covered “densely enough” with corresponding common features. For PAN 2013, we modified the post-processing phase: the fact that the algorithm had access to the whole corpus of source and suspicious documents at once allowed us to process the documents in one batch and to perform a global post-processing, handling the overlapping detections not only between the given suspicious and source document, but also between all the detections from a given suspicious document. The modifications brought a significant improvement compared to PAN 2012 on a training corpus, and the results from the competition corpus are similar enough to claim that these improvements are usable in general.
-
diff --git a/pan13-paper/keywords.txt b/pan13-paper/keywords.txt
new file mode 100644
index 0000000..a5e4a3a
--- /dev/null
+++ b/pan13-paper/keywords.txt
@@ -0,0 +1,11 @@
+source retrieval
+querying
+search engine
+snippet
+url download
+plagiarism detection
+pairwise document comparison
+plagiarized passage detection
+common features
+valid intervals
+
diff --git a/pan13-paper/pan13-notebook.pdf b/pan13-paper/pan13-notebook.pdf
new file mode 100644
index 0000000..b58c21c
Binary files /dev/null and b/pan13-paper/pan13-notebook.pdf differ
diff --git a/pan13-paper/simon-source_retrieval.tex b/pan13-paper/simon-source_retrieval.tex
index 4370a1d..2777f37 100755
--- a/pan13-paper/simon-source_retrieval.tex
+++ b/pan13-paper/simon-source_retrieval.tex
@@ -5,7 +5,7 @@ large corpus. Those candidate documents are usually further compared in detail w
 suspicious document.
 In PAN 2013 source retrieval subtask the main goal was to identify web pages which have been used as a source of plagiarism for test corpus creation.
-The test corpus contained 58 documents each discussing only one theme.
+The test corpus contained 58 documents each discussing one topic only.
 Those documents were created intentionally by semiprofessional writers, thus they featured nearly realistic plagiarism cases~\cite{plagCorpus}.
 Resources were looked up in the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.