X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-paper%2Fpan13-notebook.tex;h=a6a711656247111e8eabf7c95e05c38c1fbf4a99;hb=14ecfe62bce797cf4a4dba67481fccce2bba24aa;hp=a6d2ba3f9c7dc1bbd7590de86fcd611a417f2864;hpb=060415b2f4c4f0482b6a128f8103651e1c9af823;p=pan13-paper.git

diff --git a/pan13-paper/pan13-notebook.tex b/pan13-paper/pan13-notebook.tex
index a6d2ba3..a6a7116 100755
--- a/pan13-paper/pan13-notebook.tex
+++ b/pan13-paper/pan13-notebook.tex
@@ -7,7 +7,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{document}
-\title{Improving plagiarism detection}
+\title{Neco Simonovo and Feature Type Selection for Pairwise Document Comparison}
 %%% Please do not remove the subtitle.
 \subtitle{Notebook for PAN at CLEF 2013}
@@ -19,22 +19,44 @@
 \begin{abstract}
 This paper describes approaches used for the Plagiarism Detection task in the PAN 2013 international competition
-on uncovering plagiarism, authorship, and social software misuse.
-
+on uncovering plagiarism, authorship, and social software misuse.
+We present a modified three-way search methodology for the Source Retrieval subtask and analyse snippet similarity performance.
+The results show that the presented approach is adaptable to real-world plagiarism situations.
+For the detailed comparison task, we discuss feature type selection
+and global postprocessing. We have significantly improved the pairwise comparison
+results, and further optimizations are still possible.
 \end{abstract}
 \section{Introduction}
-
-The notebooks shall contain a full write-up of your approach, including all details necessary to reproduce your results.
-
-
-\include{simon-source_retrieval}
-\include{yenya-text_alignment}
+In the PAN 2013 competition on plagiarism detection we participated in both the Source Retrieval
+and the Text Alignment subtask. In both tasks we adapted the methodology used in PAN 2012\footnote{%
+See \cite{pan2012} for an overview of the PAN 2012 plagiarism detection campaign.} \cite{suchomel_kas_12}.
+Section~\ref{source_retr} describes the querying approach for source retrieval, where we used three different
+types of queries. We present a new type of query based on text paragraphs.
+The query execution was controlled by its type and by preliminary similarities
+discovered during the searches.
+In Section~\ref{text_alignment} we describe our approach for the text alignment
+(pairwise comparison) subtask. We briefly introduce our system,
+and then we discuss the feature types usable for pairwise comparison, including an evaluation of their feasibility for this purpose. We then describe
+the global (corpus-wide) optimizations used, and finally we discuss
+the results achieved and further development.
+
+\input{simon-source_retrieval}
+\input{yenya-text_alignment}
 \section{Conclusions}
+Unfortunately, the ChatNoir search engine does not support phrasal search, so
+the evaluated results may be distorted in this respect.
+
+In the text alignment subtask, we have achieved a significant improvement
+with respect to our system from PAN 2012. Further development in this
+area is still possible. For a real-world system, however, a completely
+different set of parameters and heuristics needs to be used, because
+the plagdet score and the structure of the competition corpus
+differ too much from real-world conditions.
 \bibliographystyle{splncs03}
 \begin{raggedright}