X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-paper%2Fpan13-notebook.tex;h=4bdcaedb1140014c5be76202e398e9d051a462b4;hb=898081691e33de22c2a04a642fc6378b9421b006;hp=a6d2ba3f9c7dc1bbd7590de86fcd611a417f2864;hpb=060415b2f4c4f0482b6a128f8103651e1c9af823;p=pan13-paper.git
diff --git a/pan13-paper/pan13-notebook.tex b/pan13-paper/pan13-notebook.tex
index a6d2ba3..4bdcaed 100755
--- a/pan13-paper/pan13-notebook.tex
+++ b/pan13-paper/pan13-notebook.tex
@@ -7,7 +7,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{document}
-\title{Improving plagiarism detection}
+\title{Diverse Queries and Feature Type Selection for Plagiarism Discovery}
 %%% Please do not remove the subtitle.
 \subtitle{Notebook for PAN at CLEF 2013}
@@ -19,22 +19,50 @@
 \begin{abstract}
 This paper describes approaches used for the Plagiarism Detection task in PAN 2013 international competition
-on uncovering plagiarism, authorship, and social software misuse.
-
+on uncovering plagiarism, authorship, and social software misuse.
+We present a modified three-way search methodology for the Source Retrieval subtask and analyse snippet similarity performance.
+The results show that the presented approach is adaptable to real-world plagiarism situations.
+For the Detailed Comparison task, we discuss feature type selection and
+global postprocessing. The resulting performance is significantly better
+with the described modifications, and further improvement is still possible.
 \end{abstract}
 \section{Introduction}
-
-The notebooks shall contain a full write-up of your approach, including all details necessary to reproduce your results.
-
-
-\include{simon-source_retrieval}
-\include{yenya-text_alignment}
+In the PAN 2013 competition on plagiarism detection, we participated in both the Source Retrieval
+and the Text Alignment subtasks.
+In both tasks we adapted the methodology used in PAN 2012\footnote{%
+See \cite{pan2012} for an overview of the PAN 2012 plagiarism detection campaign.} \cite{suchomel_kas_12}.
+Section~\ref{source_retr} describes the querying approach for source retrieval, where we used three different
+types of queries. We present a new type of query based on text paragraphs.
+The query execution was controlled by its type and by preliminary similarities
+discovered during the searches.
+Section~\ref{text_alignment} describes our approach to the text alignment
+(pairwise comparison) subtask. We briefly introduce our system,
+and then we discuss the feature types that are usable for pairwise comparison,
+including the evaluation of their feasibility for this purpose. We then describe
+the global (corpus-wide) optimizations used, and finally we discuss
+the results achieved and further development.
+
+\input{simon-source_retrieval}
+\input{yenya-text_alignment}
 \section{Conclusions}
-
+We introduced a querying strategy with a snippet similarity measure. %which approved to be competitive.
+In the source retrieval subtask, the strategy achieved the second-best ratio
+of recall to the number of queries used.
+We focused our queries on selected parts of text
+and on parts with no discovered external similarities.
+Unfortunately, the ChatNoir search engine currently does not support phrasal search; therefore it
+is possible that the evaluated results are distorted by this limitation.
+
+In the text alignment subtask, we have achieved a significant improvement
+with respect to our system from PAN 2012. Further development in this
+area is still possible. For a real-world system, however, a completely
+different set of parameters and heuristics needs to be used, because
+the plagdet score and the structure of the competition corpus
+are too different from the real world.
+More information about the competition proceedings can be found in~\cite{pan2013}.
 \bibliographystyle{splncs03}
 \begin{raggedright}