diff --git a/extended-abstract.tex b/extended-abstract.tex
index 301338d..0ecfb48 100644
--- a/extended-abstract.tex
+++ b/extended-abstract.tex
@@ -16,19 +16,157 @@
 \maketitle
 
-- link to previous work
+\section{General Approach}
 
-- parameters we used
+The approach the Masaryk University team has used in the detailed
+comparison sub-task of the PAN 2012 plagiarism detection competition
+is based on the approach we used in PAN 2010 \cite{Kasprzak2010},
+enhanced in several ways, most notably by the support for multiple
+feature types.
 
+The algorithm evaluates each document pair in several stages:
 
-- post-processing
+\begin{itemize}
+\item intrinsic plagiarism detection
+\item language detection of the source document
+\begin{itemize}
+\item cross-lingual plagiarism detection, if the source document is not in English
+\end{itemize}
+\item detection of intervals with common features
+\item a post-processing phase, which mainly serves to merge nearby common intervals
+\end{itemize}
 
-- criticism of plagdet?
+\section{Intrinsic Plagiarism Detection}
+
+Our approach is based on character $n$-gram profiles of intervals of
+a fixed size (in terms of $n$-grams), and on their differences from
+the profile of the whole document \cite{pan09stamatatos}. We have
+further enhanced the approach by applying Gaussian smoothing to the
+style-change function \cite{Kasprzak2010}.
+
+For PAN 2012, we have experimented with using 1-, 2-, and 3-grams
+instead of 3-grams only, and with a different measure of the
+difference between the $n$-gram profiles. We have used an approach
+similar to \cite{ngram}, where we have computed the profile as an
+ordered set of the 400 most frequent $n$-grams in a given text (the
+whole document or a partial window). Apart from the ordering, we have
+ignored the actual number of occurrences of a given $n$-gram
+altogether, and used a value inversely proportional to the rank of
+the $n$-gram in the profile, in accordance with Zipf's law
+\cite{zipf1935psycho}.
+
+This approach has provided a more stable style-change function than
+the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
+nature of the detailed comparison sub-task, we could not use the
+results of the intrinsic detection directly, so we wanted to use them
+as hints for the external detection.
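+
+As an illustration, the following Python sketch shows one possible
+implementation of the rank-based profiles and of the smoothed
+style-change function described above. It is a simplified sketch, not
+our submitted implementation: the window is measured in characters
+rather than in $n$-grams, and the window size, step, and smoothing
+kernel are illustrative placeholders.
+
+\begin{verbatim}
+from collections import Counter
+
+def profile(text, n=3, size=400):
+    # Rank-based profile: the `size` most frequent character n-grams,
+    # weighted inversely proportionally to their rank (Zipf-like).
+    grams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
+    return {g: 1.0 / rank
+            for rank, (g, _) in enumerate(grams.most_common(size), 1)}
+
+def dissimilarity(p1, p2):
+    # Profile difference: occurrence counts are ignored, only the
+    # inverse ranks of n-grams present in either profile are used.
+    return sum(abs(p1.get(g, 0.0) - p2.get(g, 0.0))
+               for g in set(p1) | set(p2))
+
+def style_change(text, window=2000, step=200):
+    # Style-change function: dissimilarity of each window profile
+    # against the profile of the whole document.
+    doc = profile(text)
+    return [dissimilarity(profile(text[i:i+window]), doc)
+            for i in range(0, max(1, len(text) - window), step)]
+
+def smooth(values, radius=3):
+    # Smoothing of the style-change function; a fixed binomial
+    # kernel serves here as a stand-in for Gaussian smoothing.
+    kernel = [1, 6, 15, 20, 15, 6, 1]
+    pad = [values[0]] * radius + values + [values[-1]] * radius
+    return [sum(k * pad[i + j] for j, k in enumerate(kernel))
+            / sum(kernel) for i in range(len(values))]
+\end{verbatim}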
+
+\section{Cross-lingual Detection}
+
+%For language detection, we used the $n$-gram based categorization \cite{ngram}.
+%We have computed the language profiles from the source documents of the
+%training corpus (using the annotations from the corpus itself). The results
+%of this approach were better than those of the stop-word-based detection we
+%used in PAN 2010. However, there were still mis-detected documents,
+%mainly long lists of surnames and other tabular data. We have added
+%an ad-hoc fix: documents whose profile was too distant from all of the
+%English, German, and Spanish profiles were declared to be in English.
+
+For cross-lingual plagiarism detection, our aim was to use the public
+interface of Google Translate where possible, and to use the resulting
+document as the source for the standard intra-lingual detector.
+Should the translation service not be available, we wanted to fall
+back to translating isolated words only, with additional exact
+matching of longer words (we have used words of 5 or more characters).
+We have assumed that such longer words can be names or specialized
+terms, present in both languages.
+
+We have used dictionaries from several sources, such as
+{\tt dicts.info}\footnote{\url{http://www.dicts.info/}},
+{\tt omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+and {\tt wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
+The source and the translated document were aligned on a
+line-by-line basis.
+
+In the final form of the detailed comparison sub-task, the results of
+machine translation of the source documents were provided to the
+detector programs by the surrounding environment, so we have discarded
+the language detection and machine translation from our submission
+altogether, and used only the line-by-line alignment of the source and
+the translated document for calculating the offsets of text features
+in the source document.
+
+\section{Multi-feature Plagiarism Detection}
+
+Our pair-wise plagiarism detection is based on finding common passages
+of text, present in both the source and the suspicious document. We
+call them {\it features}. In PAN 2010, we have used sorted word
+5-grams, formed from words of three or more characters, as the
+features to compare. Recently, other means of plagiarism detection
+have been explored, stop-word $n$-gram detection being one of them
+\cite{stamatatos2011plagiarism}.
+
+We propose a plagiarism detection system based on detecting common
+features of various types, such as word $n$-grams, stop-word
+$n$-grams, translated words or word bigrams, and exactly matching
+longer words in document pairs where each document is in a different
+language. The system has to be largely independent of the
+particularities of the individual feature types. It cannot, for
+example, use the ordinal numbers of features as a measure of distance
+between them, because several word 5-grams can be fully contained
+inside a single stop-word 8-gram.
+
+We thus define a {\it common feature} of two documents (susp and src)
+as the following tuple:
+$$\langle
+\hbox{offset}_{\hbox{susp}},
+\hbox{length}_{\hbox{susp}},
+\hbox{offset}_{\hbox{src}},
+\hbox{length}_{\hbox{src}} \rangle$$
+
+In our final submission, we have used only the following two types
+of common features:
+
+\begin{itemize}
+\item word 5-grams, formed from words of three or more characters,
+sorted, lowercased
+\item stop-word 8-grams, formed from the 50 most frequent English
+words (including the possessive suffix 's), unsorted, lowercased,
+with 8-grams formed only from the six most frequent words
+({\it the, of, a, in, to, 's}) removed
+\end{itemize}
+
+We have gathered all the common features for a given document pair,
+and formed {\it valid intervals} from them, as described in
+\cite{Kasprzak2009a} (a similar approach is also used in
+\cite{stamatatos2011plagiarism}). The algorithm is modified for
+multi-feature detection to use character offsets only, instead of
+feature ordinal numbers. We have used valid intervals consisting of
+at least 5 common features, with the maximum allowed gap inside the
+interval (characters not belonging to any common feature of a given
+valid interval) set to 3,500 characters.
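+
+As an illustration, the following Python sketch shows one way of
+forming valid intervals from character-offset-based common features
+under the constraints described above (at least 5 common features,
+gaps of at most 3,500 characters). It is a simplified, greedy sketch;
+the actual algorithm described in \cite{Kasprzak2009a} treats interval
+boundaries and outlying features more carefully.
+
+\begin{verbatim}
+from typing import List, NamedTuple
+
+class Feature(NamedTuple):
+    off_susp: int   # character offset in the suspicious document
+    len_susp: int
+    off_src: int    # character offset in the source document
+    len_src: int
+
+MAX_GAP = 3500      # maximum allowed gap inside a valid interval
+MIN_FEATURES = 5    # minimum number of common features per interval
+
+def valid_intervals(features: List[Feature]) -> List[List[Feature]]:
+    # Chain the features sorted by their offset in the suspicious
+    # document, breaking the chain whenever the gap in either
+    # document exceeds MAX_GAP; keep chains with enough features.
+    chains, current = [], []
+    for f in sorted(features, key=lambda f: f.off_susp):
+        if current:
+            prev = current[-1]
+            gap_susp = f.off_susp - (prev.off_susp + prev.len_susp)
+            gap_src = abs(f.off_src - (prev.off_src + prev.len_src))
+            if gap_susp > MAX_GAP or gap_src > MAX_GAP:
+                chains.append(current)
+                current = []
+        current.append(f)
+    if current:
+        chains.append(current)
+    return [c for c in chains if len(c) >= MIN_FEATURES]
+\end{verbatim}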
+
+We have also experimented with adjusting the allowed gap size using
+the intrinsic plagiarism detection: allowing only a shorter gap if
+the common features around the gap belong to different passages
+detected as plagiarized in the suspicious document by the intrinsic
+detector, and a larger gap if both surrounding common features belong
+to the same detected passage. This approach, however, did not show
+any improvement over a static gap size, so it was omitted from the
+final submission.
+
+\section{Postprocessing}
+
+The post-processing phase mainly serves to merge nearby valid
+intervals into larger plagiarized passages; we will describe it in
+detail in the full paper.
+
+\section{Further Discussion}
+
+In the full paper, we will also discuss the following topics:
+
+\begin{itemize}
+\item language detection
+\item the suitability of the plagdet score \cite{potthastframework} for performance measurement
+\item the feasibility of our approach in large-scale systems
+\item other possible features to use, especially for cross-lingual detection
+\item parameter settings
+\end{itemize}
 
 \bibliographystyle{splncs03}
 \begin{raggedright}
-\bibliography{}
+\bibliography{paper}
 \end{raggedright}
 \end{document}