\section{Detailed Document Comparison}

\subsection{General Approach}

The approach the Masaryk University team used for the detailed
comparison sub-task of the PAN 2012 Plagiarism Detection task is based
on the approach we used in PAN 2010 \cite{Kasprzak2010}, enhanced in
several ways.

The algorithm evaluates the document pair in several stages:

\begin{itemize}
\item intrinsic plagiarism detection
\item language detection of the source document
\begin{itemize}
\item cross-lingual plagiarism detection, if the source document is not in English
\end{itemize}
\item detecting intervals with common features
\item a post-processing phase, which mainly serves to merge nearby common intervals
\end{itemize}

\subsection{Intrinsic plagiarism detection}

Our approach is based on character $n$-gram profiles of windows of a
fixed size (in terms of $n$-grams), and their differences from the
profile of the whole document \cite{pan09stamatatos}. We have further
enhanced the approach by applying Gaussian smoothing to the
style-change function \cite{Kasprzak2010}.

For PAN 2012, we have experimented with using 1-, 2-, and 3-grams
instead of only 3-grams, and with a different measure of the difference
between the $n$-gram profiles. We have used an approach similar to
\cite{ngram}, where we compute the profile as an ordered set of the 400
most frequent $n$-grams in a given text (the whole document or a
partial window). Apart from ordering the set, we have ignored the
actual number of occurrences of a given $n$-gram altogether, and used a
value inversely proportional to the position of the $n$-gram in the
profile, in accordance with Zipf's law \cite{zipf1935psycho}.

This approach has provided a more stable style-change function than
the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
nature of the detailed comparison sub-task, we could not use the
results of the intrinsic detection directly, so we intended to use them
as hints for the external detection.
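The following sketch illustrates the profile computation and the
rank-based difference measure described above. It is a minimal
illustration, not our exact implementation: the choice of $n=3$, the
window and step sizes (here in characters), and the precise
dissimilarity formula are assumptions.

\begin{verbatim}
# A minimal sketch of the rank-based n-gram profile difference.
# Window size, step, n = 3, and the dissimilarity formula are
# illustrative assumptions, not the exact submission parameters.
from collections import Counter

PROFILE_SIZE = 400

def profile(text, n=3):
    # Top PROFILE_SIZE character n-grams; occurrence counts are
    # then discarded and only the rank matters (Zipf's law).
    counts = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    top = [g for g, _ in counts.most_common(PROFILE_SIZE)]
    return {g: 1.0 / (rank + 1) for rank, g in enumerate(top)}

def dissimilarity(win_prof, doc_prof):
    # Sum of weight differences over the union of both profiles.
    grams = set(win_prof) | set(doc_prof)
    return sum(abs(win_prof.get(g, 0.0) - doc_prof.get(g, 0.0))
               for g in grams)

def style_change(text, window=5000, step=1000):
    # Raw style-change curve over sliding windows; Gaussian
    # smoothing is applied to this curve afterwards.
    doc_prof = profile(text)
    return [dissimilarity(profile(text[i:i+window]), doc_prof)
            for i in range(0, max(1, len(text) - window), step)]
\end{verbatim}

Windows where the smoothed curve is high indicate a possible style
change; as described above, we treat these only as hints for the
external detection.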
\subsection{Cross-lingual detection}

%For language detection, we used the $n$-gram based categorization \cite{ngram}.
%We have computed the language profiles from the source documents of the
%training corpus (using the annotations from the corpus itself). The result
%of this approach was better than using the stopwords-based detection we had
%used in PAN 2010. However, there were still mis-detected documents,
%mainly the long lists of surnames and other tabular data. We have added
%an ad-hoc fix: documents whose profile was too distant from all of the
%English, German, and Spanish profiles were declared to be in English.

For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate if possible, and to use the resulting
document as the source for the standard intra-lingual detector.
Should the translation service be unavailable, we wanted to fall back
to translating isolated words only, with additional exact matching of
longer words (we have used words of five or more characters). We have
assumed that these longer words can be names or specialized terms,
present in both languages.

We have used dictionaries from several sources, such as
{\tt dicts.info\footnote{\url{http://www.dicts.info/}}},
{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}},
and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source
and the translated document were aligned on a line-by-line basis.

In the final form of the detailed comparison sub-task, the results of
machine translation of the source documents were provided to the
detector programs by the surrounding environment, so we have discarded
the language detection and machine translation from our submission
altogether, and used only the line-by-line alignment of the source and
the translated document for calculating the offsets of text features
in the source document.

\subsection{Multi-feature Plagiarism Detection}

Our pair-wise plagiarism detection is based on finding common passages
of text, present in both the source and the suspicious document. We
call them {\it features}. In PAN 2010, we have used sorted word
5-grams, formed from words of three or more characters, as the
features to compare. Recently, other means of plagiarism detection
have been explored, stop-word $n$-gram detection being one of them
\cite{stamatatos2011plagiarism}.

We propose a plagiarism detection system based on detecting common
features of various types, such as word $n$-grams, stop-word
$n$-grams, translated words or word bigrams, or exact matches of
longer words in document pairs where each document is in a different
language. The system has to be largely independent of the specifics of
the individual feature types. It cannot, for example, use the order of
the features as a measure of distance between them, because several
word 5-grams can be fully contained inside one stop-word 8-gram.

We thus define a {\it common feature} of two documents (susp and src)
as the following tuple:
$$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$$

In our final submission, we have used only the following two types
of common features:

\begin{itemize}
\item word 5-grams, formed from words of three or more characters, sorted, lowercased
\item stop-word 8-grams, formed from the 50 most frequent English words
  (including the possessive suffix 's), unsorted, lowercased, with the
  8-grams formed only from the most frequent words
  ({\it the, of, a, in, to, 's}) removed
\end{itemize}

We have gathered all the common features for a given document pair,
and formed {\it valid intervals} from them, as described in
\cite{Kasprzak2009a} (a similar approach is also used in
\cite{stamatatos2011plagiarism}). The algorithm is modified for
multi-feature detection to use character offsets only, instead of
feature order numbers. We have used valid intervals consisting of at
least 5 common features, with the maximum allowed gap inside the
interval (characters not belonging to any common feature of the given
valid interval) set to 3,500 characters. The two sketches below
illustrate the feature extraction and the interval construction.
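As an illustration of the two feature types, the following sketch
extracts both kinds of features together with their character offsets.
It is a simplified illustration, not our exact implementation: the
tokenizer and the abbreviated stop-word list are assumptions, and the
handling of the possessive suffix is omitted.

\begin{verbatim}
# A simplified sketch of the two feature types described above.
# The tokenizer and the abbreviated stop-word list are assumptions;
# the real system uses the 50 most frequent English words.
import re

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "it", "that"}
VERY_FREQUENT = {"the", "of", "a", "in", "to"}

def words(text):
    # Lowercased tokens with their character offsets.
    return [(m.group(0).lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+", text)]

def word_5grams(text):
    # Sorted, lowercased word 5-grams from words of >= 3 characters.
    toks = [t for t in words(text) if len(t[0]) >= 3]
    return [(tuple(sorted(w for w, _, _ in toks[i:i+5])),
             toks[i][1], toks[i+4][2])     # (feature, start, end)
            for i in range(len(toks) - 4)]

def stopword_8grams(text):
    # Unsorted stop-word 8-grams; those formed only from the very
    # frequent words are removed.
    stops = [t for t in words(text) if t[0] in STOPWORDS]
    feats = []
    for i in range(len(stops) - 7):
        gram = tuple(w for w, _, _ in stops[i:i+8])
        if not all(w in VERY_FREQUENT for w in gram):
            feats.append((gram, stops[i][1], stops[i+7][2]))
    return feats
\end{verbatim}

Matching the resulting features between the source and the suspicious
document then yields the common-feature tuples defined above.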
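Valid intervals can then be formed from the common-feature tuples
roughly as follows. The greedy single-pass chaining shown here is a
simplification of the algorithm of \cite{Kasprzak2009a}; only the
parameter values follow the text above.

\begin{verbatim}
# A sketch of forming valid intervals from common features, where a
# feature is the tuple (susp_off, susp_len, src_off, src_len). The
# greedy single-pass chaining is a simplification of the cited
# algorithm; MIN_FEATURES and MAX_GAP follow the values in the text.
MIN_FEATURES = 5
MAX_GAP = 3500  # characters

def valid_intervals(features):
    intervals, chain = [], []
    for f in sorted(features):  # by offset in the suspicious document
        if chain:
            prev = chain[-1]
            gap_susp = f[0] - (prev[0] + prev[1])
            gap_src = abs(f[2] - (prev[2] + prev[3]))
            # A gap larger than MAX_GAP in either document ends the
            # current chain.
            if gap_susp > MAX_GAP or gap_src > MAX_GAP:
                if len(chain) >= MIN_FEATURES:
                    intervals.append(chain)
                chain = []
        chain.append(f)
    if len(chain) >= MIN_FEATURES:
        intervals.append(chain)
    # Report each interval as offsets and lengths in both documents.
    result = []
    for c in intervals:
        susp_from = c[0][0]
        susp_to = max(x[0] + x[1] for x in c)
        src_from = min(x[2] for x in c)
        src_to = max(x[2] + x[3] for x in c)
        result.append((susp_from, susp_to - susp_from,
                       src_from, src_to - src_from))
    return result
\end{verbatim}

Working with character offsets and gaps in both documents keeps the
procedure independent of the feature type, as required above.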

We have also experimented with modifying the allowed gap size using
the intrinsic plagiarism detection: allowing only a shorter gap if the
common features around the gap belong to different passages detected
as plagiarized in the suspicious document by the intrinsic detector,
and a larger gap if both the surrounding common features belong to the
same such passage. This approach, however, did not show any
improvement over an allowed gap of a static size, so it was omitted
from the final submission.

\subsection{Postprocessing}

As noted in the general approach above, the post-processing phase
mainly serves to merge nearby valid intervals; we will describe it in
detail in the full paper.

\subsection{Further discussion}

In the full paper, we will also discuss the following topics:

\begin{itemize}
\item language detection
\item suitability of the plagdet score \cite{potthastframework} for performance measurement
\item feasibility of our approach in large-scale systems
\item other possible features to use, especially for cross-lingual detection
\item discussion of parameter settings
\end{itemize}