yenya-detailed.tex

\section{Detailed Document Comparison}

\subsection{General Approach}

The approach the Masaryk University team used in the detailed comparison
sub-task of PAN 2012 plagiarism detection builds on the one we used
in PAN 2010 \cite{Kasprzak2010}, enhanced in several ways.

The algorithm evaluates each document pair in several stages:

\begin{itemize}
\item intrinsic plagiarism detection
\item language detection of the source document
\begin{itemize}
\item cross-lingual plagiarism detection, if the source document is not in English
\end{itemize}
\item detection of intervals with common features
\item post-processing, which mainly merges nearby common intervals
\end{itemize}

\subsection{Intrinsic plagiarism detection}

Our approach is based on character $n$-gram profiles of fixed-size
intervals (with the size measured in $n$-grams), and on their differences
from the profile of the whole document \cite{pan09stamatatos}. We have
further enhanced the approach by applying Gaussian smoothing to the
style-change function \cite{Kasprzak2010}.
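
The smoothing step can be sketched in Python as follows. This is only an
illustration under stated assumptions: the kernel width {\tt sigma}, the
three-sigma truncation radius, and the clamping at the document edges are
hypothetical choices, not values taken from the submission.

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights on [-radius, radius], normalised to sum to 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values, sigma=2.0):
    """Convolve values with a Gaussian kernel, clamping indices at both ends."""
    radius = max(1, int(3 * sigma))  # assumed truncation at three sigma
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # repeat the edge value
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed
```

Applied to a style-change function with an isolated spike, {\tt smooth}
returns a lower and wider bump whose maximum stays at the same position.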

For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
of 3-grams only, and with a different measure of the difference between
the $n$-gram profiles. We have used an approach similar to \cite{ngram}:
we compute the profile as an ordered set of the 400 most frequent
$n$-grams in a given text (the whole document or a partial window). Apart
from using it to order the set, we have ignored the actual number of
occurrences of a given $n$-gram altogether, and instead used a value
inversely proportional to the rank of the $n$-gram in the profile,
in accordance with Zipf's law \cite{zipf1935psycho}.
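
A rank-based profile of this kind can be sketched in Python as follows.
The sketch makes several assumptions that the text does not fix: 1-, 2-,
and 3-grams share a single profile, ties in frequency are broken
alphabetically, and the function {\tt profile\_difference} is a
hypothetical dissimilarity (a sum of absolute weight differences), not
necessarily the measure used in the submission.

```python
from collections import Counter

PROFILE_SIZE = 400  # profile length stated in the text

def char_ngrams(text, sizes=(1, 2, 3)):
    """Yield every character n-gram of the given sizes."""
    for n in sizes:
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def rank_profile(text, size=PROFILE_SIZE):
    """Map each of the `size` most frequent n-grams to the weight 1/rank."""
    counts = Counter(char_ngrams(text))
    # Descending count; alphabetical tie-breaking is an assumption.
    ordered = sorted(counts, key=lambda gram: (-counts[gram], gram))[:size]
    return {gram: 1.0 / rank for rank, gram in enumerate(ordered, start=1)}

def profile_difference(window, document):
    """Hypothetical dissimilarity: summed absolute weight differences."""
    grams = set(window) | set(document)
    return sum(abs(window.get(g, 0.0) - document.get(g, 0.0)) for g in grams)
```

The counts are used only for ordering; once the rank is known, the weight
$1/\hbox{rank}$ replaces them, so two texts with the same ordering of
$n$-grams get identical profiles regardless of their lengths.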

This approach has provided a more stable style-change function than
the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
nature of the detailed comparison sub-task, we could not use the results
of the intrinsic detection directly, so we wanted to use them
as hints for the external detection.

\subsection{Cross-lingual detection}

%For language detection, we used the $n$-gram based categorization \cite{ngram}.
%We have computed the language profiles from the source documents of the
%training corpus (using the annotations from the corpus itself). The result
%of this approach was better than using the stopwords-based detection we have
%used in PAN 2010. However, there were still mis-detected documents,
%mainly the long lists of surnames and other tabular data. We have added
%an ad-hoc fix, where for documents having their profile too distant from all of
%English, German, and Spanish profiles, we have declared them to be in English.

For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate if possible, and to use the resulting document
as the source for the standard intra-lingual detector.
Should the translation service not be available, we wanted
to fall back to translating isolated words only,
with additional exact matching of longer words (we have used words of
5 characters or longer).
We have supposed that these longer words can be names or specialized terms
present in both languages.
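
The fall-back strategy can be sketched in Python as follows. The function
name, the toy dictionary, and the decision to drop short words missing from
the dictionary are assumptions made for this illustration only.

```python
import re

MIN_EXACT_LENGTH = 5  # "5 characters or longer", as stated in the text

def fallback_translate(text, dictionary):
    """Translate isolated words; keep long unknown words for exact matching.

    Returns (word, offset) pairs, where offset points into the original text.
    """
    result = []
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word in dictionary:
            result.append((dictionary[word], match.start()))
        elif len(word) >= MIN_EXACT_LENGTH:
            # Possibly a name or a specialized term shared by both languages.
            result.append((word, match.start()))
    return result

toy_dictionary = {"haus": "house", "und": "and"}  # hypothetical entries
pairs = fallback_translate("Das Haus und Einstein", toy_dictionary)
```

Here {\tt Haus} and {\tt und} are translated, {\tt Einstein} is kept
unchanged as a candidate for exact matching, and {\tt Das} is dropped
because it is neither in the toy dictionary nor long enough.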

We have used dictionaries from several sources, such as
{\tt dicts.info\footnote{\url{http://www.dicts.info/}}},
{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}},
and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source
and translated documents were aligned on a line-by-line basis.

In the final form of the detailed comparison sub-task, machine
translations of the source documents were provided to the detector programs
by the surrounding environment. We have therefore dropped language detection
and machine translation from our submission altogether, and used only
the line-by-line alignment of the source and translated documents to calculate
the offsets of text features in the source document.
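
One way to use a line-by-line alignment for offset calculation is sketched
below in Python. It assumes that both documents have the same number of
lines and that reporting whole source lines is precise enough; the function
names are hypothetical, and the actual submission may map offsets at
a finer granularity.

```python
from bisect import bisect_right

def line_starts(text):
    """Character offset at which each line of text begins."""
    starts, position = [], 0
    for line in text.splitlines(keepends=True):
        starts.append(position)
        position += len(line)
    return starts

def map_span_to_source(offset, length, translated, source):
    """Map a span in the translated text to whole lines of the source text.

    Returns (source_offset, source_length) covering the source lines with the
    same line indices as the translated lines touched by the span.
    """
    t_starts = line_starts(translated)
    s_starts = line_starts(source)
    if len(t_starts) != len(s_starts):
        raise ValueError("documents are not aligned line by line")
    if length < 1 or offset < 0 or offset + length > len(translated):
        raise ValueError("span lies outside the translated document")
    first = bisect_right(t_starts, offset) - 1
    last = bisect_right(t_starts, offset + length - 1) - 1
    start = s_starts[first]
    end = s_starts[last + 1] if last + 1 < len(s_starts) else len(source)
    return start, end - start
```

A feature found inside the second line of the translated document is thus
reported as the whole second line of the source document.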

\subsection{Multi-feature Plagiarism Detection}

Our pair-wise plagiarism detection is based on finding common passages
of text present in both the source and the suspicious document. We call them
{\it features}. In PAN 2010, we used sorted word 5-grams, formed from
words of three or more characters, as the features to compare.
Recently, other means of plagiarism detection have been explored;
stop-word $n$-gram detection is one of them
\cite{stamatatos2011plagiarism}.

We propose a plagiarism detection system based on detecting common
features of various types, such as word $n$-grams, stop-word $n$-grams,
translated words or word bigrams, and exact common longer words in document
pairs where each document is in a different language. The system
has to be largely independent of the specifics of the individual
feature types. It cannot, for example, use the order of the features
as a measure of the distance between them, because several
word 5-grams can be fully contained inside one stop-word 8-gram.

We thus define a {\it common feature} of two documents (susp and src)
as the following tuple:
$$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$$
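
The tuple above maps naturally to a small record type. The following Python
sketch also shows one possible way to collect common features from two lists
of per-document features; the {\tt (key, offset, length)} input format and
the function name are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import NamedTuple

class CommonFeature(NamedTuple):
    offset_susp: int
    length_susp: int
    offset_src: int
    length_src: int

def common_features(susp_features, src_features):
    """Pair every occurrence of a key in susp with every occurrence in src.

    Both arguments are iterables of (key, offset, length) triples.
    """
    src_index = defaultdict(list)
    for key, offset, length in src_features:
        src_index[key].append((offset, length))
    matches = []
    for key, offset, length in susp_features:
        for src_offset, src_length in src_index.get(key, ()):
            matches.append(
                CommonFeature(offset, length, src_offset, src_length))
    return matches
```

Because the record stores only character offsets and lengths, features of
different types can be mixed freely in one list.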

In our final submission, we have used only the following two types
of common features:

\begin{itemize}
\item word 5-grams, from words of three or more characters, sorted, lowercased
\item stop-word 8-grams, from the 50 most frequent English words (including
        the possessive suffix 's), unsorted, lowercased, with 8-grams formed
        only from the seven most-frequent words ({\it the, of, a, in, to, 's})
        removed
\end{itemize}
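
Both feature types can be extracted along the following lines. This Python
sketch is an illustration only: the tokenization by regular expression is
an assumption, and the stop-word list is passed in by the caller because the
50-word list is not reproduced here. The small sets in the example are
hypothetical, with {\tt too\_common} copying the words listed above.

```python
import re

def word_5grams(text, n=5, min_len=3):
    """Sorted, lowercased word n-grams as (key, offset, length) triples."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text) if len(m.group()) >= min_len]
    features = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        key = " ".join(sorted(word for word, _, _ in window))
        start, end = window[0][1], window[-1][2]
        features.append((key, start, end - start))
    return features

def stopword_8grams(text, stopwords, too_common, n=8):
    """Unsorted stop-word n-grams, skipping those made only of too_common."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"'s\b|\w+", text)
              if m.group().lower() in stopwords]
    features = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        words = [word for word, _, _ in window]
        if all(word in too_common for word in words):
            continue  # carries too little information to be useful
        start, end = window[0][1], window[-1][2]
        features.append((" ".join(words), start, end - start))
    return features

too_common = {"the", "of", "a", "in", "to", "'s"}
stopwords = too_common | {"and", "is"}  # hypothetical short list
demo = stopword_8grams("the cat of a dog is in the house",
                       stopwords, too_common, n=3)
```

Sorting the words inside a word 5-gram makes the key insensitive to word
reordering, while the stop-word $n$-grams keep the original order.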

We have gathered all the common features for a given document pair, and formed
{\it valid intervals} from them, as described in \cite{Kasprzak2009a}
(a similar approach is also used in \cite{stamatatos2011plagiarism}).
For multi-feature detection, the algorithm is modified to use character offsets
only, instead of feature order numbers. We have used valid intervals
consisting of at least 5 common features, with the maximum allowed gap
inside the interval (characters not belonging to any common feature
of a given valid interval) set to 3,500 characters.
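
The two thresholds can be illustrated with a deliberately simplified Python
sketch. It is a greedy single-pass grouping and {\it not} the algorithm of
\cite{Kasprzak2009a}; the function name and the way the gap is checked on
the source side are assumptions made for this illustration.

```python
MIN_FEATURES = 5  # minimum number of common features, from the text
MAX_GAP = 3500    # maximum allowed gap in characters, from the text

def valid_intervals(features, min_features=MIN_FEATURES, max_gap=MAX_GAP):
    """Greedily group (off_susp, len_susp, off_src, len_src) tuples.

    Returns (susp_start, susp_end, src_start, src_end) per valid interval.
    """
    intervals, group = [], []

    def flush():
        if len(group) >= min_features:
            intervals.append((
                min(f[0] for f in group), max(f[0] + f[1] for f in group),
                min(f[2] for f in group), max(f[2] + f[3] for f in group)))
        group.clear()

    for feat in sorted(features):  # ordered by offset in the suspicious text
        if group:
            susp_end = max(f[0] + f[1] for f in group)
            src_start = min(f[2] for f in group)
            src_end = max(f[2] + f[3] for f in group)
            near_susp = feat[0] - susp_end <= max_gap
            near_src = (feat[2] - src_end <= max_gap
                        and src_start - (feat[2] + feat[3]) <= max_gap)
            if not (near_susp and near_src):
                flush()
        group.append(feat)
    flush()
    return intervals
```

Only character offsets enter the computation, so the sketch works the same
way for any mixture of feature types.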

We have also experimented with modifying the allowed gap size using the
intrinsic plagiarism detection: allowing only a shorter gap if the common
features around the gap belong to different passages detected as plagiarized
in the suspicious document by the intrinsic detector, and allowing a larger gap
if both surrounding common features belong to the same passage
detected by the intrinsic detector. This approach, however, did not show
any improvement over an allowed gap of a static size, so it was omitted
from the final submission.
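
The experiment can be pictured with the following Python sketch. The values
of {\tt short\_gap} and {\tt long\_gap} are hypothetical, as is the choice
to fall back to the static gap when either neighbouring feature lies
outside all passages reported by the intrinsic detector.

```python
def passage_index(offset, passages):
    """Index of the (start, end) passage containing offset, or None."""
    for i, (start, end) in enumerate(passages):
        if start <= offset < end:
            return i
    return None

def allowed_gap(left_end, right_start, passages,
                base_gap=3500, short_gap=1000, long_gap=6000):
    """Choose the gap limit between two neighbouring common features.

    left_end is the end offset of the feature before the gap, right_start
    the start offset of the feature after it, both in the suspicious text.
    """
    left = passage_index(left_end - 1, passages)
    right = passage_index(right_start, passages)
    if left is not None and left == right:
        return long_gap   # both features lie in the same detected passage
    if left is not None and right is not None:
        return short_gap  # the features lie in different detected passages
    return base_gap       # assumed fall-back: the static gap size
```

With passages {\tt [(0, 1000), (5000, 8000)]}, a gap between offsets 200
and 700 gets the larger limit, while a gap between offsets 900 and 5100
gets the shorter one.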

\subsection{Postprocessing}


\subsection{Further discussion}

In the full paper, we will also discuss the following topics:

\begin{itemize}
\item language detection
\item suitability of the plagdet score~\cite{potthastframework} for performance measurement
\item feasibility of our approach in large-scale systems
\item other possible features to use, especially for cross-lingual detection
\item parameter settings
\end{itemize}