% extended-abstract.tex

\documentclass{llncs}
\usepackage[american]{babel}
%\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

\title{Multi-feature Plagiarism Detection}

\author{Jan Kasprzak \and \v{S}imon Suchomel \and Michal Brandejs}
\institute{Faculty of Informatics, Masaryk University \\
{\tt\{kas,suchomel,brandejs\}@fi.muni.cz}}

\maketitle

\section{General Approach}

Our approach to the Detailed comparison sub-task of PAN 2012 Plagiarism detection
is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.

%The algorithm evaluates the document pair in several stages:
%
%\begin{itemize}
%\item intrinsic plagiarism detection
%\item language detection of the source document
%\begin{itemize}
%\item cross-lingual plagiarism detection, if the source document is not in English
%\end{itemize}
%\item detecting intervals with common features
%\item post-processing phase, mainly serves for merging the nearby common intervals
%\end{itemize}

%\section{Intrinsic plagiarism detection}
%
%Our approach is based on character $n$-gram profiles of the interval of
%the fixed size (in terms of $n$-grams), and their differences to the
%profile of the whole document \cite{pan09stamatatos}. We have further
%enhanced the approach with using Gaussian smoothing of the style-change
%function \cite{Kasprzak2010}.
%
%For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
%of only 3-grams, and using the different measure of the difference between
%the n-gram profiles. We have used an approach similar to \cite{ngram},
%where we have computed the profile as an ordered set of 400 most-frequent
%$n$-grams in a given text (the whole document or a partial window). Apart
%from ordering the set, we have ignored the actual number of occurrences
%of a given $n$-gram altogether, and used the value inversely
%proportional to the $n$-gram order in the profile, in accordance with
%the Zipf's law \cite{zipf1935psycho}.
%
%This approach has provided more stable style-change function than
%the one proposed in \cite{pan09stamatatos}. Because of pair-wise
%nature of the detailed comparison sub-task, we couldn't use the results
%of the intrinsic detection immediately, therefore we wanted to use them
%as hints to the external detection.

\section{Cross-lingual Plagiarism Detection}

%For language detection, we used the $n$-gram based categorization \cite{ngram}.
%We have computed the language profiles from the source documents of the
%training corpus (using the annotations from the corpus itself). The result
%of this approach was better than using the stopwords-based detection we have
%used in PAN 2010. However, there were still mis-detected documents,
%mainly the long lists of surnames and other tabular data. We have added
%an ad-hoc fix, where for documents having their profile too distant from all of
%English, German, and Spanish profiles, we have declared them to be in English.

%For cross-lingual plagiarism detection, our aim was to use the public
%interface to Google translate if possible, and use the resulting document
%as the source for standard intra-lingual detector.
%Should the translation service not be available, we wanted
%to use the fall-back strategy of translating isolated words only,
%with the additional exact matching of longer words (we have used words with
%5 characters or longer).
%We have supposed that these longer words can be names or specialized terms,
%present in both languages.

%We have used dictionaries from several sources, like
%{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
%{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
%and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
%and translated document were aligned on a line-by-line basis.

In the final form of the detailed comparison sub-task, machine translations
of the source documents were provided to the detector programs
by the surrounding environment. We therefore dropped language detection
and machine translation from our submission altogether and used only
line-by-line alignment of the source and translated documents to calculate
the offsets of text features in the source document. We then treated
the translated documents the same way as source documents in English.
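
The offset calculation can be illustrated with a minimal Python sketch. It
assumes that line $i$ of the translated document corresponds to line $i$ of
the source document and that a feature is mapped to the whole source lines it
touches; both the granularity and the function names (\texttt{line\_starts},
\texttt{map\_to\_source}) are assumptions made for illustration, not a
description of our submitted code.

```python
from bisect import bisect_right


def line_starts(text):
    """Return the character offset at which each line of `text` starts."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def map_to_source(offset, length, translated, source):
    """Map a span in the translated text to (offset, length) in the source.

    Assumption: both texts are aligned line by line, so the span is mapped
    to the full source lines with the same line numbers.
    """
    t_starts = line_starts(translated)
    s_starts = line_starts(source)
    # Line numbers of the first and last character of the translated span.
    first = bisect_right(t_starts, offset) - 1
    last = bisect_right(t_starts, max(offset, offset + length - 1)) - 1
    # Guard against a translation with more lines than the source.
    first = min(first, len(s_starts) - 1)
    last = min(last, len(s_starts) - 1)
    src_start = s_starts[first]
    if last + 1 < len(s_starts):
        src_end = s_starts[last + 1] - 1  # exclude the trailing newline
    else:
        src_end = len(source)
    return src_start, src_end - src_start
```

With such a mapping, a feature found in the translated text can be reported
with character offsets that refer to the original source document.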

\section{Multi-feature Plagiarism Detection}

Our pair-wise plagiarism detection is based on finding common passages
of text that are present in both the source and the suspicious document. We call them
{\it common features}. In PAN 2010, we used sorted word 5-grams, formed from
words of three or more characters, as features to compare.
Recently, other means of plagiarism detection have been explored:
stopword $n$-gram detection is one of them
\cite{stamatatos2011plagiarism}.

We propose a plagiarism detection system based on detecting common
features of various types, for example word $n$-grams, stopword $n$-grams,
translated single words, translated word bigrams,
exact common longer words from document pairs having each document
in a different language, etc. The system
has to be largely independent of the specifics of individual
feature types. It cannot, for example, use the order of given features
as a measure of distance between the features, because several
word 5-grams can be fully contained inside one stopword 8-gram.

We therefore propose to describe a {\it common feature} of two documents
(susp and src) with the following tuple:
$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
described purely in terms of the character offsets that belong to the feature
in both documents. In our final submission, we used the following two types
of common features:

\begin{itemize}
\item word 5-grams, from words of three or more characters, sorted, lowercased
\item stopword 8-grams, from 50 most-frequent English words (including
        the possessive suffix 's), unsorted, lowercased, with 8-grams formed
        only from the seven most-frequent words ({\it the, of, a, in, to, 's})
        removed
\end{itemize}
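
The two feature types and the tuple above can be sketched in Python as
follows. This is a minimal illustration under stated assumptions: the
tokenizer is a simple regular expression, \texttt{STOPWORDS} is only a small
stand-in for the 50-word list, \texttt{TOO\_COMMON} contains just the words
enumerated in the list above, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative subset only (assumption): the text uses the 50 most frequent
# English words, which are not enumerated there.
STOPWORDS = {"the", "of", "a", "in", "to", "'s", "and", "is", "that", "it",
             "was", "for", "on", "with", "as", "by", "at", "be", "this"}
# The most frequent words exactly as enumerated in the text.
TOO_COMMON = {"the", "of", "a", "in", "to", "'s"}

TOKEN_RE = re.compile(r"'s\b|[A-Za-z]+")  # simplistic tokenizer (assumption)


def tokens(text):
    """Return a list of (lowercased token, start offset, end offset)."""
    return [(m.group().lower(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]


def word_5grams(text):
    """Sorted word 5-grams from words of three or more characters."""
    toks = [t for t in tokens(text) if len(t[0]) >= 3]
    for i in range(len(toks) - 4):
        win = toks[i:i + 5]
        key = ("w5",) + tuple(sorted(t[0] for t in win))
        yield key, win[0][1], win[-1][2] - win[0][1]


def stopword_8grams(text):
    """Unsorted stopword 8-grams, skipping those made of TOO_COMMON only."""
    toks = [t for t in tokens(text) if t[0] in STOPWORDS]
    for i in range(len(toks) - 7):
        win = toks[i:i + 8]
        words = tuple(t[0] for t in win)
        if set(words) <= TOO_COMMON:
            continue
        yield ("s8",) + words, win[0][1], win[-1][2] - win[0][1]


def all_features(text):
    return list(word_5grams(text)) + list(stopword_8grams(text))


def common_features(susp, src):
    """Return (offset_susp, length_susp, offset_src, length_src) tuples."""
    index = defaultdict(list)
    for key, off, length in all_features(src):
        index[key].append((off, length))
    result = []
    for key, off, length in all_features(susp):
        for src_off, src_len in index.get(key, ()):
            result.append((off, length, src_off, src_len))
    return result
```

Because every feature is reduced to character offsets and lengths, later
stages do not need to know which feature type produced a given tuple.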

We gathered all common features of both types for a given document
pair and formed {\it valid intervals} from them, as described
in \cite{Kasprzak2009a}. A similar approach is also used in
\cite{stamatatos2011plagiarism}.
The algorithm is modified for multi-feature detection to use only character
offsets instead of feature order numbers. We used valid intervals
consisting of at least 5 common features, with the maximum allowed gap
inside the interval (characters not belonging to any common feature
of a given valid interval) set to 3,500 characters.
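
The following Python sketch illustrates the idea of offset-based valid
intervals. It is a deliberately simplified greedy pass and not the algorithm
of \cite{Kasprzak2009a}: it assumes that features are processed in the order
of their offsets in the suspicious document and that an interval is closed
as soon as the gap in either document exceeds the limit. The function name
and the output tuple layout are assumptions.

```python
def valid_intervals(features, min_features=5, max_gap=3500):
    """Group common features into intervals (simplified greedy sketch).

    `features` holds (offset_susp, length_susp, offset_src, length_src).
    Returns (susp_start, susp_end, src_start, src_end, feature_count).
    """
    intervals = []
    cur = None  # [susp_start, susp_end, src_start, src_end, count]
    for s_off, s_len, r_off, r_len in sorted(features):
        if cur is not None:
            # Characters between the open interval and the new feature.
            susp_gap = s_off - cur[1]
            src_gap = max(r_off - cur[3], cur[2] - (r_off + r_len))
            if susp_gap > max_gap or src_gap > max_gap:
                if cur[4] >= min_features:
                    intervals.append(tuple(cur))
                cur = None
        if cur is None:
            cur = [s_off, s_off + s_len, r_off, r_off + r_len, 1]
        else:
            cur[1] = max(cur[1], s_off + s_len)
            cur[2] = min(cur[2], r_off)
            cur[3] = max(cur[3], r_off + r_len)
            cur[4] += 1
    if cur is not None and cur[4] >= min_features:
        intervals.append(tuple(cur))
    return intervals
```

The defaults mirror the parameters stated above (at least 5 common features,
gap of at most 3,500 characters).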

%We have also experimented with modifying the allowed gap size using the
%intrinsic plagiarism detection: to allow only shorter gap if the common
%features around the gap belong to different passages, detected as plagiarized
%in the suspicious document by the intrinsic detector, and allow larger gap,
%if both the surrounding common features belong to the same passage,
%detected by the intrinsic detector. This approach, however, did not show
%any improvement against allowed gap of a static size, so it was omitted
%from the final submission.

\section{Postprocessing}

In the postprocessing phase, we took the resulting valid intervals
and attempted to further improve the results. We first
removed overlaps: if both overlapping intervals were
shorter than 300 characters, we removed both of them. Otherwise, we
kept the longer detection (longer in terms of its length in the suspicious document).
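
The overlap rule can be sketched in Python as follows. The sketch assumes
that both the overlap and the 300-character threshold are evaluated on the
extent of an interval in the suspicious document; the text above does not
state this explicitly, and the function name is hypothetical.

```python
def remove_overlaps(intervals, min_len=300):
    """Resolve overlapping detections (sketch under stated assumptions).

    `intervals` holds (susp_start, susp_end, src_start, src_end) tuples.
    Assumption: overlap and length are measured in the suspicious document.
    """
    alive = sorted(intervals)
    changed = True
    while changed:
        changed = False
        for i in range(len(alive) - 1):
            a, b = alive[i], alive[i + 1]
            if b[0] < a[1]:  # b starts before a ends
                len_a, len_b = a[1] - a[0], b[1] - b[0]
                if len_a < min_len and len_b < min_len:
                    del alive[i:i + 2]  # both too short: drop both
                elif len_a >= len_b:
                    del alive[i + 1]    # keep the longer detection a
                else:
                    del alive[i]        # keep the longer detection b
                changed = True
                break
    return alive
```

Checking only neighbours in sorted order is sufficient here, because an
interval that overlaps a later one also overlaps every interval that starts
in between.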

We then joined adjacent valid intervals into one detection
if at least one of the following criteria was met:
\begin{itemize}
\item the gap between the intervals contained at least 4 common features,
and it contained at least one feature per 10,000
characters\footnote{We computed the length of the gap as the number
of characters between the detections in the source document, plus the
number of characters between the detections in the suspicious document.}, or
\item the gap was smaller than 30,000 characters and the size of the adjacent
valid intervals was at least twice as big as the gap between them, or
\item the gap was smaller than 30,000 characters and the number of common
features per character in the adjacent interval was at most three times
the number of features per character in the possible joined interval.
\end{itemize}
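
The three joining criteria can be written as a single predicate. The Python
sketch below makes several assumptions that the list above leaves open: the
size of an interval is its length in the suspicious document plus its length
in the source document (mirroring the gap definition in the footnote), the
second criterion is applied to the smaller of the two intervals, and the
third criterion is applied to the denser of the two. Intervals are assumed
to be non-empty, and the names \texttt{Interval} and \texttt{should\_join}
are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    susp_start: int
    susp_end: int
    src_start: int
    src_end: int
    n_features: int

    @property
    def size(self):
        # Assumption: size = length in susp + length in src.
        return ((self.susp_end - self.susp_start)
                + (self.src_end - self.src_start))


def _distance(a_start, a_end, b_start, b_end):
    """Number of characters between two ranges (0 if they touch/overlap)."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def gap_length(a, b):
    """Gap in the suspicious document plus gap in the source document."""
    return (_distance(a.susp_start, a.susp_end, b.susp_start, b.susp_end)
            + _distance(a.src_start, a.src_end, b.src_start, b.src_end))


def should_join(a, b, gap_features):
    """Return True if any of the three joining criteria holds."""
    gap = gap_length(a, b)
    # 1: at least 4 features in the gap, at least one per 10,000 characters
    if gap_features >= 4 and gap_features * 10_000 >= gap:
        return True
    if gap >= 30_000:
        return False
    # 2: intervals at least twice as big as the gap (smaller one, assumed)
    if min(a.size, b.size) >= 2 * gap:
        return True
    # 3: feature density of the denser interval vs. the joined interval
    joined = (a.n_features + b.n_features + gap_features) / (
        a.size + b.size + gap)
    adjacent = max(a.n_features / a.size, b.n_features / b.size)
    return adjacent <= 3 * joined
```

Here \texttt{gap\_features} is the number of common features that lie in the
gap between the two detections.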

These parameters were fine-tuned to achieve the best results on the training
corpus. With these parameters, our algorithm achieved a total plagdet score
of 0.73 on the training corpus.

\section{Further discussion}

As in our PAN 2010 submission, we tried to make use of intrinsic plagiarism
detection. Despite further improvements to the intrinsic plagiarism
detector, we again failed to reach any significant improvement
when using it as a hint for external plagiarism detection.

In the full paper, we will also discuss the following topics:

\begin{itemize}
\item language detection and cross-language common features
\item intrinsic plagiarism detection
\item suitability of the plagdet score~\cite{potthastframework} for performance measurement
\item feasibility of our approach in large-scale systems
\item discussion of parameter settings
\end{itemize}

\nocite{pan09stamatatos}
%\nocite{ngram}

\bibliographystyle{splncs03}
\begin{raggedright}
\bibliography{paper}
\end{raggedright}

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%