\documentclass{llncs}
\usepackage[american]{babel}
%\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

\title{Multi-feature Plagiarism Detection}

\author{Jan Kasprzak \and \v{S}imon Suchomel \and Michal Brandejs}
\institute{Faculty of Informatics, Masaryk University \\
{\tt\{kas,suchomel,brandejs\}@fi.muni.cz}}

\maketitle

\section{General Approach}

The approach that the Masaryk University team used in the detailed
comparison sub-task of PAN 2012 plagiarism detection builds on the one
we used in PAN 2010 \cite{Kasprzak2010}, enhanced in several ways.

The algorithm evaluates each document pair in several stages:

\begin{itemize}
\item intrinsic plagiarism detection
\item language detection of the source document
\begin{itemize}
\item cross-lingual plagiarism detection, if the source document is not in English
\end{itemize}
\item detection of intervals with common features
\item post-processing, which mainly merges nearby common intervals
\end{itemize}

\section{Intrinsic plagiarism detection}

Our approach is based on character $n$-gram profiles of fixed-size
intervals (measured in $n$-grams) and their differences from the
profile of the whole document \cite{pan09stamatatos}. We have further
enhanced the approach by applying Gaussian smoothing to the style-change
function \cite{Kasprzak2010}.
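
As an illustration only, Gaussian smoothing of a discrete style-change
function can be written as a normalized weighted average over window
positions. The symbols $sc$, $\widetilde{sc}$ and the kernel width $\sigma$
are notation assumed for this sketch; the kernel parameters actually used
are not stated in this abstract.
\[
\widetilde{sc}(i) =
\frac{\sum_{j} sc(j)\, \exp\left(-\frac{(i-j)^2}{2\sigma^2}\right)}
     {\sum_{j} \exp\left(-\frac{(i-j)^2}{2\sigma^2}\right)}
\]
Here $j$ ranges over all window positions in the document, so that windows
close to position $i$ contribute most to the smoothed value.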

For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
of only 3-grams, and with a different measure of the difference between
the $n$-gram profiles. We have used an approach similar to \cite{ngram},
where we compute the profile as an ordered set of the 400 most frequent
$n$-grams in a given text (the whole document or a partial window). Apart
from ordering the set, we have ignored the actual number of occurrences
of a given $n$-gram altogether, and used a value inversely
proportional to the rank of the $n$-gram in the profile, in accordance with
Zipf's law \cite{zipf1935psycho}.
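
The rank-based profile described above can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not the code of
our submission: the function names {\tt ngram\_profile} and
{\tt profile\_difference} are hypothetical, pooling 1-, 2-, and 3-grams into
a single profile is an assumption, and the dissimilarity (a sum of absolute
weight differences) is only one plausible choice, since the exact measure
is not given here.

```python
from collections import Counter


def ngram_profile(text, n_values=(1, 2, 3), size=400):
    """Map each of the `size` most frequent character n-grams to 1/rank.

    Occurrence counts are used only to order the n-grams; the stored
    weight depends on the rank alone (rank 1 is the most frequent).
    Assumption: n-grams of all lengths share one pooled profile.
    """
    counts = Counter(
        text[i:i + n]
        for n in n_values
        for i in range(len(text) - n + 1)
    )
    # Descending count; ties broken alphabetically to keep the order stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return {gram: 1.0 / rank for rank, (gram, _) in enumerate(ranked, start=1)}


def profile_difference(window_profile, document_profile):
    """Hypothetical dissimilarity: sum of absolute weight differences.

    An n-gram missing from one profile contributes with weight 0 there.
    """
    grams = set(window_profile) | set(document_profile)
    return sum(
        abs(window_profile.get(g, 0.0) - document_profile.get(g, 0.0))
        for g in grams
    )
```

Evaluating {\tt profile\_difference} for each sliding window against the
profile of the whole document would then yield one possible style-change
function.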

This approach has provided a more stable style-change function
than the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
nature of the detailed comparison sub-task, we could not use the results
of the intrinsic detection directly, so we wanted to use them
as hints for the external detection.

\section{Cross-lingual detection}

%For language detection, we used the $n$-gram based categorization \cite{ngram}.
%We have computed the language profiles from the source documents of the
%training corpus (using the annotations from the corpus itself). The result
%of this approach was better than using the stopwords-based detection we have
%used in PAN 2010. However, there were still mis-detected documents,
%mainly the long lists of surnames and other tabular data. We have added
%an ad-hoc fix, where for documents having their profile too distant from all of
%English, German, and Spanish profiles, we have declared them to be in English.

For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate if possible, and to use the resulting document
as the source for the standard intra-lingual detector.
Should the translation service be unavailable, we wanted
to fall back to translating isolated words only,
with additional exact matching of longer words (we have used words of
5 characters or more).
We have assumed that these longer words can be names or specialized terms
present in both languages.

We have used dictionaries from several sources, such as
{\tt dicts.info\footnote{\url{http://www.dicts.info/}}},
{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}},
and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source
and translated documents were aligned on a line-by-line basis.

In the final form of the detailed comparison sub-task, machine
translations of the source documents were provided to the detector programs
by the surrounding environment, so we have dropped language detection
and machine translation from our submission altogether, and used only the
line-by-line alignment of the source and translated documents to calculate
the offsets of text features in the source document.
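
One way such a line-based offset calculation could look is sketched below.
It is a hypothetical illustration: the helper names {\tt line\_starts} and
{\tt map\_span\_to\_source} are made up, and the sketch assumes that line $k$
of the translated text corresponds to line $k$ of the original source and
that a span is mapped to whole source lines, which is coarser than what a
real detector may need.

```python
from bisect import bisect_right


def line_starts(text):
    """Return the character offset at which each line of `text` begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def map_span_to_source(offset, length, translated, source):
    """Map a span in the translated text to (offset, length) in the source.

    Assumption: both texts are aligned line by line, so the span is mapped
    to the block of whole source lines with the same line numbers.
    """
    t_starts = line_starts(translated)
    s_starts = line_starts(source)
    # Index of the line containing the first and the last character.
    first = bisect_right(t_starts, offset) - 1
    last = bisect_right(t_starts, offset + max(length, 1) - 1) - 1
    # Guard against a source with fewer lines than the translation.
    last = min(last, len(s_starts) - 1)
    first = min(first, last)
    src_begin = s_starts[first]
    src_end = s_starts[last + 1] if last + 1 < len(s_starts) else len(source)
    return src_begin, src_end - src_begin
```

A finer mapping inside a line would need word-level alignment, which this
sketch deliberately leaves out.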

\section{Multi-feature Plagiarism Detection}

Our pair-wise plagiarism detection is based on finding common passages
of text, present in both the source and the suspicious document. We call them
{\it features}. In PAN 2010, we used sorted word 5-grams, formed from
words of three or more characters, as the features to compare.
Other means of plagiarism detection have been explored recently;
stop-word $n$-gram detection is one of them
\cite{stamatatos2011plagiarism}.

We propose a plagiarism detection system based on detecting common
features of various types, such as word $n$-grams, stop-word $n$-grams,
translated words or word bigrams, and exact common longer words from document
pairs in which each document is in a different language. The system
has to be largely independent of the specifics of the individual
feature types. It cannot, for example, use the order of the features
as a measure of distance between them, because several
word 5-grams can be fully contained inside one stop-word 8-gram.

We thus define a {\it common feature} of two documents (susp and src)
as the following tuple:
$$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$$

In our final submission, we have used only the following two types
of common features:

\begin{itemize}
\item word 5-grams, from words of three or more characters, sorted, lowercased
\item stop-word 8-grams, from the 50 most frequent English words (including
        the possessive suffix 's), unsorted, lowercased, with 8-grams formed
        only from the seven most frequent words ({\it the, of, and, a, in, to, 's})
        removed
\end{itemize}
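
The first feature type and the common-feature tuple defined above can be
illustrated with the following sketch. It is not the code of our
submission: the names {\tt CommonFeature}, {\tt word\_ngram\_features}, and
{\tt common\_features} are hypothetical, tokenization by the regular
expression {\tt \textbackslash w+} is an assumption, and stop-word 8-grams
are left out because the 50-word list is not reproduced in this abstract.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class CommonFeature(NamedTuple):
    """The tuple <offset_susp, length_susp, offset_src, length_src>."""
    offset_susp: int
    length_susp: int
    offset_src: int
    length_src: int


def word_ngram_features(text, n=5, min_word_len=3):
    """Yield (key, offset, length) for sorted, lowercased word n-grams.

    Words shorter than `min_word_len` characters are skipped before the
    n-grams are formed; offsets and lengths are in characters.
    """
    words = [
        (m.group().lower(), m.start(), m.end())
        for m in re.finditer(r"\w+", text)
        if len(m.group()) >= min_word_len
    ]
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        key = tuple(sorted(w for w, _, _ in window))
        start, end = window[0][1], window[-1][2]
        yield key, start, end - start


def common_features(susp, src):
    """Pair every occurrence of a key present in both documents."""
    index = defaultdict(list)
    for key, offset, length in word_ngram_features(src):
        index[key].append((offset, length))
    return [
        CommonFeature(s_off, s_len, r_off, r_len)
        for key, s_off, s_len in word_ngram_features(susp)
        for r_off, r_len in index.get(key, [])
    ]
```

Because the words inside each 5-gram are sorted before comparison, a
passage with locally reordered words still produces a matching key.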

We have gathered all the common features for a given document pair, and formed
{\it valid intervals} from them, as described in \cite{Kasprzak2009a}
(a similar approach is also used in \cite{stamatatos2011plagiarism}).
The algorithm is modified for multi-feature detection to use character offsets
only, instead of feature order numbers. We have used valid intervals
consisting of at least 5 common features, with the maximum allowed gap
inside the interval (characters not belonging to any common feature
of a given valid interval) set to 3,500 characters.
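
To make the role of the two parameters concrete, the following sketch
groups common features by character offsets. It is a simplified greedy
stand-in and not the algorithm of \cite{Kasprzak2009a}: the function name
{\tt valid\_intervals} and its helpers are hypothetical, features are
processed once in the order of their offset in the suspicious document, and
the gap is measured against the bounds of the current group in both
documents.

```python
def _bounds(group):
    """Return (susp_begin, susp_end, src_begin, src_end) of a feature group."""
    susp_begin = min(f[0] for f in group)
    susp_end = max(f[0] + f[1] for f in group)
    src_begin = min(f[2] for f in group)
    src_end = max(f[2] + f[3] for f in group)
    return susp_begin, susp_end, src_begin, src_end


def _close_enough(group, feat, max_gap):
    """True if `feat` lies within `max_gap` characters of the group
    in both the suspicious and the source document."""
    _, susp_end, src_begin, src_end = _bounds(group)
    susp_gap = feat[0] - susp_end
    src_gap = max(feat[2] - src_end, src_begin - (feat[2] + feat[3]))
    return susp_gap <= max_gap and src_gap <= max_gap


def valid_intervals(features, min_features=5, max_gap=3500):
    """Group (offset_susp, length_susp, offset_src, length_src) tuples.

    Returns one tuple of the same shape per group that contains at least
    `min_features` common features.
    """
    intervals, group = [], []

    def emit():
        if len(group) >= min_features:
            susp_begin, susp_end, src_begin, src_end = _bounds(group)
            intervals.append((susp_begin, susp_end - susp_begin,
                              src_begin, src_end - src_begin))

    for feat in sorted(features):
        if group and not _close_enough(group, feat, max_gap):
            emit()
            group = []
        group.append(feat)
    emit()
    return intervals
```

The defaults mirror the values stated above (at least 5 common features,
gap of at most 3,500 characters); everything else in the sketch is an
assumption made for illustration.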

We have also experimented with adjusting the allowed gap size using the
intrinsic plagiarism detection: allowing only a shorter gap if the common
features around the gap belong to different passages detected as plagiarized
in the suspicious document by the intrinsic detector, and allowing a larger gap
if both surrounding common features belong to the same passage
detected by the intrinsic detector. This approach, however, did not show
any improvement over an allowed gap of a fixed size, so it was omitted
from the final submission.

\section{Postprocessing}


\section{Further discussion}

In the full paper, we will also discuss the following topics:

\begin{itemize}
\item language detection
\item suitability of plagdet score~\cite{potthastframework} for performance measurement
\item feasibility of our approach in large-scale systems
\item other possible features to use, especially for cross-lingual detection
\item discussion of parameter settings
\end{itemize}
 166
 167 \bibliographystyle{splncs03}
 168 \begin{raggedright}
 169 \bibliography{paper}
 170 \end{raggedright}
 171
 172 \end{document}
 173
 174
 175 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 176