\section{Detailed Document Comparison}~\label{yenya}

\label{detailed}

The detailed comparison task of PAN 2012 consisted of comparing
given document pairs, with the expected output being the annotation of
similarities found between these documents.
The submitted program was run in a controlled environment
separately for each document pair, without the possibility of keeping any
data between runs.

In this section, we describe our approach to the detailed comparison
task. The rest of this section is organized as follows: in the next
subsection, we summarise the differences from our previous approach.
In subsection \ref{sec-alg-overview}, we give an overview of our approach.
TODO: describe the final organization of this section.

\subsection{Differences from PAN 2010}

Our approach in this task
is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
The main difference is that instead of looking for similarities of
one type (for PAN 2010, we used word 5-grams),
we have developed a method of evaluating multiple types of similarities
(we call them {\it common features}) with different properties, such as
density and length.

As a proof of concept, we have used two types of common features: word
5-grams and stopword 8-grams, the latter being based on the method described in
\cite{stamatatos2011plagiarism}.

In addition to the above, we have made several minor improvements to the
algorithm, such as parameter tuning and improved merging of detections
in the post-processing stage.

\subsection{Algorithm Overview}
\label{sec-alg-overview}

The algorithm evaluates the document pair in several stages:

\begin{itemize}
\item intrinsic plagiarism detection
\item language detection of the source document
\begin{itemize}
\item cross-lingual plagiarism detection, if the source document is not in English
\end{itemize}
\item detection of intervals with common features
\item post-processing, which mainly serves to merge nearby common intervals
\end{itemize}

\subsection{Multi-feature Plagiarism Detection}

Our pair-wise plagiarism detection is based on finding common passages
of text, present both in the source and in the suspicious document. We call them
{\it common features}. In PAN 2010, we used sorted word 5-grams, formed from
words of three or more characters, as features to compare.
Recently, other means of plagiarism detection have been explored:
stopword $n$-gram detection is one of them
\cite{stamatatos2011plagiarism}.

We propose a plagiarism detection system based on detecting common
features of various types, for example word $n$-grams, stopword $n$-grams,
translated single words, translated word bigrams, or
exact matches of longer words in document pairs where each document
is in a different language. The system
has to be largely independent of the specifics of the various
feature types. It cannot, for example, use the order of given features
as a measure of distance between the features, because
several word 5-grams can be fully contained inside one stopword 8-gram.

We therefore propose to describe a {\it common feature} of two documents
(susp and src) with the following tuple:
$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
described purely in terms of the character offsets it occupies
in both documents. In our final submission, we have used the following two types
of common features:

\begin{itemize}
\item word 5-grams, from words of three or more characters, sorted, lowercased
\item stopword 8-grams, from the 50 most frequent English words (including
        the possessive suffix 's), unsorted, lowercased, with 8-grams formed
        only from the seven most frequent words ({\it the, of, a, in, to, 's})
        removed
\end{itemize}
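
As an illustration only, the following Python fragment sketches how common
features of the first type could be collected as offset tuples. The function
names \texttt{word\_ngrams} and \texttt{common\_features}, the
\texttt{extract} parameter, and the tokenizing regular expression are
assumptions made for this sketch; they are not taken from the submitted program.

\begin{verbatim}
```python
import re
from collections import defaultdict


def word_ngrams(text, n=5, min_len=3):
    """Yield (key, offset, length) for sorted, lowercased word n-grams."""
    # Keep only words of at least min_len characters, with their offsets.
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)
             if len(m.group()) >= min_len]
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        key = tuple(sorted(word for word, _, _ in window))
        start, end = window[0][1], window[-1][2]
        yield key, start, end - start


def common_features(susp, src, extract=word_ngrams):
    """Return tuples (offset_susp, length_susp, offset_src, length_src)."""
    index = defaultdict(list)
    for key, offset, length in extract(src):
        index[key].append((offset, length))
    return [(offset, length, src_offset, src_length)
            for key, offset, length in extract(susp)
            for src_offset, src_length in index.get(key, ())]
```
\end{verbatim}

Because the result holds character offsets only, the tuples produced by a
second extractor (for instance one yielding stopword 8-grams) could simply be
appended to the same list.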

We have gathered all the common features of both types for a given document
pair, and formed {\it valid intervals} from them, as described
in \cite{Kasprzak2009a}. A similar approach is also used in
\cite{stamatatos2011plagiarism}.
For multi-feature detection, the algorithm is modified to use character offsets
only, instead of feature order numbers. We have used valid intervals
consisting of at least 5 common features, with the maximum allowed gap
inside the interval (characters not belonging to any common feature
of a given valid interval) set to 3,500 characters.
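
A much simplified, greedy reading of the valid-interval construction might
look as follows. It is not the algorithm of \cite{Kasprzak2009a}: the function
name, the single pass in order of suspicious-document offsets, and the way the
gap is measured against the span covered so far in the source document are
assumptions of this sketch. Only the two thresholds (5 features, 3,500
characters) are taken from our settings.

\begin{verbatim}
```python
def valid_intervals(features, min_features=5, max_gap=3500):
    """Group (off_susp, len_susp, off_src, len_src) tuples greedily.

    Returns a list of groups; each group is a list of feature tuples.
    """
    groups, current = [], []
    susp_end = src_lo = src_hi = 0
    for feat in sorted(features):
        off_susp, len_susp, off_src, len_src = feat
        # The feature extends the current group only if it is close
        # enough to it in both documents.
        fits = (current
                and off_susp - susp_end <= max_gap
                and off_src - src_hi <= max_gap
                and src_lo - (off_src + len_src) <= max_gap)
        if fits:
            susp_end = max(susp_end, off_susp + len_susp)
            src_lo = min(src_lo, off_src)
            src_hi = max(src_hi, off_src + len_src)
        else:
            if len(current) >= min_features:
                groups.append(current)
            current = []
            susp_end = off_susp + len_susp
            src_lo, src_hi = off_src, off_src + len_src
        current.append(feat)
    if len(current) >= min_features:
        groups.append(current)
    return groups
```
\end{verbatim}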

We have also experimented with adjusting the allowed gap size using
intrinsic plagiarism detection: allowing only a shorter gap if the common
features around the gap belong to different passages detected as plagiarized
in the suspicious document by the intrinsic detector, and allowing a larger gap
if both surrounding common features belong to the same passage
detected by the intrinsic detector. This approach, however, did not show
any improvement over an allowed gap of a static size, so it was omitted
from the final submission.

\subsection{Postprocessing}

In the postprocessing phase, we took the resulting valid intervals
and attempted to further improve the results. We first
removed overlaps: if both overlapping intervals were
shorter than 300 characters, we removed both of them. Otherwise, we
kept the longer detection (longer in terms of length in the suspicious document).
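
One possible reading of the overlap rule is sketched below. We assume here
that each detection starts with a half-open character range in the suspicious
document, that overlaps are tested in that document, and that they are
resolved pairwise in order of increasing offset; the function name
\texttt{remove\_overlaps} is hypothetical.

\begin{verbatim}
```python
def remove_overlaps(detections, short=300):
    """detections: tuples starting with (susp_start, susp_end)."""
    kept = []
    for det in sorted(detections):
        if kept and det[0] < kept[-1][1]:
            # det overlaps the most recently kept detection.
            prev = kept.pop()
            len_prev = prev[1] - prev[0]
            len_det = det[1] - det[0]
            if len_prev < short and len_det < short:
                continue  # both are short: drop both
            kept.append(prev if len_prev >= len_det else det)
        else:
            kept.append(det)
    return kept
```
\end{verbatim}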

We then joined adjacent valid intervals into one detection
if at least one of the following criteria was met:
\begin{itemize}
\item the gap between the intervals contained at least 4 common features,
and it contained at least one feature per 10,000
characters\footnote{We computed the length of the gap as the number
of characters between the detections in the source document, plus the
number of characters between the detections in the suspicious document.}, or
\item the gap was smaller than 30,000 characters and the size of the adjacent
valid intervals was at least twice as big as the gap between them, or
\item the gap was smaller than 30,000 characters and the number of common
features per character in the adjacent interval was not more than three times
bigger than the number of features per character in the possible joined interval.
\end{itemize}
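
The three criteria can be written as a single predicate, sketched here under
several assumptions that the description above leaves open: interval sizes are
positive and measured in the same units as the gap, the second criterion
compares the combined size of both intervals with the gap, and the third
criterion must hold for the denser of the two intervals. The function and
parameter names are hypothetical.

\begin{verbatim}
```python
def should_join(gap, gap_feats, size_a, feats_a, size_b, feats_b):
    """Decide whether two adjacent valid intervals should be merged."""
    # Criterion 1: enough features in the gap, at least 1 per 10,000 chars.
    if gap_feats >= 4 and gap_feats * 10_000 >= gap:
        return True
    # Criteria 2 and 3 apply only to gaps under 30,000 characters.
    if gap >= 30_000:
        return False
    # Criterion 2: the intervals are at least twice as big as the gap.
    if size_a + size_b >= 2 * gap:
        return True
    # Criterion 3: joining does not dilute the feature density too much.
    joined = (feats_a + gap_feats + feats_b) / (size_a + gap + size_b)
    densest = max(feats_a / size_a, feats_b / size_b)
    return densest <= 3 * joined
```
\end{verbatim}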

These parameters were fine-tuned to achieve the best results on the training
corpus. With these parameters, our algorithm achieved an overall plagdet
score of 0.73 on the training corpus.

\subsection{Other Approaches Tried}

We have evaluated several other approaches that were
omitted from our final submission for various reasons. We think they
are nevertheless worth mentioning here.

\subsubsection{Intrinsic Plagiarism Detection}

Our approach is based on character $n$-gram profiles of an interval of
fixed size (in terms of $n$-grams), and their differences from the
profile of the whole document \cite{pan09stamatatos}. We have further
enhanced the approach by applying Gaussian smoothing to the style-change
function \cite{Kasprzak2010}.

For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
of only 3-grams, and with a different measure of the difference between
the $n$-gram profiles. We have used an approach similar to \cite{ngram},
where we computed the profile as an ordered set of the 400 most frequent
$n$-grams in a given text (the whole document or a partial window). Apart
from ordering the set, we ignored the actual number of occurrences
of a given $n$-gram altogether, and used a value inversely
proportional to the $n$-gram order in the profile, in accordance with
Zipf's law \cite{zipf1935psycho}.
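
A minimal sketch of such a rank-weighted profile is given below. Pooling
1-, 2-, and 3-grams into one profile and using exactly $1/\hbox{rank}$ as the
value are assumptions of this illustration, as is the function name
\texttt{rank\_profile}.

\begin{verbatim}
```python
from collections import Counter


def rank_profile(text, sizes=(1, 2, 3), top=400):
    """Map the `top` most frequent character n-grams to 1/rank."""
    counts = Counter(text[i:i + n]
                     for n in sizes
                     for i in range(len(text) - n + 1))
    ranked = [gram for gram, _ in counts.most_common(top)]
    # The occurrence counts are discarded; only the rank is kept.
    return {gram: 1.0 / rank for rank, gram in enumerate(ranked, start=1)}
```
\end{verbatim}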

This approach provided a more stable style-change function
than the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
nature of the detailed comparison sub-task, we could not use the results
of the intrinsic detection directly; we therefore wanted to use them
as hints for the external detection.

\subsubsection{Language Detection}

For language detection, we used $n$-gram based categorization \cite{ngram}.
We computed the language profiles from the source documents of the
training corpus (using the annotations from the corpus itself). The results
of this approach were better than those of the stopword-based detection we
used in PAN 2010. However, there were still mis-detected documents,
mainly long lists of surnames and other tabular data. We added
an ad-hoc fix: documents whose profile was too distant from all of the
English, German, and Spanish profiles were declared to be in English.
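
The fallback can be illustrated with the out-of-place rank distance that is
commonly used in $n$-gram based categorization. Whether our detector used
exactly this distance is not stated here, so the measure, the
\texttt{max\_distance} threshold, and both function names should be read as
assumptions of the sketch.

\begin{verbatim}
```python
def out_of_place(doc_ranked, lang_ranked):
    """Sum of rank displacements between two ranked n-gram lists."""
    position = {gram: rank for rank, gram in enumerate(lang_ranked)}
    missing = len(lang_ranked)  # penalty for n-grams not in the profile
    return sum(abs(rank - position[gram]) if gram in position else missing
               for rank, gram in enumerate(doc_ranked))


def detect_language(doc_ranked, lang_profiles, max_distance, default="en"):
    """Pick the closest profile; use `default` if all are too distant."""
    distances = {lang: out_of_place(doc_ranked, profile)
                 for lang, profile in lang_profiles.items()}
    best = min(distances, key=distances.get)
    return best if distances[best] <= max_distance else default
```
\end{verbatim}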

\subsubsection{Cross-lingual Plagiarism Detection}

For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate if possible, and to use the resulting document
as the source for the standard intra-lingual detector.
Should the translation service not be available, we wanted
to use the fall-back strategy of translating isolated words only,
with additional exact matching of longer words (we used words of
5 characters or longer).
We assumed that these longer words could be names or specialized terms,
present in both languages.

We used dictionaries from several sources, such as
{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
and translated document were aligned on a line-by-line basis.

In the final form of the detailed comparison sub-task, the results of machine
translation of the source documents were provided to the detector programs
by the surrounding environment, so we discarded the language detection
and machine translation from our submission altogether, and used only
line-by-line alignment of the source and translated document for calculating
the offsets of text features in the source document. We then treated
the translated documents the same way as the source documents in English.

\subsection{Further Discussion}

As in our PAN 2010 submission, we tried to make use of intrinsic plagiarism
detection, but despite making further improvements to the intrinsic
plagiarism detector, we again failed to reach any significant improvement
when using it as a hint for the external plagiarism detection.

In the full paper, we will also discuss the following topics:

\begin{itemize}
\item language detection and cross-language common features
\item intrinsic plagiarism detection
\item suitability of the plagdet score~\cite{potthastframework} for performance measurement
\item feasibility of our approach in large-scale systems
\item discussion of parameter settings
\end{itemize}

\nocite{pan09stamatatos}
\nocite{ngram}

\endinput

What I want to discuss in the conclusion:
- it was not possible to cache data
- it was not possible to exclude overlapping similarities
- so the run-time figures are completely useless
- 669 lines of code, not counting comments and blank lines
- the boundary between passages sometimes included whitespace and sometimes did not.

Discussion of plagdet:
- users want "odevzdej to show 0\% similarity"; they do not care
        what the number means
- the boundaries of the detected passage do not matter
- false positives are far worse
- granularity is evil

Final results on the test corpus:

0.7288 0.5994 0.9306 1.0007   2012-06-16 02:23   plagdt recall precis granul
                            01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
                            02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
                            03-artificial-low    0.8726 0.8099 0.9477 1.0013
                            04-artificial-high   0.3649 0.2255 0.9562 1.0000
                            05-translation       0.7610 0.6662 0.8884 1.0008
                            06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000

Results on the competition data:
plagdet         precision       recall          granularity
0.6826726       0.8931670       0.5524708       1.0000000

Run-time:
12500 seconds for tokenization including sc and language detection
2500 seconds without sc and language detection
14 seconds for evaluation of valid intervals and postprocessing


TODO:
- boundary according to matching density
- sort the XML by this_offset

Here is the content of the JOURNAL file - how I measured some of the improvements:
=================================================================
baseline.py
0.1250 0.1259 0.9783 2.4460   2012-05-03 06:02   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
                            03_artificial_low    0.1006 0.1118 0.9979 2.9974
                            04_artificial_high   0.0054 0.0029 0.9991 1.0778
                            05_translation       0.0003 0.0002 1.0000 1.2143
                            06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075

valid_intervals without postprocessing (this is how I first submitted it)
0.3183 0.2034 0.9883 1.0850   2012-05-25 15:25   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
                            03_artificial_low    0.4127 0.3006 0.9975 1.1724
                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
                            05_translation       0.0001 0.0000 1.0000 1.0000
                            06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812

postprocessed (merging of nearby intervals)
0.3350 0.2051 0.9863 1.0188   2012-05-25 15:27   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
                            03_artificial_low    0.4541 0.3057 0.9942 1.0417
                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
                            05_translation       0.0001 0.0000 1.0000 1.0000
                            06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032

whitespace (adjusted whitespace handling)
0.3353 0.2053 0.9858 1.0188   2012-05-31 17:57   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
                            03_artificial_low    0.4546 0.3061 0.9940 1.0417
                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
                            05_translation       0.0001 0.0000 1.0000 1.0000
                            06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032

gap_100: whitespace, + allow a gap of 100 5-grams inside a valid interval instead of 50
0.3696 0.2305 0.9838 1.0148   2012-05-31 18:07   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
                            03_artificial_low    0.5423 0.3846 0.9922 1.0310
                            04_artificial_high   0.0058 0.0029 0.9151 1.0000
                            05_translation       0.0001 0.0000 1.0000 1.0000
                            06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000

gap_200: whitespace, + allow a gap of 200 5-grams inside a valid interval instead of 50
0.3906 0.2456 0.9769 1.0070   2012-05-31 18:09   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
                            03_artificial_low    0.5976 0.4346 0.9875 1.0139
                            04_artificial_high   0.0087 0.0044 0.9374 1.0000
                            05_translation       0.0001 0.0001 1.0000 1.0000
                            06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000

gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
0.4436 0.2962 0.9660 1.0308   2012-05-31 18:11   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
                            03_artificial_low    0.7048 0.5808 0.9873 1.0530
                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
                            05_translation       0.0008 0.0004 1.0000 1.0000
                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000

no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
0.4432 0.2959 0.9658 1.0310   2012-06-01 16:41   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
                            03_artificial_low    0.7045 0.5806 0.9872 1.0530
                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
                            05_translation       0.0000 0.0000 0.0000 1.0000
                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000


swng_unsorted with the same postprocessing as "whitespace" above
0.2673 0.1584 0.9281 1.0174   2012-05-31 14:20   plagdt recall precis granul
                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
                            02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
                            03_artificial_low    0.3178 0.1952 0.9954 1.0377
                            04_artificial_high   0.0169 0.0095 0.9581 1.1707
                            05_translation       0.0042 0.0028 0.0080 1.0000
                            06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000

swng_sorted
0.2550 0.1906 0.4067 1.0253   2012-05-30 16:07   plagdt recall precis granul
                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
                            02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
                            03_artificial_low    0.4093 0.2867 0.8093 1.0483
                            04_artificial_high   0.0454 0.0253 0.4371 1.0755
                            05_translation       0.0030 0.0019 0.0064 1.0000
                            06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106

sort_susp: gap_200_int_10 + in postprocessing, sort intervals by offset in susp, not in src
0.4437 0.2962 0.9676 1.0308   2012-06-01 18:06   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
                            03_artificial_low    0.7048 0.5809 0.9871 1.0530
                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
                            05_translation       0.0008 0.0004 1.0000 1.0000
                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000

post_gap2_16000: sort_susp, + merge two intervals if the gap is < 16000 characters and the gap is only half the size of those intervals (was 4000)
0.4539 0.2983 0.9642 1.0054   2012-06-01 18:09   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
                            03_artificial_low    0.7307 0.5883 0.9814 1.0094
                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
                            05_translation       0.0008 0.0004 1.0000 1.0000
                            06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000

post_gap2_32000: sort_susp, + merge intervals < 32000 characters and gap at least half the size
0.4543 0.2986 0.9638 1.0050   2012-06-01 18:12   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
                            03_artificial_low    0.7315 0.5893 0.9798 1.0085
                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
                            05_translation       0.0008 0.0004 1.0000 1.0000
                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000

post_gap2_64000: sort_susp, + merge intervals < 32000 characters and gap at least half the size
0.4543 0.2988 0.9616 1.0050   2012-06-01 18:21   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
                            03_artificial_low    0.7316 0.5901 0.9782 1.0085
                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
                            05_translation       0.0008 0.0004 1.0000 1.0000
                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000

post_gap1_2000: post_gap2_32000, + unconditionally join items with a gap under 2000 (was 600)
0.4543 0.2986 0.9635 1.0050   2012-06-01 18:29   plagdt recall precis granul
                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
                            03_artificial_low    0.7315 0.5895 0.9794 1.0085
                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
                            05_translation       0.0008 0.0004 1.0000 1.0000
                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000