yenya-detailed.tex

   1 \section{Detailed Document Comparison}~\label{yenya}
   2
   3 \label{detailed}
   4
   5 The detailed comparison task of PAN 2012 consisted in a comparison
   6 of given document pairs, with the expected output being the annotation of
   7 similarities found between these documents.
   8 The submitted program has been run in a controlled environment
   9 separately for each document pair, without the possibility of keeping any
  10 data between runs.
  11
  12 In this section, we describe our approach in the detailed comparison
  13 task. The rest of this section is organized as follows: in the next
  14 subsection, we summarise the differences from our previous approach.
  15 In subsection \ref{sec-alg-overview}, we give an overview of our approach.
  16 TODO napsat jak to nakonec bude.
  17
  18 \subsection{Differences Against PAN 2010}
  19
  20 Our approach in this task
  21 is loosely based on the approach we have used in PAN 2010 \cite{Kasprzak2010}.
  22 The main difference is that instead of looking for similarities of
  23 one type (for PAN 2010, we have used word 5-grams),
  24 we have developed a method of evaluating multiple types of similarities
  25 (we call them {\it common features}) of different properties, such as
  26 density and length.
  27
  28 As a proof of concept, we have used two types of common features: word
  29 5-grams and stop-word 8-grams, the later being based on the method described in
  30 \cite{stamatatos2011plagiarism}.
  31
  32 In addition to the above, we have made several minor improvements to the
  33 algorithm, such as parameter tuning and improving the detections
  34 merging in the post-processing stage.
  35
  36 \subsection{Algorithm Overview}
  37 \label{sec-alg-overview}
  38
  39 The algorithm evaluates the document pair in several stages:
  40
  41 \begin{itemize}
  42 \item tokenizing both the suspicious and source documents
  43 \item forming {\it features} from some tokens
  44 \item discovering {\it common features}
  45 \item making {\it valid intervals} from common features
  46 \item postprocessing
  47 \end{itemize}
  48
  49 \subsection{Tokenization}
  50
  51 We tokenize the document into words, where word is a sequence of one
  52 or more characters of the {\it Letter} Unicode class.
  53 With each word, two additional attributes, needed for further processing,
  54 are associated: the offset where the word begins, and the word length.
  55
  56 The offset where the word begins is not necessarily the first letter character
  57 of the word itself. We have discovered that in the training corpus,
  58 some plagiarized passages were annotated including the preceding
  59 non-letter characters. We have used the following heuristics to add
  60 parts of the inter-word gap to the previous, or the next adjacent word:
  61
  62 \begin{itemize}
  63 \item when the inter-word gap contains interpunction (any of the dot,
  64 semicolon, colon, comma, exclamation mark, question mark, or quotes),
  65 add the characters up to, and including the interpunction, to the previous
  66 word, ignore the space character(s) after the interpunction, and add
  67 the rest to tne next word.
  68 \item otherwise, when the inter-word gap contains newline, add the character
  69 before the first newline to the previous word, ignore the first newline
  70 character, and add the rest to the next word.
  71 \item otherwise, ignore the inter-word gap characters altogether.
  72 \end{itemize}
  73
  74 When the detection program has been presented three different
  75 files instead of two (meaning the third one is machine-translated
  76 version of the second one), we have tokenized the translated document instead
  77 of the source one. We have make use of the line-by-line alignment of the
  78 source and machine-translated documents to transform the word offsets
  79 and lengths in the translated document to the terms of the source document.
  80
  81 \subsection{Features}
  82
  83 We have used features of two types:
  84
  85 \begin{itemize}
  86 \item Lexicographically sorted word 5-grams, formed of words at least
  87 three characters long, and
  88 \item unsorted stop-word 8-grams, formed from 50 most frequent words in English,
  89 as described in \cite{stamatatos2011plagiarism}. We have further ignored
  90 the 8-grams, formed solely from the six most frequent English words
  91 (the, of, and, a, in, to), or the possessive {\it'{}s}.
  92 \end{itemize}
  93
  94 We have represented each feature with the 32 highest-order bits of its
  95 MD5 digest. This is only a performance optimization, targeted for
  96 larger systems. The number of features in a document pair is several orders
  97 of magnitude lower than $2^{32}$, so the probability of hash function
  98 collision is low. For pair-wise comparison, it would be feasible to compare
  99 the features directly instead of their MD5 sums.
 100
 101 Each feature has also the offset and length attributes.
 102 Offset is taken as the offset of the first word in a given feature,
 103 and length is the offset of the last character in a given feature
 104 minus the offset of the feature itself.
 105
 106 \subsection{Common Features}
 107
 108 For further processing, we have taken into account only the features
 109 present both in source and suspicious document. For each such
 110 {\it common feature}, we have created the list of
 111 $(\makebox{offset}, \makebox{length})$ pairs for the source document,
 112 and a similar list for the suspicious document. Note that a given feature
 113 can occur multiple times both in source and suspicious document.
 114
 115 \subsection{Valid Intervals}
 116
 117 To detect a plagiarized passage, we need to find a set of common features,
 118 which map to a dense-enough interval both in the source and suspicious
 119 document. In our previous work, we have presented the algorithm
 120 for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
 121 A similar approach is used also in \cite{stamatatos2011plagiarism}.
 122 Both of these algorithms use the features of a single type, which
 123 allows to use the ordering of features as a measure of distance.
 124
 125 When we use features of different types, there is no natural ordering
 126 of them: for example, a stop-word 8-gram can span multiple sentences,
 127 which can contain several word 5-grams. The assumption of both of the
 128 above algorithms, that the last character of the previous feature
 129 is before the last character of the current feature, is broken.
 130
 131 We have modified the algorithm for computing valid intervals with
 132 multi-feature detection to use character offsets
 133 only instead of feature order numbers. We have used valid intervals
 134 consisting of at least 4 common features, with the maximum allowed gap
 135 inside the interval (characters not belonging to any common feature
 136 of a given valid interval) set to 4000 characters.
 137
 138 \subsection{Postprocessing}
 139
 140 In the postprocessing phase, we took the resulting valid intervals,
 141 and made attempt to further improve the results. We have firstly
 142 removed overlaps: if both overlapping intervals were
 143 shorter than 300 characters, we have removed both of them. Otherwise, we
 144 kept the longer detection (longer in terms of length in the suspicious document).
 145
 146 We have then joined the adjacent valid intervals into one detection,
 147 if at least one of the following criteria has been met:
 148 \begin{itemize}
 149 \item the gap between the intervals contained at least 4 common features,
 150 and it contained at least one feature per 10,000
 151 characters\footnote{we have computed the length of the gap as the number
 152 of characters between the detections in the source document, plus the
 153 number of charaters between the detections in the suspicious document.}, or
 154 \item the gap was smaller than 30,000 characters and the size of the adjacent
 155 valid intervals was at least twice as big as the gap between them, or
 156 \item the gap was smaller than 30,000 characters and the number of common
 157 features per character in the adjacent interval was not more than three times
 158 bigger than number of features per character in the possible joined interval.
 159 \end{itemize}
 160
 161 \subsection{Results}
 162
 163 These parameters were fine-tuned to achieve the best results on the training
 164 corpus. With these parameters, our algorithm got the total plagdet score
 165 of 0.7288 on the training corpus. The details of the performance of
 166 our algorithm are presented in Table \ref{table-final}.
 167 In the PAN 2012 competition, we have acchieved the plagdet score
 168 of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000.
 169
 170 \begin{table}
 171 \begin{center}
 172 \begin{tabular}{|l|r|r|r|r|}
 173 \hline
 174 &plagdet&recall&precision&granularity\\
 175 \hline
 176 whole corpus&0.7288&0.5994&0.9306&1.0007\\
 177 \hline
 178 01-no-plagiarism    &0.0000&0.0000&0.0000&1.0000\\
 179 02-no-obfuscation   &0.9476&0.9627&0.9330&1.0000\\
 180 03-artificial-low   &0.8726&0.8099&0.9477&1.0013\\
 181 04-artificial-high  &0.3649&0.2255&0.9562&1.0000\\
 182 05-translation      &0.7610&0.6662&0.8884&1.0008\\
 183 06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
 184 \hline
 185 \end{tabular}
 186 \end{center}
 187 \caption{Performance on the training corpus}
 188 \label{table-final}
 189 \end{table}
 190
 191 \subsection{Other Approaches Explored}
 192
 193 There are several other approaches we have evaluated, but which were
 194 omitted from our final submission for various reasons. We think mentioning
 195 them here is worthwhile nevertheless.
 196
 197 \subsubsection{Intrinsic Plagiarism Detection}
 198
 199 Our approach is based on character $n$-gram profiles of the interval of
 200 the fixed size (in terms of $n$-grams), and their differences to the
 201 profile of the whole document \cite{pan09stamatatos}. We have further
 202 enhanced the approach with using gaussian smoothing of the style-change
 203 function \cite{Kasprzak2010}.
 204
 205 For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
 206 of only 3-grams, and using the different measure of the difference between
 207 the n-gram profiles. We have used an approach similar to \cite{ngram},
 208 where we have compute the profile as an ordered set of 400 most-frequent
 209 $n$-grams in a given text (the whole document or a partial window). Apart
 210 from ordering the set, we have ignored the actual number of occurrences
 211 of a given $n$-gram altogether, and used the value inveresly
 212 proportional to the $n$-gram order in the profile, in accordance with
 213 the Zipf's law \cite{zipf1935psycho}.
 214
 215 This approach has provided more stable style-change function than
 216 than the one proposed in \cite{pan09stamatatos}. Because of pair-wise
 217 nature of the detailed comparison sub-task, we couldn't use the results
 218 of the intrinsic detection immediately, therefore we wanted to use them
 219 as hints to the external detection.
 220
 221 We have also experimented with modifying the allowed gap size using the
 222 intrinsic plagiarism detection: to allow only shorter gap if the common
 223 features around the gap belong to different passages, detected as plagiarized
 224 in the suspicious document by the intrinsic detector, and allow larger gap,
 225 if both the surrounding common features belong to the same passage,
 226 detected by the intrinsic detector. This approach, however, did not show
 227 any improvement against allowed gap of a static size, so it was omitted
 228 from the final submission.
 229
 230 \subsubsection{Language Detection}
 231
 232 For language detection, we used the $n$-gram based categorization \cite{ngram}.
 233 We have computed the language profiles from the source documents of the
 234 training corpus (using the annotations from the corpus itself). The result
 235 of this approach was better than using the stopwords-based detection we have
 236 used in PAN 2010. However, there were still mis-detected documents,
 237 mainly the long lists of surnames and other tabular data. We have added
 238 an ad-hoc fix, where for documents having their profile too distant from all of
 239 English, German, and Spanish profiles, we have declared them to be in English.
 240
 241 \subsubsection{Cross-lingual Plagiarism Detection}
 242
 243 For cross-lingual plagiarism detection, our aim was to use the public
 244 interface to Google translate if possible, and use the resulting document
 245 as the source for standard intra-lingual detector.
 246 Should the translation service not be available, we wanted
 247 to use the fall-back strategy of translating isolated words only,
 248 with the additional exact matching of longer words (we have used words with
 249 5 characters or longer).
 250 We have supposed that these longer words can be names or specialized terms,
 251 present in both languages.
 252
 253 We have used dictionaries from several sources, like
 254 {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
 255 {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
 256 and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
 257 and translated document were aligned on a line-by-line basis.
 258
 259 In the final form of the detailed comparison sub-task, the results of machine
 260 translation of the source documents were provided to the detector programs
 261 by the surrounding environment, so we have discarded the language detection
 262 and machine translation from our submission altogether, and used only
 263 line-by-line alignment of the source and translated document for calculating
 264 the offsets of text features in the source document. We have then treated
 265 the translated documents the same way as the source documents in English.
 266
 267 \subsection{Performance Notes}
 268
 269 We consider comparing the performance of PAN 2012 submissions almost
 270 meaningless, because any sane system would precompute features for all
 271 documents in a given set of suspicious and source documents, and use the
 272 results for pair-wise comparison, expecting that any document will take
 273 part in more than one pair.
 274
 275 We did not use this exact split in our submission, but in order to be able
 276 to evaluate various approaches faster, we have split our computation into
 277 the following two parts: in the first part, common features have been
 278 computed, and the results stored into a file\footnote{We have use the
 279 {\tt Storable.pm} storage available in Perl.}. The second part
 280 then used this data to compute valid intervals and do post-processing.
 281
 282 We did our development on a machine with four six-core AMD 8139 CPUs
 283 (2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
 284 on this host, and the second phase took 14 seconds. Computing the
 285 plagdet score using the official script in Python took between 120 and
 286 180 seconds, as there is no parallelism in this script.
 287
 288 When we have tried to use intrinsic plagiarism detection and language
 289 detection, the first phase took about 12500 seconds. Thus omitting these
 290 featurs clearly provided huge performance improvement.
 291
 292 The code has been written in Perl, and had about 669 lines of code,
 293 not counting comments and blank lines.
 294
 295 \subsection{Further discussion}
 296
 297 As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
 298 detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
 299 when using it as a hint for the external plagiarism detection.
 300
 301 In the full paper, we will also discuss the following topics:
 302
 303 \begin{itemize}
 304 \item language detection and cross-language common features
 305 \item intrinsic plagiarism detection
 306 \item suitability of plagdet score\cite{potthastframework} for performance measurement
 307 \item feasibility of our approach in large-scale systems
 308 \item discussion of parameter settings
 309 \end{itemize}
 310
 311 \nocite{pan09stamatatos}
 312 \nocite{ngram}
 313
 314 \endinput
 315
 316 Co chci diskutovat v zaveru:
 317 - nebylo mozno cachovat data
 318 - nebylo mozno vylucovat prekryvajici se podobnosti
 319 - cili udaje o run-time jsou uplne nahouby
 320 - 669 radku kodu bez komentaru a prazdnych radku
 321 - hranice mezi pasazema nekdy zahrnovala whitespace a nekdy ne.
 322
 323 Diskuse plagdet:
 324 - uzivatele chteji "aby odevzdej ukazovalo 0\% shody", nezajima je
 325         co to cislo znamena
 326 - nezalezi na hranicich detekovane pasaze
 327 - false-positives jsou daleko horsi
 328 - granularita je zlo
 329
 330 Finalni vysledky nad testovacim korpusem:
 331
 332 0.7288 0.5994 0.9306 1.0007   2012-06-16 02:23   plagdt recall precis granul
 333                             01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
 334                             02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
 335                             03-artificial-low    0.8726 0.8099 0.9477 1.0013
 336                             04-artificial-high   0.3649 0.2255 0.9562 1.0000
 337                             05-translation       0.7610 0.6662 0.8884 1.0008
 338                             06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
 339
 340 Vysledky nad souteznimi daty:
 341 plagdet         precision       recall          granularity
 342 0.6826726       0.8931670       0.5524708       1.0000000
 343
 344 Run-time:
 345 12500 sekund tokenizace vcetne sc a detekce jazyka
 346 2500 sekund bez sc a detekce jazyka
 347 14 sekund vyhodnoceni valid intervalu a postprocessing
 348
 349
 350 TODO:
 351 - hranici podle hustoty matchovani
 352 - xml tridit podle this_offset
 353
 354 Tady je obsah souboru JOURNAL - jak jsem meril nektera vylepseni:
 355 =================================================================
 356 baseline.py
 357 0.1250 0.1259 0.9783 2.4460   2012-05-03 06:02   plagdt recall precis granul
 358                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 359                             02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
 360                             03_artificial_low    0.1006 0.1118 0.9979 2.9974
 361                             04_artificial_high   0.0054 0.0029 0.9991 1.0778
 362                             05_translation       0.0003 0.0002 1.0000 1.2143
 363                             06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
 364
 365 valid_intervals bez postprocessingu (takhle jsem to poprve odevzdal)
 366 0.3183 0.2034 0.9883 1.0850   2012-05-25 15:25   plagdt recall precis granul
 367                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 368                             02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
 369                             03_artificial_low    0.4127 0.3006 0.9975 1.1724
 370                             04_artificial_high   0.0008 0.0004 1.0000 1.0000
 371                             05_translation       0.0001 0.0000 1.0000 1.0000
 372                             06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
 373
 374 postprocessed (slucovani blizkych intervalu)
 375 0.3350 0.2051 0.9863 1.0188   2012-05-25 15:27   plagdt recall precis granul
 376                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 377                             02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
 378                             03_artificial_low    0.4541 0.3057 0.9942 1.0417
 379                             04_artificial_high   0.0008 0.0004 1.0000 1.0000
 380                             05_translation       0.0001 0.0000 1.0000 1.0000
 381                             06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
 382
 383 whitespace (uprava whitespaces)
 384 0.3353 0.2053 0.9858 1.0188   2012-05-31 17:57   plagdt recall precis granul
 385                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 386                             02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
 387                             03_artificial_low    0.4546 0.3061 0.9940 1.0417
 388                             04_artificial_high   0.0008 0.0004 1.0000 1.0000
 389                             05_translation       0.0001 0.0000 1.0000 1.0000
 390                             06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
 391
 392 gap_100: whitespace, + ve valid intervalu dovolim mezeru 100 petic misto 50
 393 0.3696 0.2305 0.9838 1.0148   2012-05-31 18:07   plagdt recall precis granul
 394                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 395                             02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
 396                             03_artificial_low    0.5423 0.3846 0.9922 1.0310
 397                             04_artificial_high   0.0058 0.0029 0.9151 1.0000
 398                             05_translation       0.0001 0.0000 1.0000 1.0000
 399                             06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
 400
 401 gap_200: whitespace, + ve valid intervalu dovolim mezeru 200 petic misto 50
 402 0.3906 0.2456 0.9769 1.0070   2012-05-31 18:09   plagdt recall precis granul
 403                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 404                             02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
 405                             03_artificial_low    0.5976 0.4346 0.9875 1.0139
 406                             04_artificial_high   0.0087 0.0044 0.9374 1.0000
 407                             05_translation       0.0001 0.0001 1.0000 1.0000
 408                             06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
 409
 410 gap_200_int_10: gap_200, + valid int. ma min. 10 petic misto 20
 411 0.4436 0.2962 0.9660 1.0308   2012-05-31 18:11   plagdt recall precis granul
 412                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 413                             02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
 414                             03_artificial_low    0.7048 0.5808 0.9873 1.0530
 415                             04_artificial_high   0.0457 0.0242 0.9762 1.0465
 416                             05_translation       0.0008 0.0004 1.0000 1.0000
 417                             06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
 418
 419 no_trans: gap_200_int_10, + nedetekovat preklady vubec, abych se vyhnul F-P
 420 0.4432 0.2959 0.9658 1.0310   2012-06-01 16:41   plagdt recall precis granul
 421                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 422                             02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
 423                             03_artificial_low    0.7045 0.5806 0.9872 1.0530
 424                             04_artificial_high   0.0457 0.0242 0.9762 1.0465
 425                             05_translation       0.0000 0.0000 0.0000 1.0000
 426                             06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
 427
 428
 429 swng_unsorted se stejnym postprocessingem jako vyse "whitespace"
 430 0.2673 0.1584 0.9281 1.0174   2012-05-31 14:20   plagdt recall precis granul
 431                             01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
 432                             02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
 433                             03_artificial_low    0.3178 0.1952 0.9954 1.0377
 434                             04_artificial_high   0.0169 0.0095 0.9581 1.1707
 435                             05_translation       0.0042 0.0028 0.0080 1.0000
 436                             06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
 437
 438 swng_sorted
 439 0.2550 0.1906 0.4067 1.0253   2012-05-30 16:07   plagdt recall precis granul
 440                             01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
 441                             02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
 442                             03_artificial_low    0.4093 0.2867 0.8093 1.0483
 443                             04_artificial_high   0.0454 0.0253 0.4371 1.0755
 444                             05_translation       0.0030 0.0019 0.0064 1.0000
 445                             06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
 446
 447 sort_susp: gap_200_int_10 + postprocessing tridim intervaly podle offsetu v susp, nikoliv v src
 448 0.4437 0.2962 0.9676 1.0308   2012-06-01 18:06   plagdt recall precis granul
 449                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 450                             02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
 451                             03_artificial_low    0.7048 0.5809 0.9871 1.0530
 452                             04_artificial_high   0.0457 0.0242 0.9762 1.0465
 453                             05_translation       0.0008 0.0004 1.0000 1.0000
 454                             06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
 455
 456 post_gap2_16000: sort_susp, + sloucit dva intervaly pokud je < 16000 znaku a mezera je jen polovina velikosti tech intervalu (bylo 4000)
 457 0.4539 0.2983 0.9642 1.0054   2012-06-01 18:09   plagdt recall precis granul
 458                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 459                             02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
 460                             03_artificial_low    0.7307 0.5883 0.9814 1.0094
 461                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 462                             05_translation       0.0008 0.0004 1.0000 1.0000
 463                             06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
 464
 465 post_gap2_32000: sort_susp, + sloucit intervaly < 32000 znaku a mezera aspon polovina velikosti
 466 0.4543 0.2986 0.9638 1.0050   2012-06-01 18:12   plagdt recall precis granul
 467                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 468                             02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
 469                             03_artificial_low    0.7315 0.5893 0.9798 1.0085
 470                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 471                             05_translation       0.0008 0.0004 1.0000 1.0000
 472                             06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
 473
 474 post_gap2_64000: sort_susp, + sloucit intervaly < 32000 znaku a mezera aspon pol
 475 ovina velikosti
 476 0.4543 0.2988 0.9616 1.0050   2012-06-01 18:21   plagdt recall precis granul
 477                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 478                             02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
 479                             03_artificial_low    0.7316 0.5901 0.9782 1.0085
 480                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 481                             05_translation       0.0008 0.0004 1.0000 1.0000
 482                             06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
 483
 484 post_gap1_2000: post_gap2_32000, + spojit bez podminek veci co maji mezeru pod 2000 (bylo 600)
 485 0.4543 0.2986 0.9635 1.0050   2012-06-01 18:29   plagdt recall precis granul
 486                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 487                             02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
 488                             03_artificial_low    0.7315 0.5895 0.9794 1.0085
 489                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 490                             05_translation       0.0008 0.0004 1.0000 1.0000
 491                             06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
 492
 493