yenya-detailed.tex

   1 \section{Detailed Document Comparison}~\label{yenya}
   2
   3 \label{detailed}
   4
   5 The detailed comparison task of PAN 2012 consisted in a comparison
   6 of given document pairs, with the expected output being the annotation of
   7 similarities found between these documents.
   8 The submitted program has been run in a controlled environment
   9 separately for each document pair, without the possibility of keeping any
  10 data between runs.
  11
  12 %In this section, we describe our approach in the detailed comparison
  13 %task. The rest of this section is organized as follows: in the next
  14 %subsection, we summarise the differences from our previous approach.
  15 %In subsection \ref{sec-alg-overview}, we give an overview of our approach.
  16 %TODO napsat jak to nakonec bude.
  17
  18 \subsection{Differences Against PAN 2010}
  19
  20 Our approach in this task
  21 is loosely based on the approach we have used in PAN 2010 \cite{Kasprzak2010}.
  22 The main difference is that instead of looking for similarities of
  23 one type (for PAN 2010, we have used word 5-grams),
  24 we have developed a method of evaluating multiple types of similarities
  25 (we call them {\it common features}) of different properties, such as
  26 density and length.
  27
  28 As a proof of concept, we have used two types of common features: word
  29 5-grams and stop-word 8-grams, the later being based on the method described in
  30 \cite{stamatatos2011plagiarism}.
  31
  32 In addition to the above, we have made several minor improvements to the
  33 algorithm, such as parameter tuning and improving the detections
  34 merging in the post-processing stage.
  35
  36 \subsection{Algorithm Overview}
  37 \label{sec-alg-overview}
  38
  39 The algorithm evaluates the document pair in several stages:
  40
  41 \begin{itemize}
  42 \item tokenizing both the suspicious and source documents
  43 \item forming {\it features} from some tokens
  44 \item discovering {\it common features}
  45 \item making {\it valid intervals} from common features
  46 \item postprocessing
  47 \end{itemize}
  48
  49 \subsection{Tokenization}
  50
  51 We tokenize the document into words, where word is a sequence of one
  52 or more characters of the {\it Letter} Unicode class.
  53 With each word, two additional attributes, needed for further processing,
  54 are associated: the offset where the word begins, and the word length.
  55
  56 The offset where the word begins is not necessarily the first letter character
  57 of the word itself. We have discovered that in the training corpus,
  58 some plagiarized passages were annotated including the preceding
  59 non-letter characters. We have used the following heuristics to add
  60 parts of the inter-word gap to the previous, or the next adjacent word:
  61
  62 \begin{itemize}
  63 \item when the inter-word gap contains interpunction (any of the dot,
  64 semicolon, colon, comma, exclamation mark, question mark, or quotes),
  65 add the characters up to, and including the interpunction, to the previous
  66 word, ignore the space character(s) after the interpunction, and add
  67 the rest to tne next word.
  68 \item otherwise, when the inter-word gap contains newline, add the character
  69 before the first newline to the previous word, ignore the first newline
  70 character, and add the rest to the next word.
  71 \item otherwise, ignore the inter-word gap characters altogether.
  72 \end{itemize}
  73
  74 When the detection program has been presented three different
  75 files instead of two (meaning the third one is machine-translated
  76 version of the second one), we have tokenized the translated document instead
  77 of the source one. We have make use of the line-by-line alignment of the
  78 source and machine-translated documents to transform the word offsets
  79 and lengths in the translated document to the terms of the source document.
  80
  81 \subsection{Features}
  82
  83 We have used features of two types:
  84
  85 \begin{itemize}
  86 \item Lexicographically sorted word 5-grams, formed of words at least
  87 three characters long, and
  88 \item unsorted stop-word 8-grams, formed from 50 most frequent words in English,
  89 as described in \cite{stamatatos2011plagiarism}. We have further ignored
  90 the 8-grams, formed solely from the six most frequent English words
  91 (the, of, and, a, in, to), or the possessive {\it'{}s}.
  92 \end{itemize}
  93
  94 We have represented each feature with the 32 highest-order bits of its
  95 MD5 digest. This is only a performance optimization, targeted for
  96 larger systems. The number of features in a document pair is several orders
  97 of magnitude lower than $2^{32}$, so the probability of hash function
  98 collision is low. For pair-wise comparison, it would be feasible to compare
  99 the features directly instead of their MD5 sums.
 100
 101 Each feature has also the offset and length attributes.
 102 Offset is taken as the offset of the first word in a given feature,
 103 and length is the offset of the last character in a given feature
 104 minus the offset of the feature itself.
 105
 106 \subsection{Common Features}
 107
 108 For further processing, we have taken into account only the features
 109 present both in source and suspicious document. For each such
 110 {\it common feature}, we have created the list of
 111 $(\makebox{offset}, \makebox{length})$ pairs for the source document,
 112 and a similar list for the suspicious document. Note that a given feature
 113 can occur multiple times both in source and suspicious document.
 114
 115 \subsection{Valid Intervals}
 116
 117 To detect a plagiarized passage, we need to find a set of common features,
 118 which map to a dense-enough interval both in the source and suspicious
 119 document. In our previous work, we have presented the algorithm
 120 for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
 121 A similar approach is used also in \cite{stamatatos2011plagiarism}.
 122 Both of these algorithms use the features of a single type, which
 123 allows to use the ordering of features as a measure of distance.
 124
 125 When we use features of different types, there is no natural ordering
 126 of them: for example, a stop-word 8-gram can span multiple sentences,
 127 which can contain several word 5-grams. The assumption of both of the
 128 above algorithms, that the last character of the previous feature
 129 is before the last character of the current feature, is broken.
 130
 131 We have modified the algorithm for computing valid intervals with
 132 multi-feature detection to use character offsets
 133 only instead of feature order numbers. We have used valid intervals
 134 consisting of at least 4 common features, with the maximum allowed gap
 135 inside the interval (characters not belonging to any common feature
 136 of a given valid interval) set to 4000 characters.
 137
 138 \subsection{Postprocessing}
 139 \label{postprocessing}
 140
 141 In the postprocessing phase, we took the resulting valid intervals,
 142 and made attempt to further improve the results. We have firstly
 143 removed overlaps: if both overlapping intervals were
 144 shorter than 300 characters, we have removed both of them. Otherwise, we
 145 kept the longer detection (longer in terms of length in the suspicious document).
 146
 147 We have then joined the adjacent valid intervals into one detection,
 148 if at least one of the following criteria has been met:
 149 \begin{itemize}
 150 \item the gap between the intervals contained at least 4 common features,
 151 and it contained at least one feature per 10,000
 152 characters\footnote{we have computed the length of the gap as the number
 153 of characters between the detections in the source document, plus the
 154 number of charaters between the detections in the suspicious document.}, or
 155 \item the gap was smaller than 30,000 characters and the size of the adjacent
 156 valid intervals was at least twice as big as the gap between them, or
 157 \item the gap was smaller than 30,000 characters and the number of common
 158 features per character in the adjacent interval was not more than three times
 159 bigger than number of features per character in the possible joined interval.
 160 \end{itemize}
 161
 162 \subsection{Results}
 163
 164 These parameters were fine-tuned to achieve the best results on the training
 165 corpus. With these parameters, our algorithm got the total plagdet score
 166 of 0.7288 on the training corpus. The details of the performance of
 167 our algorithm are presented in Table \ref{table-final}.
 168 In the PAN 2012 competition, we have acchieved the plagdet score
 169 of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000.
 170
 171 \begin{table}
 172 \begin{center}
 173 \begin{tabular}{|l|r|r|r|r|}
 174 \hline
 175 &plagdet&recall&precision&granularity\\
 176 \hline
 177 whole corpus&0.7288&0.5994&0.9306&1.0007\\
 178 \hline
 179 01-no-plagiarism    &0.0000&0.0000&0.0000&1.0000\\
 180 02-no-obfuscation   &0.9476&0.9627&0.9330&1.0000\\
 181 03-artificial-low   &0.8726&0.8099&0.9477&1.0013\\
 182 04-artificial-high  &0.3649&0.2255&0.9562&1.0000\\
 183 05-translation      &0.7610&0.6662&0.8884&1.0008\\
 184 06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
 185 \hline
 186 \end{tabular}
 187 \end{center}
 188 \caption{Performance on the training corpus}
 189 \label{table-final}
 190 \end{table}
 191
 192 \subsection{Other Approaches Explored}
 193
 194 There are several other approaches we have evaluated, but which were
 195 omitted from our final submission for various reasons. We think mentioning
 196 them here is worthwhile nevertheless.
 197
 198 \subsubsection{Intrinsic Plagiarism Detection}
 199
 200 We tested the approach based on character $n$-gram profiles of the interval of
 201 the fixed size (in terms of $n$-grams), and their differences to the
 202 profile of the whole document \cite{pan09stamatatos}. We have further
 203 enhanced the approach with using gaussian smoothing of the style-change
 204 function \cite{Kasprzak2010}. For PAN 2012, we made further improvements
 205 to the algorithm, resulting in more stable style change function in
 206 both short and long documents.
 207
 208 We tried to use the results of the intrinsic plagiarism detection
 209 as hint for the post-processing phase, allowing to merge larger
 210 intervals, if they both belong to the same passage, detected by
 211 the intrinsic detector. This approach did not provide improvement
 212 when compared to the static gap limits, as described in Section
 213 \ref{postprocessing}, so we have omitted it from our final submission.
 214
 215 %\subsubsection{Language Detection}
 216 %
 217 %For language detection, we used the $n$-gram based categorization \cite{ngram}.
 218 %We computed the language profiles from the source documents of the
 219 %training corpus (using the annotations from the corpus itself). The result
 220 %of this approach was better than using the stopwords-based detection we have
 221 %used in PAN 2010. However, there were still mis-detected documents,
 222 %mainly the long lists of surnames and other tabular data. We added
 223 %an ad-hoc fix, where for documents having their profile too distant from all of
 224 %English, German, and Spanish profiles, we declared them to be in English.
 225
 226 \subsubsection{Cross-lingual Plagiarism Detection}
 227
 228 For cross-lingual plagiarism detection, our aim was to use the public
 229 interface to Google translate if possible, and use the resulting document
 230 as the source for standard intra-lingual detector.
 231 Should the translation service not be available, we wanted
 232 to use the fall-back strategy of translating isolated words only,
 233 with the additional exact matching of longer words (we have used words with
 234 5 characters or longer).
 235 We have supposed that these longer words can be names or specialized terms,
 236 present in both languages.
 237
 238 We used dictionaries from several sources, for example
 239 {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
 240 {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
 241 and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
 242
 243 In the final submission, we simply used the machine translated texts,
 244 which were provided to the running program from the surrounding environment.
 245
 246
 247 \subsection{Further discussion}
 248
 249 From our previous PAN submissions, we knew that the precision of our
 250 system was good, and because of the way how the final score is computed, we
 251 wanted to exchange a bit worse precision for better recall and granularity.
 252 So we pushed the parameters towards detecting more plagiarized passages,
 253 even when the number of common features was not especially high.
 254
 255 \subsubsection{Plagdet score}
 256
 257 Our results from tuning the parameters show that the plagdet score\cite{potthastfamework}
 258 is not a good measure for comparing the plagiarism detection systems:
 259 for example, the gap of 30,000 characters, described in Section \ref{postprocessing},
 260 can easily mean several pages of text. And still the system with this
 261 parameter set so high resulted in better plagdet score.
 262
 263 Another problem of plagdet can be
 264 seen in the 01-no-plagiarism part of the training corpus: the border
 265 between the perfect score 1 and the score 0 is a single false-positive
 266 detection. Plagdet does not distinguish between the system reporting this
 267 single false-positive, and the system reporting the whole data as plagiarized.
 268 Both get the score 0. However, our experience from real-world plagiarism detection systems show that
 269 the plagiarized documents are in a clear minority, so the performance of
 270 the detection system on non-plagiarized documents is very important.
 271
 272 \subsubsection{Performance Notes}
 273
 274 We consider comparing the CPU-time performance of PAN 2012 submissions almost
 275 meaningless, because any sane system would precompute features for all
 276 documents in a given set of suspicious and source documents, and use the
 277 results for pair-wise comparison, expecting that any document will take
 278 part in more than one pair.
 279
 280 Also, the pair-wise comparison without caching any intermediate results
 281 lead to worse overall performance: in our PAN 2010 submission, one of the
 282 post-processing steps was to remove all the overlapping detections
 283 from a given suspicious documents, when these detections were from different
 284 source doucments, and were short enough. This removed many false-positives
 285 and improved the precision of our system. This kind of heuristics was
 286 not possible in PAN 2012.
 287
 288 As for the performance of our system, we split the task into two parts:
 289 1. finding the common features, and 2. computing valid intervals and
 290 postprocessing. The first part is more CPU intensive, and the results
 291 can be cached. The second part is fast enough to allow us to evaluate
 292 many combinations of parameters.
 293
 294 We did our development on a machine with four six-core AMD 8139 CPUs
 295 (2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
 296 on this host, and the second phase took 14 seconds. Computing the
 297 plagdet score using the official script in Python took between 120 and
 298 180 seconds, as there is no parallelism in this script.
 299
 300 When we have tried to use intrinsic plagiarism detection and language
 301 detection, the first phase took about 12500 seconds. Thus omitting these
 302 featurs clearly provided huge performance improvement.
 303
 304 The code was written in Perl, and had about 669 lines of code,
 305 not counting comments and blank lines.
 306
 307 \endinput
 308
 309 - hranice mezi pasazema nekdy zahrnovala whitespace a nekdy ne.
 310
 311 Diskuse plagdet:
 312 - uzivatele chteji "aby odevzdej ukazovalo 0\% shody", nezajima je
 313         co to cislo znamena
 314 - nezalezi na hranicich detekovane pasaze
 315 - false-positives jsou daleko horsi
 316 - granularita je zlo
 317
 318 Finalni vysledky nad testovacim korpusem:
 319
 320 0.7288 0.5994 0.9306 1.0007   2012-06-16 02:23   plagdt recall precis granul
 321                             01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
 322                             02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
 323                             03-artificial-low    0.8726 0.8099 0.9477 1.0013
 324                             04-artificial-high   0.3649 0.2255 0.9562 1.0000
 325                             05-translation       0.7610 0.6662 0.8884 1.0008
 326                             06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
 327
 328 Vysledky nad souteznimi daty:
 329 plagdet         precision       recall          granularity
 330 0.6826726       0.8931670       0.5524708       1.0000000
 331
 332 Run-time:
 333 12500 sekund tokenizace vcetne sc a detekce jazyka
 334 2500 sekund bez sc a detekce jazyka
 335 14 sekund vyhodnoceni valid intervalu a postprocessing
 336
 337
 338 TODO:
 339 - hranici podle hustoty matchovani
 340 - xml tridit podle this_offset
 341
 342 Tady je obsah souboru JOURNAL - jak jsem meril nektera vylepseni:
 343 =================================================================
 344 baseline.py
 345 0.1250 0.1259 0.9783 2.4460   2012-05-03 06:02   plagdt recall precis granul
 346                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 347                             02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
 348                             03_artificial_low    0.1006 0.1118 0.9979 2.9974
 349                             04_artificial_high   0.0054 0.0029 0.9991 1.0778
 350                             05_translation       0.0003 0.0002 1.0000 1.2143
 351                             06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
 352
 353 valid_intervals bez postprocessingu (takhle jsem to poprve odevzdal)
 354 0.3183 0.2034 0.9883 1.0850   2012-05-25 15:25   plagdt recall precis granul
 355                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 356                             02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
 357                             03_artificial_low    0.4127 0.3006 0.9975 1.1724
 358                             04_artificial_high   0.0008 0.0004 1.0000 1.0000
 359                             05_translation       0.0001 0.0000 1.0000 1.0000
 360                             06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
 361
 362 postprocessed (slucovani blizkych intervalu)
 363 0.3350 0.2051 0.9863 1.0188   2012-05-25 15:27   plagdt recall precis granul
 364                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 365                             02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
 366                             03_artificial_low    0.4541 0.3057 0.9942 1.0417
 367                             04_artificial_high   0.0008 0.0004 1.0000 1.0000
 368                             05_translation       0.0001 0.0000 1.0000 1.0000
 369                             06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
 370
 371 whitespace (uprava whitespaces)
 372 0.3353 0.2053 0.9858 1.0188   2012-05-31 17:57   plagdt recall precis granul
 373                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 374                             02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
 375                             03_artificial_low    0.4546 0.3061 0.9940 1.0417
 376                             04_artificial_high   0.0008 0.0004 1.0000 1.0000
 377                             05_translation       0.0001 0.0000 1.0000 1.0000
 378                             06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
 379
 380 gap_100: whitespace, + ve valid intervalu dovolim mezeru 100 petic misto 50
 381 0.3696 0.2305 0.9838 1.0148   2012-05-31 18:07   plagdt recall precis granul
 382                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 383                             02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
 384                             03_artificial_low    0.5423 0.3846 0.9922 1.0310
 385                             04_artificial_high   0.0058 0.0029 0.9151 1.0000
 386                             05_translation       0.0001 0.0000 1.0000 1.0000
 387                             06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
 388
 389 gap_200: whitespace, + ve valid intervalu dovolim mezeru 200 petic misto 50
 390 0.3906 0.2456 0.9769 1.0070   2012-05-31 18:09   plagdt recall precis granul
 391                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 392                             02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
 393                             03_artificial_low    0.5976 0.4346 0.9875 1.0139
 394                             04_artificial_high   0.0087 0.0044 0.9374 1.0000
 395                             05_translation       0.0001 0.0001 1.0000 1.0000
 396                             06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
 397
 398 gap_200_int_10: gap_200, + valid int. ma min. 10 petic misto 20
 399 0.4436 0.2962 0.9660 1.0308   2012-05-31 18:11   plagdt recall precis granul
 400                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 401                             02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
 402                             03_artificial_low    0.7048 0.5808 0.9873 1.0530
 403                             04_artificial_high   0.0457 0.0242 0.9762 1.0465
 404                             05_translation       0.0008 0.0004 1.0000 1.0000
 405                             06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
 406
 407 no_trans: gap_200_int_10, + nedetekovat preklady vubec, abych se vyhnul F-P
 408 0.4432 0.2959 0.9658 1.0310   2012-06-01 16:41   plagdt recall precis granul
 409                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 410                             02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
 411                             03_artificial_low    0.7045 0.5806 0.9872 1.0530
 412                             04_artificial_high   0.0457 0.0242 0.9762 1.0465
 413                             05_translation       0.0000 0.0000 0.0000 1.0000
 414                             06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
 415
 416
 417 swng_unsorted se stejnym postprocessingem jako vyse "whitespace"
 418 0.2673 0.1584 0.9281 1.0174   2012-05-31 14:20   plagdt recall precis granul
 419                             01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
 420                             02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
 421                             03_artificial_low    0.3178 0.1952 0.9954 1.0377
 422                             04_artificial_high   0.0169 0.0095 0.9581 1.1707
 423                             05_translation       0.0042 0.0028 0.0080 1.0000
 424                             06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
 425
 426 swng_sorted
 427 0.2550 0.1906 0.4067 1.0253   2012-05-30 16:07   plagdt recall precis granul
 428                             01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
 429                             02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
 430                             03_artificial_low    0.4093 0.2867 0.8093 1.0483
 431                             04_artificial_high   0.0454 0.0253 0.4371 1.0755
 432                             05_translation       0.0030 0.0019 0.0064 1.0000
 433                             06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
 434
 435 sort_susp: gap_200_int_10 + postprocessing tridim intervaly podle offsetu v susp, nikoliv v src
 436 0.4437 0.2962 0.9676 1.0308   2012-06-01 18:06   plagdt recall precis granul
 437                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 438                             02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
 439                             03_artificial_low    0.7048 0.5809 0.9871 1.0530
 440                             04_artificial_high   0.0457 0.0242 0.9762 1.0465
 441                             05_translation       0.0008 0.0004 1.0000 1.0000
 442                             06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
 443
 444 post_gap2_16000: sort_susp, + sloucit dva intervaly pokud je < 16000 znaku a mezera je jen polovina velikosti tech intervalu (bylo 4000)
 445 0.4539 0.2983 0.9642 1.0054   2012-06-01 18:09   plagdt recall precis granul
 446                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 447                             02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
 448                             03_artificial_low    0.7307 0.5883 0.9814 1.0094
 449                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 450                             05_translation       0.0008 0.0004 1.0000 1.0000
 451                             06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
 452
 453 post_gap2_32000: sort_susp, + sloucit intervaly < 32000 znaku a mezera aspon polovina velikosti
 454 0.4543 0.2986 0.9638 1.0050   2012-06-01 18:12   plagdt recall precis granul
 455                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 456                             02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
 457                             03_artificial_low    0.7315 0.5893 0.9798 1.0085
 458                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 459                             05_translation       0.0008 0.0004 1.0000 1.0000
 460                             06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
 461
 462 post_gap2_64000: sort_susp, + sloucit intervaly < 32000 znaku a mezera aspon pol
 463 ovina velikosti
 464 0.4543 0.2988 0.9616 1.0050   2012-06-01 18:21   plagdt recall precis granul
 465                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 466                             02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
 467                             03_artificial_low    0.7316 0.5901 0.9782 1.0085
 468                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 469                             05_translation       0.0008 0.0004 1.0000 1.0000
 470                             06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
 471
 472 post_gap1_2000: post_gap2_32000, + spojit bez podminek veci co maji mezeru pod 2000 (bylo 600)
 473 0.4543 0.2986 0.9635 1.0050   2012-06-01 18:29   plagdt recall precis granul
 474                             01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
 475                             02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
 476                             03_artificial_low    0.7315 0.5895 0.9794 1.0085
 477                             04_artificial_high   0.0480 0.0247 0.9816 1.0078
 478                             05_translation       0.0008 0.0004 1.0000 1.0000
 479                             06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
 480
 481