\section{Detailed Document Comparison}\label{yenya}
The detailed comparison task of PAN 2012 consisted of a comparison
of given document pairs, with the expected output being the annotation of
the similarities found between these documents.
The submitted program ran in a controlled environment,
separately for each document pair, without the possibility of keeping any
cached data between runs.
\subsection{Differences from PAN 2010}
Our approach to this task
is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
The main difference is that instead of looking for similarities of
a single type (for PAN 2010, we used word 5-grams),
we developed a method for evaluating multiple types of similarities
(we call them {\it common features}) with different properties.
As a proof of concept, we used two types of common features: word
5-grams and stop word 8-grams, the latter being based on the method described in
\cite{stamatatos2011plagiarism}.
In addition to the above, we made several minor improvements to the
algorithm, such as tuning the parameters and improving the merging of
detections in the post-processing stage.
\subsection{Algorithm Overview}
\label{sec-alg-overview}
The algorithm evaluates the document pair in several stages:
\begin{itemize}
\item tokenizing both the suspicious and source documents,
\item forming {\it features} from some tokens,
\item discovering {\it common features},
\item making {\it valid intervals} from the common features,
\item postprocessing the detected intervals.
\end{itemize}
\subsection{Tokenization}
We tokenize the document into words, where a word is a sequence of one
or more characters of the {\it Letter} Unicode class.
With each word, two additional attributes, needed for further processing,
are associated: the offset at which the word begins, and the word length.
The offset at which the word begins is not necessarily the position of the
first letter character of the word itself. We discovered that in the training
corpus, some plagiarized passages were annotated including the preceding
non-letter characters. We therefore used the following heuristics to add
parts of the inter-word gap to the previous or the next adjacent word:
\begin{itemize}
\item When the inter-word gap contains punctuation (any of a dot,
semicolon, colon, comma, exclamation mark, question mark, or quotes):
\begin{itemize}
\item add the characters up to and including the punctuation character
to the previous word,
\item ignore the space character(s) after the punctuation,
\item add the rest to the next word.
\end{itemize}
\item Otherwise, when the inter-word gap contains a newline:
\begin{itemize}
\item add the characters before the first newline to the previous word,
\item ignore the first newline character,
\item add the rest to the next word.
\end{itemize}
\item Otherwise: ignore the inter-word gap characters altogether.
\end{itemize}
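The gap-assignment heuristics above can be sketched in Python (an illustrative simplification, not the submitted Perl implementation; the letter-matching regular expression and the exact punctuation set are our assumptions):

```python
import re

# Punctuation characters that attach part of the gap to the previous word
# (an assumption matching the list above).
PUNCT = set(".;:,!?'\"")

def tokenize(text):
    """Return (word, offset, length) triples with the gap heuristics applied."""
    matches = list(re.finditer(r"[^\W\d_]+", text))  # runs of Unicode letters
    words = []
    next_start = None  # offset override when gap characters move to the next word
    for i, m in enumerate(matches):
        start = m.start() if next_start is None else next_start
        end = m.end()
        gap_lo = m.end()
        gap_hi = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        gap = text[gap_lo:gap_hi]
        next_start = None
        j = next((k for k, c in enumerate(gap) if c in PUNCT), None)
        if j is not None:
            end = gap_lo + j + 1           # include the punctuation character
            k = j + 1
            while k < len(gap) and gap[k] == " ":
                k += 1                      # ignore spaces after the punctuation
            if gap_lo + k < gap_hi:
                next_start = gap_lo + k     # the rest goes to the next word
        elif "\n" in gap:
            nl = gap.index("\n")
            end = gap_lo + nl               # characters before the first newline
            if gap_lo + nl + 1 < gap_hi:
                next_start = gap_lo + nl + 1
        # otherwise: the inter-word gap is ignored altogether
        words.append((m.group(), start, end - start))
    return words
```

For example, in `"Hello, world"` the comma is appended to `Hello` (length 6), while the space after it is ignored.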
When the detection program was given three files instead of two
(meaning that the third one is a machine-translated version of the second
one), we tokenized the translated document instead of the source one.
We then used the line-by-line alignment of the source and machine-translated
documents to transform the word offsets and lengths in the translated
document into the terms of the source document.
\subsection{Features}

We used features of two types:
\begin{itemize}
\item Lexicographically sorted word 5-grams, formed from words at least
three characters long.
\item Unsorted stop word 8-grams, formed from the 50 most frequent words in
English, as described in \cite{stamatatos2011plagiarism}. We further ignored
the 8-grams formed solely from the six most frequent English words
({\it the}, {\it of}, {\it and}, {\it a}, {\it in}, {\it to}) or the possessive {\it'{}s}.
\end{itemize}
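Forming the two feature types can be sketched as follows (a Python illustration under our simplifying assumptions; `stopwords` stands for the 50-word list, which is not reproduced here, and the length convention counts from the feature's first character to the end of its last word):

```python
# Sketch of forming the two feature types from (word, offset, length) tokens.
# TOP6 holds the six most frequent English words (plus the possessive 's),
# whose pure 8-grams are ignored.
TOP6 = {"the", "of", "and", "a", "in", "to", "'s"}

def feature(window, key):
    """Build a (key, offset, length) feature from a window of tokens."""
    offset = window[0][1]
    last_off, last_len = window[-1][1], window[-1][2]
    return (key, offset, last_off + last_len - offset)

def word_5grams(tokens):
    """Lexicographically sorted word 5-grams of words >= 3 characters."""
    long_words = [t for t in tokens if len(t[0]) >= 3]
    return [feature(long_words[i:i + 5],
                    " ".join(sorted(w.lower() for w, _, _ in long_words[i:i + 5])))
            for i in range(len(long_words) - 4)]

def stopword_8grams(tokens, stopwords):
    """Unsorted stop word 8-grams, skipping those made only of TOP6 words."""
    stops = [t for t in tokens if t[0].lower() in stopwords]
    feats = []
    for i in range(len(stops) - 7):
        window = stops[i:i + 8]
        if all(w.lower() in TOP6 for w, _, _ in window):
            continue  # ignore 8-grams of only the six most frequent words
        feats.append(feature(window, " ".join(w.lower() for w, _, _ in window)))
    return feats
```

Sorting the words inside each 5-gram makes the feature invariant to local word reordering, while the stop word 8-grams are kept in document order.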
We represented each feature by the 32 highest-order bits of its
MD5 digest. This is only a performance optimization, targeted at
larger systems. The number of features in a document pair is several orders
of magnitude lower than $2^{32}$, so the probability of a hash function
collision is low. For a pair-wise comparison, it would be feasible to compare
the features directly instead of their MD5 sums.
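Taking the top 32 bits of the MD5 digest might look like this (a minimal Python sketch of the representation described above):

```python
import hashlib

def feature_hash(feature_key: str) -> int:
    """Represent a feature by the 32 highest-order bits of its MD5 digest."""
    digest = hashlib.md5(feature_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes = top 32 bits
```

The resulting 32-bit integers are cheap to store and compare, and with far fewer than $2^{32}$ features per document pair, collisions remain unlikely.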
Each feature also has two attributes: offset and length.
The offset is taken as the offset of the first word in the given feature,
and the length is the offset of the last character in the feature
minus the offset of the feature itself.
\subsection{Common Features}

For further processing, we took into account only the features
present in both the source and the suspicious document. For each such
{\it common feature}, we created a list of
$(\makebox{offset}, \makebox{length})$ pairs for the source document,
and a similar list for the suspicious document. Note that a given feature
can occur multiple times in both the source and the suspicious document.
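Collecting the common features with their occurrence lists can be sketched as (Python illustration; the features are (key, offset, length) triples as above):

```python
from collections import defaultdict

def occurrences(features):
    """Map feature key -> list of (offset, length) occurrences."""
    occ = defaultdict(list)
    for key, offset, length in features:
        occ[key].append((offset, length))
    return occ

def common_features(src_features, susp_features):
    """Features present in both documents, with both occurrence lists.

    A feature may occur multiple times on either side, so each key maps
    to (source occurrences, suspicious occurrences).
    """
    src_occ = occurrences(src_features)
    susp_occ = occurrences(susp_features)
    return {key: (src_occ[key], susp_occ[key])
            for key in src_occ.keys() & susp_occ.keys()}
```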
\subsection{Valid Intervals}

To detect a plagiarized passage, we need to find a set of common features
which maps to a dense-enough interval in both the source and the suspicious
document. In our previous work, we described an algorithm
for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
A similar approach is also used in \cite{stamatatos2011plagiarism}.
Both of these algorithms use features of a single type, which
makes it possible to use the ordering of the features as a measure of distance.
When features of different types are used, there is no natural ordering
of them: for example, a stop word 8-gram can span multiple sentences,
which can contain several word 5-grams. The assumption of both of the
above algorithms, namely that the last character of the previous feature
lies before the last character of the current feature, is broken.
We therefore modified the valid-interval algorithm for
multi-feature detection to use character offsets
only, instead of feature order numbers. We used valid intervals
consisting of at least 4 common features, with the maximum allowed gap
inside the interval (characters not belonging to any common feature
of the given valid interval) set to 4,000 characters.
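A one-document simplification of this sweep can be sketched as follows (the submitted algorithm enforces the density constraint in the source and the suspicious document simultaneously, and measures gaps as characters not covered by any feature; here we approximate the gap as the distance from the end of the previous feature):

```python
# Constants from the text above: an interval needs at least 4 common
# features, and gaps inside it may not exceed 4,000 characters.
MIN_FEATURES = 4
MAX_GAP = 4000

def valid_intervals(feature_occurrences):
    """Group (offset, length) occurrences of common features into intervals.

    Returns (start, end) character intervals containing at least
    MIN_FEATURES features with no internal gap larger than MAX_GAP.
    """
    occs = sorted(feature_occurrences)
    intervals, current = [], []
    for off, length in occs:
        if current and off - (current[-1][0] + current[-1][1]) > MAX_GAP:
            # Gap too large: close the current run if it is dense enough.
            if len(current) >= MIN_FEATURES:
                intervals.append((current[0][0],
                                  current[-1][0] + current[-1][1]))
            current = []
        current.append((off, length))
    if len(current) >= MIN_FEATURES:
        intervals.append((current[0][0], current[-1][0] + current[-1][1]))
    return intervals
```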
\subsection{Postprocessing}
\label{postprocessing}

In the postprocessing phase, we took the resulting valid intervals
and attempted to further improve the results. We first
removed overlaps: if both overlapping intervals were
shorter than 300 characters, we removed both of them; otherwise, we
kept the longer detection (longer in terms of its length in the suspicious document).
We then joined adjacent valid intervals into one detection
if at least one of the following criteria was met:
\begin{itemize}
\item the gap between the intervals contained at least 4 common features,
and it contained at least one feature per 10,000
characters\footnote{We computed the length of the gap as the number
of characters between the detections in the source document, plus the
number of characters between the detections in the suspicious document.};
\item the gap was smaller than 30,000 characters and the size of the adjacent
valid intervals was at least twice the size of the gap between them;
\item the gap was smaller than 30,000 characters and the number of common
features per character in the adjacent intervals was not more than three times
the number of features per character in the possible joined interval.
\end{itemize}
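One possible reading of these three criteria as a predicate is sketched below (a Python illustration; whether the sizes and feature densities are taken per adjacent interval or combined over both is our assumption):

```python
def should_join(gap_chars, gap_features, size_a, size_b, feats_a, feats_b):
    """Decide whether two adjacent detections should be merged.

    gap_chars: characters between the detections (source + suspicious side);
    gap_features: common features inside the gap;
    size_a/size_b, feats_a/feats_b: sizes and feature counts of the intervals.
    """
    # 1. The gap contains >= 4 common features, at least one per 10,000 chars.
    if gap_features >= 4 and gap_features * 10000 >= gap_chars:
        return True
    # 2. The gap is < 30,000 chars and the adjacent intervals are together
    #    at least twice as big as the gap between them.
    if gap_chars < 30000 and size_a + size_b >= 2 * gap_chars:
        return True
    # 3. The gap is < 30,000 chars and the feature density of the adjacent
    #    intervals is at most three times the density of the joined interval.
    joined_density = (feats_a + feats_b + gap_features) / (
        size_a + size_b + gap_chars)
    adjacent_density = (feats_a + feats_b) / (size_a + size_b)
    if gap_chars < 30000 and adjacent_density <= 3 * joined_density:
        return True
    return False
```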
These parameters were fine-tuned to achieve the best results on the training
corpus. With these parameters, our algorithm reached a total plagdet score
of 0.7288 on the training corpus. The details of the performance of
our algorithm are presented in Table \ref{table-final}.
In the PAN 2012 competition, we achieved a plagdet score
of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000.
\begin{table}
\begin{center}
\begin{tabular}{|l|r|r|r|r|}
\hline
&plagdet&recall&precision&granularity\\
\hline
whole corpus&0.7288&0.5994&0.9306&1.0007\\
\hline
01-no-plagiarism &0.0000&0.0000&0.0000&1.0000\\
02-no-obfuscation &0.9476&0.9627&0.9330&1.0000\\
03-artificial-low &0.8726&0.8099&0.9477&1.0013\\
04-artificial-high &0.3649&0.2255&0.9562&1.0000\\
05-translation &0.7610&0.6662&0.8884&1.0008\\
06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
\hline
\end{tabular}
\end{center}
\caption{Performance on the training corpus}
\label{table-final}
\end{table}
\subsection{Other Approaches Explored}

We evaluated several other approaches which were
omitted from our final submission for various reasons. We nevertheless
consider them worth mentioning here:
\subsubsection{Intrinsic Plagiarism Detection}

We tested an approach based on character $n$-gram profiles of intervals of
a fixed size (in terms of $n$-grams), and their differences to the
profile of the whole document \cite{pan09stamatatos}. We further
enhanced the approach by applying Gaussian smoothing to the style-change
function \cite{Kasprzak2010}. For PAN 2012, we made additional improvements
to the algorithm, resulting in a more stable style-change function for
both short and long documents.
We tried to use the results of the intrinsic plagiarism detection
as a hint for the post-processing phase, allowing larger
intervals to be merged if they both belong to the same passage detected by
the intrinsic detector. This approach provided no improvement
over the static gap limits described in Section
\ref{postprocessing}, so we omitted it from our final submission.
\subsubsection{Cross-lingual Plagiarism Detection}

For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate\footnote{\url{http://translate.google.com/}} where possible, and to use the resulting document
as the source for the standard intra-lingual detector.
Should the translation service be unavailable, we wanted
to use the fall-back strategy of translating isolated words only,
with additional exact matching of longer words (we used words
5 characters long or longer).
We supposed that these longer words could be names or specialized terms,
present in both languages.
We used dictionaries from several sources, for example
{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
{\it OmegaWiki}\footnote{\url{http://www.omegawiki.org/}},
and {\it Wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
In the final submission, we simply used the machine-translated texts,
which were provided to the running program by the surrounding environment.
\subsection{Further Discussion}

From our previous PAN submissions, we knew that the precision of our
system was good, and because of the way the final score is computed, we
wanted to trade slightly worse precision for better recall and granularity.
We therefore pushed the parameters towards detecting more plagiarized passages,
even when the number of common features was not especially high.
\subsubsection{Plagdet Score}

Our results from tuning the parameters show that the plagdet score \cite{potthastframework}
is not a good measure for comparing plagiarism detection systems:
for example, the gap of 30,000 characters, described in Section \ref{postprocessing},
can easily mean several pages of text, and yet the system with this
parameter set so high yielded a better plagdet score.
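For context, the plagdet score of \cite{potthastframework} combines the $F_1$ measure of the detections with their granularity:
\[
\mathrm{plagdet}(S, R) = \frac{F_1(S, R)}{\log_2\bigl(1 + \mathrm{gran}(S, R)\bigr)},
\]
so joining adjacent detections lowers the granularity penalty, which explains why even a very large merging gap can improve the overall score.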
Another problem with plagdet can be
seen in the 01-no-plagiarism part of the training corpus: the border
between the perfect score of 1 and a score of 0 is a single false-positive
detection. Plagdet does not distinguish between a system reporting this
single false positive and a system reporting the whole data as plagiarized:
both get a score of 0. However, our experience with real-world plagiarism
detection systems shows that plagiarized documents are in a clear minority,
so the performance of a detection system on non-plagiarized documents is
very important.
\subsubsection{Performance Notes}

We consider comparing the CPU-time performance of the PAN 2012 submissions almost
meaningless, because any sane system would precompute the features for all
documents in a given set of suspicious and source documents, and use these
results for the pair-wise comparison, expecting that each document will take
part in more than one pair.
Moreover, pair-wise comparison without caching any intermediate results
leads to worse overall performance: in our PAN 2010 submission, one of the
post-processing steps was to remove all overlapping detections
from a given suspicious document when these detections came from different
source documents and were short enough. This removed many false positives
and improved the precision of our system. This kind of heuristic was
not possible in PAN 2012.
As for the performance of our system, we split the task into two parts:
1. finding the common features, and 2. computing the valid intervals and
postprocessing. The first part is more CPU-intensive, but its results
can be cached. The second part is fast enough to allow us to evaluate
many combinations of parameters.
We did our development on a machine with four six-core AMD 8139 CPUs
(2800 MHz) and 128 GB of RAM. The first phase took about 2,500 seconds
on this host, and the second phase took 14 seconds. Computing the
plagdet score using the official script in Python took between 120 and
180 seconds, as there is no parallelism in this script.
When we tried to use the intrinsic plagiarism detection and language
detection, the first phase took about 12,500 seconds. Omitting these
features thus clearly provided a huge performance improvement.
The code was written in Perl, and had about 669 lines of code,
not counting comments and blank lines.
Here is the content of the JOURNAL file, recording how some of the improvements were measured:
=================================================================
0.1250 0.1259 0.9783 2.4460 2012-05-03 06:02 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.8608 0.8609 0.8618 1.0009
03_artificial_low 0.1006 0.1118 0.9979 2.9974
04_artificial_high 0.0054 0.0029 0.9991 1.0778
05_translation 0.0003 0.0002 1.0000 1.2143
06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
valid_intervals without postprocessing (this is how it was first submitted)
0.3183 0.2034 0.9883 1.0850 2012-05-25 15:25 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9861 0.9973 0.9752 1.0000
03_artificial_low 0.4127 0.3006 0.9975 1.1724
04_artificial_high 0.0008 0.0004 1.0000 1.0000
05_translation 0.0001 0.0000 1.0000 1.0000
06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
postprocessed (merging of nearby intervals)
0.3350 0.2051 0.9863 1.0188 2012-05-25 15:27 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9863 0.9973 0.9755 1.0000
03_artificial_low 0.4541 0.3057 0.9942 1.0417
04_artificial_high 0.0008 0.0004 1.0000 1.0000
05_translation 0.0001 0.0000 1.0000 1.0000
06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
whitespace (adjusted whitespace handling)
0.3353 0.2053 0.9858 1.0188 2012-05-31 17:57 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9865 0.9987 0.9745 1.0000
03_artificial_low 0.4546 0.3061 0.9940 1.0417
04_artificial_high 0.0008 0.0004 1.0000 1.0000
05_translation 0.0001 0.0000 1.0000 1.0000
06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
gap_100: whitespace, + allowing a gap of 100 5-grams instead of 50 within a valid interval
0.3696 0.2305 0.9838 1.0148 2012-05-31 18:07 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9850 0.9987 0.9717 1.0000
03_artificial_low 0.5423 0.3846 0.9922 1.0310
04_artificial_high 0.0058 0.0029 0.9151 1.0000
05_translation 0.0001 0.0000 1.0000 1.0000
06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
gap_200: whitespace, + allowing a gap of 200 5-grams instead of 50 within a valid interval
0.3906 0.2456 0.9769 1.0070 2012-05-31 18:09 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9820 0.9987 0.9659 1.0000
03_artificial_low 0.5976 0.4346 0.9875 1.0139
04_artificial_high 0.0087 0.0044 0.9374 1.0000
05_translation 0.0001 0.0001 1.0000 1.0000
06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
gap_200_int_10: gap_200, + a valid interval requires at least 10 5-grams instead of 20
0.4436 0.2962 0.9660 1.0308 2012-05-31 18:11 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9612 0.9987 0.9264 1.0000
03_artificial_low 0.7048 0.5808 0.9873 1.0530
04_artificial_high 0.0457 0.0242 0.9762 1.0465
05_translation 0.0008 0.0004 1.0000 1.0000
06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
no_trans: gap_200_int_10, + not detecting translations at all, to avoid false positives
0.4432 0.2959 0.9658 1.0310 2012-06-01 16:41 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9608 0.9980 0.9263 1.0000
03_artificial_low 0.7045 0.5806 0.9872 1.0530
04_artificial_high 0.0457 0.0242 0.9762 1.0465
05_translation 0.0000 0.0000 0.0000 1.0000
06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
swng_unsorted with the same postprocessing as "whitespace" above
0.2673 0.1584 0.9281 1.0174 2012-05-31 14:20 plagdt recall precis granul
01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
02_no_obfuscation 0.9439 0.9059 0.9851 1.0000
03_artificial_low 0.3178 0.1952 0.9954 1.0377
04_artificial_high 0.0169 0.0095 0.9581 1.1707
05_translation 0.0042 0.0028 0.0080 1.0000
06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
0.2550 0.1906 0.4067 1.0253 2012-05-30 16:07 plagdt recall precis granul
01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
02_no_obfuscation 0.6648 0.9146 0.5222 1.0000
03_artificial_low 0.4093 0.2867 0.8093 1.0483
04_artificial_high 0.0454 0.0253 0.4371 1.0755
05_translation 0.0030 0.0019 0.0064 1.0000
06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
sort_susp: gap_200_int_10 + in postprocessing, sorting the intervals by offset in the suspicious document instead of the source document
0.4437 0.2962 0.9676 1.0308 2012-06-01 18:06 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9641 0.9987 0.9317 1.0000
03_artificial_low 0.7048 0.5809 0.9871 1.0530
04_artificial_high 0.0457 0.0242 0.9762 1.0465
05_translation 0.0008 0.0004 1.0000 1.0000
06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
post_gap2_16000: sort_susp, + merging two intervals when the gap is < 16000 characters and only half the size of those intervals (was 4000)
0.4539 0.2983 0.9642 1.0054 2012-06-01 18:09 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9631 0.9987 0.9300 1.0000
03_artificial_low 0.7307 0.5883 0.9814 1.0094
04_artificial_high 0.0480 0.0247 0.9816 1.0078
05_translation 0.0008 0.0004 1.0000 1.0000
06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
post_gap2_32000: sort_susp, + merging intervals < 32000 characters with a gap at least half their size
0.4543 0.2986 0.9638 1.0050 2012-06-01 18:12 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9628 0.9987 0.9294 1.0000
03_artificial_low 0.7315 0.5893 0.9798 1.0085
04_artificial_high 0.0480 0.0247 0.9816 1.0078
05_translation 0.0008 0.0004 1.0000 1.0000
06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
post_gap2_64000: sort_susp, + merging intervals < 32000 characters with a gap at least half their size
0.4543 0.2988 0.9616 1.0050 2012-06-01 18:21 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9603 0.9987 0.9248 1.0000
03_artificial_low 0.7316 0.5901 0.9782 1.0085
04_artificial_high 0.0480 0.0247 0.9816 1.0078
05_translation 0.0008 0.0004 1.0000 1.0000
06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
post_gap1_2000: post_gap2_32000, + unconditionally joining pieces with a gap under 2000 (was 600)
0.4543 0.2986 0.9635 1.0050 2012-06-01 18:29 plagdt recall precis granul
01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
02_no_obfuscation 0.9628 0.9987 0.9294 1.0000
03_artificial_low 0.7315 0.5895 0.9794 1.0085
04_artificial_high 0.0480 0.0247 0.9816 1.0078
05_translation 0.0008 0.0004 1.0000 1.0000
06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000