X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=yenya-detailed.tex;h=9493f7518d1764b22344f0985aa01a529872877e;hb=HEAD;hp=3615ab9b45e364f79d2562566d9488569701ab40;hpb=fb012b2add74c5aeee11754a3ae6df1394bfee25;p=pan12-paper.git

diff --git a/yenya-detailed.tex b/yenya-detailed.tex
old mode 100644
new mode 100755
index 3615ab9..9493f75
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -1,150 +1,488 @@
-\section{Detailed Document Comparison}
+\section{Detailed Document Comparison}~\label{yenya}
 
-\subsection{General Approach}
+\label{detailed}
 
-The approach Masaryk University team has used in PAN 2012 Plagiarism
-detection---detailed comparison sub-task is based on the same approach
-that we have used in PAN 2010 \cite{Kasprzak2010}.  This time, we have
-used a similar approach, enhanced by several means
+The detailed comparison task of PAN 2012 consisted of comparing
+given document pairs, with the expected output being the annotation of
+similarities found between these documents.
+The submitted program was run in a controlled environment
+separately for each document pair, without the possibility of keeping any
+cached data between runs.
+
+%In this section, we describe our approach in the detailed comparison
+%task. The rest of this section is organized as follows: in the next
+%subsection, we summarise the differences from our previous approach.
+%In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+%TODO describe how it will look in the end.
+
+\subsection{Differences from PAN 2010}
+
+Our approach in this task
+is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
+The main difference is that instead of looking for similarities of
+one type (in PAN 2010, we used word 5-grams),
+we developed a method of evaluating multiple types of similarities
+(we call them {\it common features}) with different properties, such as
+density and length.
+
+As a proof of concept, we used two types of common features: word
+5-grams and stop word 8-grams, the latter being based on the method described in
+\cite{stamatatos2011plagiarism}.
+
+In addition to the above, we made several minor improvements to the
+algorithm, such as parameter tuning and improving the merging of
+detections in the post-processing stage.
+
+\subsection{Algorithm Overview}
+\label{sec-alg-overview}
 
 The algorithm evaluates the document pair in several stages:
 
 \begin{itemize}
-\item intrinsic plagiarism detection
-\item language detection of the source document
+\item tokenizing both the suspicious and source documents
+\item forming {\it features} from some tokens
+\item discovering {\it common features}
+\item making {\it valid intervals} from common features
+\item postprocessing
+\end{itemize}
+
+\subsection{Tokenization}
+
+We tokenize the document into words, where a word is a sequence of one
+or more characters of the {\it Letter} Unicode class.
+Two additional attributes needed for further processing are associated
+with each word: the offset where the word begins, and the word length.
+
+The offset where the word begins is not necessarily the offset of the first
+letter character of the word itself. We discovered that in the training corpus
+some plagiarized passages were annotated including the preceding
+non-letter characters. We used the following heuristics to add
+parts of the inter-word gap to the previous or the next adjacent word:
+
+\begin{itemize}
+\item When the inter-word gap contains punctuation (any of the period,
+semicolon, colon, comma, exclamation mark, question mark, or quotes):
+\begin{itemize}
+\item add the characters up to and including the punctuation character
+to the previous word,
+\item ignore the space character(s) after the punctuation
+character,
+\item add the rest to the next word.
+\end{itemize}
+\item Otherwise, when the inter-word gap contains a newline:
+\begin{itemize}
+\item add the character before the first newline to the previous word,
+\item ignore the first newline character,
+\item add the rest to the next word.
+\end{itemize}
+\item Otherwise: ignore the inter-word gap characters altogether.
+\end{itemize}
+
+When the detection program was given three different
+files instead of two (meaning that the third one is a machine-translated
+version of the second one), we tokenized the translated document instead
+of the source one. We used the line-by-line alignment of the
+source and machine-translated documents to map the word offsets
+and lengths in the translated document to offsets and lengths in the
+source document.
+
+\subsection{Features}
+
+We used features of two types:
+
 \begin{itemize}
-\item cross-lingual plagiarism detection, if the source document is not in English
+\item Lexicographically sorted word 5-grams, formed of words at least
+three characters long.
+\item Unsorted stop word 8-grams, formed from the 50 most frequent words in English,
+as described in \cite{stamatatos2011plagiarism}. We further ignored
+8-grams formed solely from the six most frequent English words
+({\it the}, {\it of}, {\it and}, {\it a}, {\it in}, {\it to}), or the possessive {\it'{}s}.
 \end{itemize}
-\item detecting intervals with common features
-\item post-processing phase, mainly serves for merging the nearby common intervals
+
+We represented each feature with the 32 highest-order bits of its
+MD5 digest. This is only a performance optimization intended for
+larger systems. The number of features in a document pair is several orders
+of magnitude lower than $2^{32}$, so the probability of a hash
+collision is low. For pair-wise comparison, it would be feasible to compare
+the features directly instead of their MD5 sums.
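+
+As a rough illustration of why collisions are unlikely, assume (hypothetically)
+that the truncated digests behave as uniformly distributed 32-bit values and
+that the source and suspicious documents contain $n_{\mathrm{src}}$ and
+$n_{\mathrm{susp}}$ distinct features. The expected number of accidental
+matches between different features of the two documents is then about
+$$E \approx \frac{n_{\mathrm{src}} \, n_{\mathrm{susp}}}{2^{32}}.$$
+For example, with an assumed $10^4$ distinct features in each document,
+$E \approx 10^{8} / 2^{32} \approx 0.023$.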
+
+Each feature also has two attributes: offset and length.
+The offset is taken as the offset of the first word in a given feature,
+and the length is the offset of the last character in a given feature
+minus the offset of the feature itself.
+
+\subsection{Common Features}
+
+For further processing, we took into account only the features
+present in both the source and the suspicious document. For each such
+{\it common feature}, we created the list of
+$(\makebox{offset}, \makebox{length})$ pairs for the source document,
+and a similar list for the suspicious document. Note that a given feature
+can occur multiple times in both the source and the suspicious document.
+
+\subsection{Valid Intervals}
+
+To detect a plagiarized passage, we need to find a set of common features
+that maps to a dense enough interval in both the source and the suspicious
+document. In our previous work, we described the algorithm
+for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
+A similar approach is also used in \cite{stamatatos2011plagiarism}.
+Both of these algorithms use features of a single type, which
+allows the ordering of features to be used as a measure of distance.
+
+When we use features of different types, there is no natural ordering
+of them: for example, a stop word 8-gram can span multiple sentences,
+which can contain several word 5-grams. The assumption of both of the
+above algorithms, that the last character of the previous feature
+is before the last character of the current feature, no longer holds.
+
+We modified the algorithm for computing valid intervals with
+multi-feature detection to use character offsets
+only instead of feature order numbers. We used valid intervals
+consisting of at least 4 common features, with the maximum allowed gap
+inside the interval (characters not belonging to any common feature
+of a given valid interval) set to 4000 characters.
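+
+One possible way to state this condition formally (the notation is
+introduced here for illustration only and is not taken from
+\cite{Kasprzak2009a}): let the common features of a candidate interval,
+sorted by offset in one document, cover the character ranges
+$[o_i, o_i + l_i)$ for $i = 1, \dots, k$, and let
+$e_i = \max_{j \le i} (o_j + l_j)$ be the end of the text covered so far.
+Under this reading, the interval is valid in that document if
+$$k \ge 4 \quad \mbox{and} \quad o_{i+1} - e_i \le 4000
+\quad \mbox{for all } 1 \le i < k,$$
+and the same condition holds in the other document, with the features
+re-sorted by their offsets there. The running maximum $e_i$ accounts for
+features of different types overlapping: a long stop word 8-gram may end
+after several of the word 5-grams that start after it.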
+
+\subsection{Postprocessing}
+\label{postprocessing}
+
+In the postprocessing phase we took the resulting valid intervals
+and attempted to further improve the results. We first
+removed overlaps: if both overlapping intervals were
+shorter than 300 characters, we removed both of them. Otherwise, we
+kept the longer detection (longer in terms of length in the suspicious document).
+
+We then joined the adjacent valid intervals into one detection
+if at least one of the following criteria was met:
+\begin{itemize}
+\item the gap between the intervals contained at least 4 common features,
+and it contained at least one feature per 10,000
+characters\footnote{We computed the length of the gap as the number
+of characters between the detections in the source document, plus the
+number of characters between the detections in the suspicious document.}
+\item the gap was smaller than 30,000 characters and the size of the adjacent
+valid intervals was at least twice as big as the gap between them
+\item the gap was smaller than 30,000 characters and the number of common
+features per character in the adjacent interval was at most three times
+the number of features per character in the candidate joined interval.
+\end{itemize}
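+
+Written as conditions, under one possible reading of the criteria above
+(the symbols are introduced here for illustration only): let $g$ be the
+gap length as defined in the footnote, $c_g$ the number of common features
+inside the gap, $s$ the size of the adjacent valid intervals, and
+$d_{\mathrm{adj}}$ and $d_{\mathrm{join}}$ the numbers of common features
+per character in the adjacent interval and in the candidate joined interval.
+Two adjacent intervals are then joined if
+$$\begin{array}{l}
+(c_g \ge 4 \;\wedge\; c_g / g \ge 10^{-4}) \;\vee\; {} \\
+(g < 30000 \;\wedge\; s \ge 2g) \;\vee\; {} \\
+(g < 30000 \;\wedge\; d_{\mathrm{adj}} \le 3\, d_{\mathrm{join}}).
+\end{array}$$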
 
-\subsection{Intrinsic plagiarism detection}
+\subsection{Results}
+
+These parameters were fine-tuned to achieve the best results on the training
+corpus. With these parameters, our algorithm achieved a total plagdet score
+of 0.7288 on the training corpus. The details of the performance of
+our algorithm are presented in Table \ref{table-final}.
+In the PAN 2012 competition, we achieved a plagdet score
+of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000.
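+
+For reference, as we understand the definition in \cite{potthastframework},
+plagdet combines the other three measures as
+$$\mathrm{plagdet} = \frac{F_1}{\log_2(1 + \mathrm{granularity})},
+\qquad
+F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
+{\mathrm{precision} + \mathrm{recall}}.$$
+Checking this against the competition figures:
+$F_1 = 2 \cdot 0.8932 \cdot 0.5525 / (0.8932 + 0.5525) \approx 0.6827$,
+and with granularity $1.0000$ the denominator is $\log_2 2 = 1$, which
+gives a plagdet score of approximately $0.6827$.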
 
-Our approach is based on character $n$-gram profiles of the interval of
+\begin{table}
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+&plagdet&recall&precision&granularity\\
+\hline
+whole corpus&0.7288&0.5994&0.9306&1.0007\\
+\hline
+01-no-plagiarism    &0.0000&0.0000&0.0000&1.0000\\
+02-no-obfuscation   &0.9476&0.9627&0.9330&1.0000\\
+03-artificial-low   &0.8726&0.8099&0.9477&1.0013\\
+04-artificial-high  &0.3649&0.2255&0.9562&1.0000\\
+05-translation      &0.7610&0.6662&0.8884&1.0008\\
+06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
+\hline
+\end{tabular}
+\end{center}
+\caption{Performance on the training corpus}
+\label{table-final}
+\end{table}
+
+\subsection{Other Approaches Explored}
+
+We evaluated several other approaches that were
+omitted from our final submission for various reasons. We think mentioning
+them here is worthwhile nevertheless:
+
+\subsubsection{Intrinsic Plagiarism Detection}
+
+We tested an approach based on character $n$-gram profiles of intervals of
 a fixed size (in terms of $n$-grams), and their differences from the
 profile of the whole document \cite{pan09stamatatos}. We further
 enhanced the approach by applying Gaussian smoothing to the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and using the different measure of the difference between
-the n-gram profiles. We have used an approach similar to \cite{ngram},
-where we have compute the profile as an ordered set of 400 most-frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used the value inveresly
-proportional to the $n$-gram order in the profile, in accordance with
-the Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided more stable style-change function than
-than the one proposed in \cite{pan09stamatatos}. Because of pair-wise
-nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, so we wanted to use them
-as hints to the external detection.
-
-\subsection{Cross-lingual detection}
+function \cite{Kasprzak2010}. For PAN 2012, we made further improvements
+to the algorithm, resulting in a more stable style-change function in
+both short and long documents.
+
+We tried to use the results of the intrinsic plagiarism detection
+as a hint for the post-processing phase, allowing larger
+intervals to be merged if both belong to the same passage detected by
+the intrinsic detector. This approach did not provide any improvement
+when compared to the static gap limits described in Section
+\ref{postprocessing}, therefore we omitted it from our final submission.
 
+%\subsubsection{Language Detection}
+%
 %For language detection, we used the $n$-gram based categorization \cite{ngram}.
-%We have computed the language profiles from the source documents of the
+%We computed the language profiles from the source documents of the
 %training corpus (using the annotations from the corpus itself). The result
 %of this approach was better than using the stopwords-based detection we have
 %used in PAN 2010. However, there were still mis-detected documents,
-%mainly the long lists of surnames and other tabular data. We have added
+%mainly the long lists of surnames and other tabular data. We added
 %an ad-hoc fix, where for documents having their profile too distant from all of
-%English, German, and Spanish profiles, we have declared them to be in English.
+%English, German, and Spanish profiles, we declared them to be in English.
+
+\subsubsection{Cross-lingual Plagiarism Detection}
 
 For cross-lingual plagiarism detection, our aim was to use the public
-interface to Google translate if possible, and use the resulting document
+interface to Google Translate\footnote{\url{http://translate.google.com/}} if possible, and use the resulting document
 as the source for standard intra-lingual detector.
 Should the translation service not be available, we wanted
 to use the fall-back strategy of translating isolated words only,
 with the additional exact matching of longer words (we have used words with
 5 characters or longer).
-We have supposed these longer words can be names or specialized terms,
+We assumed that these longer words can be names or specialized terms,
 present in both languages.
 
-We have used dictionaries from several sources, like
-{\tt dicts.info\footnote{\url{http://www.dicts.info/}}},
-{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}},
-and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source
-and translated document were aligned on a line-by-line basis.
-
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document.
-
-\subsection{Multi-feature Plagiarism Detection}
-
-Our pair-wise plagiarism detection is based on finding common passages
-of text, present both in the source and suspicious document. We call them
-{\it features}. In PAN 2010, we have used sorted word 5-grams, formed from
-words of three or more characters, as features to compare.
-Recently, other means of plagiarism detection have been explored:
-Stop-word $n$-gram detection is one of them
-\cite{stamatatos2011plagiarism}.
+We used dictionaries from several sources, for example
+{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
+{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
 
-We propose the plagiarism detection system based on detecting common
-features of various type, like word $n$-grams, stopword $n$-grams,
-translated words or word bigrams, exact common longer words from document
-pairs having each document in a different language, etc. The system
-has to be to the great extent independent of the specialities of various
-feature types. It cannot, for example, use the order of given features
-as a measure of distance between the features, as for example, several
-word 5-grams can be fully contained inside one stopword 8-gram.
-
-We thus define {\it common feature} of two documents (susp and src)
-as the following tuple:
-$$\langle
-\hbox{offset}_{\hbox{susp}},
-\hbox{length}_{\hbox{susp}},
-\hbox{offset}_{\hbox{src}},
-\hbox{length}_{\hbox{src}} \rangle$$
-
-In our final submission, we have used only the following two types
-of common features:
+In the final submission, we simply used the machine-translated texts,
+which were provided to the running program by the surrounding environment.
 
-\begin{itemize}
-\item word 5-grams, from words of three or more characters, sorted, lowercased
-\item stop-word 8-grams, from 50 most-frequent English words (including
-	the possessive suffix 's), unsorted, lowercased, with 8-grams formed
-	only from the seven most-frequent words ({\it the, of, a, in, to, 's})
-	removed
-\end{itemize}
 
-We have gathered all the common features for a given document pair, and formed
-{\it valid intervals} from them, as described in \cite{Kasprzak2009a}
-(a similar approach is used also in \cite{stamatatos2011plagiarism}).
-The algorithm is modified for multi-feature detection to use character offsets
-only instead of feature order numbers. We have used valid intervals
-consisting of at least 5 common features, with the maximum allowed gap
-inside the interval (characters not belonging to any common feature
-of a given valid interval) set to 3,500 characters.
 \subsection{Further Discussion}
 
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
+From our previous PAN submissions, we knew that the precision of our
+system was good, and because of the way the final score is computed, we
+were willing to trade slightly worse precision for better recall and granularity.
+So we pushed the parameters towards detecting more plagiarized passages,
+even when the number of common features was not especially high.
 
-\subsection{Postprocessing}
+\subsubsection{Plagdet Score}
 
+Our results from tuning the parameters show that the plagdet score~\cite{potthastframework}
+is not a good measure for comparing plagiarism detection systems:
+for example, the gap of 30,000 characters, described in Section \ref{postprocessing},
+can easily mean several pages of text. Still, setting this
+parameter so high resulted in a better plagdet score.
 
-\subsection{Further discussion}
+Another problem of plagdet can be
+seen in the 01-no-plagiarism part of the training corpus: the border
+between the perfect score 1 and the score 0 is a single false-positive
+detection. Plagdet does not distinguish between the system reporting this
+single false-positive, and the system reporting the whole data as plagiarized.
+Both get the score 0. However, our experience with real-world plagiarism detection systems shows that
+plagiarized documents are in a clear minority, so the performance of
+the detection system on non-plagiarized documents is very important.
 
-In the full paper, we will also discuss the following topics:
+\subsubsection{Performance Notes}
+
+We consider comparing the CPU-time performance of PAN 2012 submissions almost
+meaningless, because any sane system would precompute features for all
+documents in a given set of suspicious and source documents, and use the
+results for pair-wise comparison, expecting that any document will take
+part in more than one pair.
+
+Also, pair-wise comparison without caching any intermediate results
+leads to worse overall performance: in our PAN 2010 submission, one of the
+post-processing steps was to remove all the overlapping detections
+from a given suspicious document, when these detections were from different
+source documents, and were short enough. This removed many false positives
+and improved the precision of our system. This kind of heuristic was
+not possible in PAN 2012.
+
+As for the performance of our system, we split the task into two parts:
+(1)~finding the common features, and (2)~computing valid intervals and
+postprocessing. The first part is more CPU intensive, and the results
+can be cached. The second part is fast enough to allow us to evaluate
+many combinations of parameters.
+
+We did our development on a machine with four six-core AMD 8139 CPUs
+(2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
+on this host, and the second phase took 14 seconds. Computing the
+plagdet score using the official script in Python took between 120 and
+180 seconds, as there is no parallelism in this script.
+
+When we tried to use intrinsic plagiarism detection and language
+detection, the first phase took about 12500 seconds. Thus omitting these
+features clearly provided a huge performance improvement.
+
+The code was written in Perl, and had about 669 lines of code,
+not counting comments and blank lines.
+
+\endinput
+
+- the boundary between passages sometimes included whitespace and sometimes did not.
+
+Plagdet discussion:
+- users want "odevzdej to show 0\% similarity", they do not care
+	what the number means
+- the boundaries of the detected passage do not matter
+- false positives are far worse
+- granularity is evil
+
+Final results on the test corpus:
+
+0.7288 0.5994 0.9306 1.0007   2012-06-16 02:23   plagdt recall precis granul
+                            01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
+                            03-artificial-low    0.8726 0.8099 0.9477 1.0013
+                            04-artificial-high   0.3649 0.2255 0.9562 1.0000
+                            05-translation       0.7610 0.6662 0.8884 1.0008
+                            06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
+
+Results on the competition data:
+plagdet         precision       recall          granularity
+0.6826726	0.8931670	0.5524708	1.0000000
+
+Run-time:
+12500 seconds: tokenization including sc and language detection
+2500 seconds: without sc and language detection
+14 seconds: evaluation of valid intervals and postprocessing
+
+
+TODO:
+- boundary according to the matching density
+- sort the xml by this_offset
+
+Here is the content of the JOURNAL file - how I measured some of the improvements:
+=================================================================
+baseline.py
+0.1250 0.1259 0.9783 2.4460   2012-05-03 06:02   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
+                            03_artificial_low    0.1006 0.1118 0.9979 2.9974
+                            04_artificial_high   0.0054 0.0029 0.9991 1.0778
+                            05_translation       0.0003 0.0002 1.0000 1.2143
+                            06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
+
+valid_intervals without postprocessing (this is how I first submitted it)
+0.3183 0.2034 0.9883 1.0850   2012-05-25 15:25   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
+                            03_artificial_low    0.4127 0.3006 0.9975 1.1724
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
+
+postprocessed (merging of nearby intervals)
+0.3350 0.2051 0.9863 1.0188   2012-05-25 15:27   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
+                            03_artificial_low    0.4541 0.3057 0.9942 1.0417
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
+
+whitespace (whitespace adjustment)
+0.3353 0.2053 0.9858 1.0188   2012-05-31 17:57   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
+                            03_artificial_low    0.4546 0.3061 0.9940 1.0417
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
+
+gap_100: whitespace, + allow a gap of 100 5-grams inside a valid interval instead of 50
+0.3696 0.2305 0.9838 1.0148   2012-05-31 18:07   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
+                            03_artificial_low    0.5423 0.3846 0.9922 1.0310
+                            04_artificial_high   0.0058 0.0029 0.9151 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
+
+gap_200: whitespace, + allow a gap of 200 5-grams inside a valid interval instead of 50
+0.3906 0.2456 0.9769 1.0070   2012-05-31 18:09   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
+                            03_artificial_low    0.5976 0.4346 0.9875 1.0139
+                            04_artificial_high   0.0087 0.0044 0.9374 1.0000
+                            05_translation       0.0001 0.0001 1.0000 1.0000
+                            06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
+
+gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
+0.4436 0.2962 0.9660 1.0308   2012-05-31 18:11   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
+                            03_artificial_low    0.7048 0.5808 0.9873 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
+0.4432 0.2959 0.9658 1.0310   2012-06-01 16:41   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
+                            03_artificial_low    0.7045 0.5806 0.9872 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0000 0.0000 0.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+
+swng_unsorted with the same postprocessing as "whitespace" above
+0.2673 0.1584 0.9281 1.0174   2012-05-31 14:20   plagdt recall precis granul
+                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
+                            03_artificial_low    0.3178 0.1952 0.9954 1.0377
+                            04_artificial_high   0.0169 0.0095 0.9581 1.1707
+                            05_translation       0.0042 0.0028 0.0080 1.0000
+                            06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
+
+swng_sorted
+0.2550 0.1906 0.4067 1.0253   2012-05-30 16:07   plagdt recall precis granul
+                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
+                            03_artificial_low    0.4093 0.2867 0.8093 1.0483
+                            04_artificial_high   0.0454 0.0253 0.4371 1.0755
+                            05_translation       0.0030 0.0019 0.0064 1.0000
+                            06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
+
+sort_susp: gap_200_int_10 + in postprocessing, intervals are sorted by offset in susp, not in src
+0.4437 0.2962 0.9676 1.0308   2012-06-01 18:06   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
+                            03_artificial_low    0.7048 0.5809 0.9871 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+post_gap2_16000: sort_susp, + merge two intervals if < 16000 characters and the gap is only half the size of those intervals (was 4000)
+0.4539 0.2983 0.9642 1.0054   2012-06-01 18:09   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
+                            03_artificial_low    0.7307 0.5883 0.9814 1.0094
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
+
+post_gap2_32000: sort_susp, + merge intervals < 32000 characters with the gap at least half the size
+0.4543 0.2986 0.9638 1.0050   2012-06-01 18:12   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+                            03_artificial_low    0.7315 0.5893 0.9798 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap2_64000: sort_susp, + merge intervals < 32000 characters with the gap at least half the size
+0.4543 0.2988 0.9616 1.0050   2012-06-01 18:21   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
+                            03_artificial_low    0.7316 0.5901 0.9782 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap1_2000: post_gap2_32000, + unconditionally join detections with a gap below 2000 (was 600)
+0.4543 0.2986 0.9635 1.0050   2012-06-01 18:29   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+                            03_artificial_low    0.7315 0.5895 0.9794 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
 
-\begin{itemize}
-\item language detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item other possible features to use, especially for cross-lingual detection
-\item discussion of parameter settings
-\end{itemize}