X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=yenya-detailed.tex;h=9493f7518d1764b22344f0985aa01a529872877e;hb=HEAD;hp=3615ab9b45e364f79d2562566d9488569701ab40;hpb=fb012b2add74c5aeee11754a3ae6df1394bfee25;p=pan12-paper.git diff --git a/yenya-detailed.tex b/yenya-detailed.tex old mode 100644 new mode 100755 index 3615ab9..9493f75 --- a/yenya-detailed.tex +++ b/yenya-detailed.tex @@ -1,150 +1,488 @@ -\section{Detailed Document Comparison} +\section{Detailed Document Comparison}~\label{yenya} -\subsection{General Approach} +\label{detailed} -The approach Masaryk University team has used in PAN 2012 Plagiarism -detection---detailed comparison sub-task is based on the same approach -that we have used in PAN 2010 \cite{Kasprzak2010}. This time, we have -used a similar approach, enhanced by several means +The detailed comparison task of PAN 2012 consisted of comparing +given document pairs, with the expected output being the annotation of +similarities found between these documents. +The submitted program ran in a controlled environment +separately for each document pair, without the possibility of keeping any +cached data between runs. + +%In this section, we describe our approach in the detailed comparison +%task. The rest of this section is organized as follows: in the next +%subsection, we summarise the differences from our previous approach. +%In subsection \ref{sec-alg-overview}, we give an overview of our approach. +%TODO describe how it turns out in the end. + +\subsection{Differences from PAN 2010} + +Our approach in this task +is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}. +The main difference is that instead of looking for similarities of +one type (for PAN 2010, we used word 5-grams), +we developed a method of evaluating multiple types of similarities +(we call them {\it common features}) of different properties, such as +density and length. 
+ +As a proof of concept, we used two types of common features: word +5-grams and stop word 8-grams, the latter being based on the method described in +\cite{stamatatos2011plagiarism}. + +In addition to the above, we made several minor improvements to the +algorithm, such as parameter tuning and improving the merging of +detections in the post-processing stage. + +\subsection{Algorithm Overview} +\label{sec-alg-overview} The algorithm evaluates the document pair in several stages: \begin{itemize} -\item intrinsic plagiarism detection -\item language detection of the source document +\item tokenizing both the suspicious and source documents +\item forming {\it features} from some tokens +\item discovering {\it common features} +\item making {\it valid intervals} from common features +\item postprocessing +\end{itemize} + +\subsection{Tokenization} + +We tokenize the document into words, where a word is a sequence of one +or more characters of the {\it Letter} Unicode class. +Two additional attributes needed for further processing +are associated with each word: the offset where the word begins, and the word length. + +The offset where the word begins is not necessarily the first letter character +of the word itself. We discovered that in the training corpus +some plagiarized passages were annotated including the preceding +non-letter characters. We used the following heuristics to add +parts of the inter-word gap to the previous or the next adjacent word: + +\begin{itemize} +\item When the inter-word gap contains punctuation (any of the period, +semicolon, colon, comma, exclamation mark, question mark, or quotes): +\begin{itemize} +\item add the characters up to and including the punctuation character +to the previous word, +\item ignore the space character(s) after the punctuation +character, +\item add the rest to the next word. 
+\end{itemize} +\item Otherwise, when the inter-word gap contains a newline: +\begin{itemize} +\item add the character before the first newline to the previous word, +\item ignore the first newline character, +\item add the rest to the next word. +\end{itemize} +\item Otherwise: ignore the inter-word gap characters altogether. +\end{itemize} + +When the detection program was given three different +files instead of two (meaning the third one is a machine-translated +version of the second one), we tokenized the translated document instead +of the source one. We used the line-by-line alignment of the +source and machine-translated documents to transform the word offsets +and lengths in the translated document into terms of the source document. + +\subsection{Features} + +We used features of two types: + \begin{itemize} -\item cross-lingual plagiarism detection, if the source document is not in English +\item Lexicographically sorted word 5-grams, formed from words at least +three characters long. +\item Unsorted stop word 8-grams, formed from the 50 most frequent words in English, +as described in \cite{stamatatos2011plagiarism}. We further ignored +the 8-grams formed solely from the six most frequent English words +({\it the}, {\it of}, {\it and}, {\it a}, {\it in}, {\it to}), or the possessive {\it'{}s}. \end{itemize} -\item detecting intervals with common features -\item post-processing phase, mainly serves for merging the nearby common intervals + +We represented each feature with the 32 highest-order bits of its +MD5 digest. This is only a performance optimization targeted at +larger systems. The number of features in a document pair is several orders +of magnitude lower than $2^{32}$, thus the probability of hash function +collision is low. For pair-wise comparison, it would be feasible to compare +the features directly instead of their MD5 sums. + +Each feature also has two attributes: offset and length. 
+Offset is taken as the offset of the first word in a given feature, +and length is the offset of the last character in a given feature +minus the offset of the feature itself. + +\subsection{Common Features} + +For further processing, we took into account only the features +present in both the source and the suspicious document. For each such +{\it common feature}, we created the list of +$(\makebox{offset}, \makebox{length})$ pairs for the source document, +and a similar list for the suspicious document. Note that a given feature +can occur multiple times in both the source and the suspicious document. + +\subsection{Valid Intervals} + +To detect a plagiarized passage, we need to find a set of common features +that map to a dense-enough interval in both the source and the suspicious +document. In our previous work, we described the algorithm +for discovering these {\it valid intervals} \cite{Kasprzak2009a}. +A similar approach is also used in \cite{stamatatos2011plagiarism}. +Both of these algorithms use features of a single type, which +allows the ordering of features to be used as a measure of distance. + +When we use features of different types, there is no natural ordering +of them: for example, a stop word 8-gram can span multiple sentences, +which can contain several word 5-grams. The assumption of both of the +above algorithms, that the last character of the previous feature +is before the last character of the current feature, is broken. + +We modified the algorithm for computing valid intervals with +multi-feature detection to use character offsets +only instead of feature order numbers. We used valid intervals +consisting of at least 4 common features, with the maximum allowed gap +inside the interval (characters not belonging to any common feature +of a given valid interval) set to 4000 characters. + +\subsection{Postprocessing} +\label{postprocessing} + +In the postprocessing phase we took the resulting valid intervals +and made an attempt to further improve the results. 
We first +removed overlaps: if both overlapping intervals were +shorter than 300 characters, we removed both of them. Otherwise, we +kept the longer detection (longer in terms of length in the suspicious document). + +We then joined the adjacent valid intervals into one detection, +if at least one of the following criteria was met: +\begin{itemize} +\item the gap between the intervals contained at least 4 common features, +and it contained at least one feature per 10,000 +characters\footnote{We computed the length of the gap as the number +of characters between the detections in the source document, plus the +number of characters between the detections in the suspicious document.} +\item the gap was smaller than 30,000 characters and the size of the adjacent +valid intervals was at least twice as big as the gap between them +\item the gap was smaller than 30,000 characters and the number of common +features per character in the adjacent interval was not more than three times +bigger than the number of features per character in the possible joined interval. \end{itemize} -\subsection{Intrinsic plagiarism detection} +\subsection{Results} + +These parameters were fine-tuned to achieve the best results on the training +corpus. With these parameters, our algorithm achieved a total plagdet score +of 0.7288 on the training corpus. The details of the performance of +our algorithm are presented in Table \ref{table-final}. +In the PAN 2012 competition, we achieved the plagdet score +of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000. 
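The scores reported above can be cross-checked against the definition of plagdet in the PAN evaluation framework \cite{potthastframework}: the harmonic mean of precision and recall, divided by $\log_2(1 + \mbox{granularity})$. The following is only a minimal sketch of that formula, not the official evaluation script; the function name and the zero-score convention for an undefined harmonic mean are our own assumptions.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Harmonic mean of precision and recall, penalized by granularity."""
    if precision + recall == 0:
        # Assumption: report 0 when the harmonic mean is undefined.
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# PAN 2012 competition figures quoted in the text
print(round(plagdet(0.8932, 0.5525, 1.0000), 4))
# whole training corpus figures from the results table
print(round(plagdet(0.9306, 0.5994, 1.0007), 4))
```

With the rounded inputs quoted in the text, the two printed values should agree with the reported plagdet scores up to rounding.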
-Our approach is based on character $n$-gram profiles of the interval of +\begin{table} +\begin{center} +\begin{tabular}{|l|r|r|r|r|} +\hline +&plagdet&recall&precision&granularity\\ +\hline +whole corpus&0.7288&0.5994&0.9306&1.0007\\ +\hline +01-no-plagiarism &0.0000&0.0000&0.0000&1.0000\\ +02-no-obfuscation &0.9476&0.9627&0.9330&1.0000\\ +03-artificial-low &0.8726&0.8099&0.9477&1.0013\\ +04-artificial-high &0.3649&0.2255&0.9562&1.0000\\ +05-translation &0.7610&0.6662&0.8884&1.0008\\ +06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\ +\hline +\end{tabular} +\end{center} +\caption{Performance on the training corpus} +\label{table-final} +\end{table} + +\subsection{Other Approaches Explored} + +There are several other approaches we evaluated, but which were +omitted from our final submission for various reasons. We think mentioning +them here is worthwhile nevertheless: + +\subsubsection{Intrinsic Plagiarism Detection} + +We tested the approach based on character $n$-gram profiles of the interval of the fixed size (in terms of $n$-grams), and their differences to the profile of the whole document \cite{pan09stamatatos}. We have further enhanced the approach with using gaussian smoothing of the style-change -function \cite{Kasprzak2010}. - -For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead -of only 3-grams, and using the different measure of the difference between -the n-gram profiles. We have used an approach similar to \cite{ngram}, -where we have compute the profile as an ordered set of 400 most-frequent -$n$-grams in a given text (the whole document or a partial window). Apart -from ordering the set we have ignored the actual number of occurrences -of a given $n$-gram altogether, and used the value inveresly -proportional to the $n$-gram order in the profile, in accordance with -the Zipf's law \cite{zipf1935psycho}. - -This approach has provided more stable style-change function than -than the one proposed in \cite{pan09stamatatos}. 
Because of pair-wise -nature of the detailed comparison sub-task, we couldn't use the results -of the intrinsic detection immediately, so we wanted to use them -as hints to the external detection. - -\subsection{Cross-lingual detection} +function \cite{Kasprzak2010}. For PAN 2012, we made further improvements +to the algorithm, resulting in a more stable style-change function in +both short and long documents. + +We tried to use the results of the intrinsic plagiarism detection +as a hint for the post-processing phase, allowing larger +intervals to be merged if they both belong to the same passage detected by +the intrinsic detector. This approach did not provide any improvement +when compared to the static gap limits, as described in Section +\ref{postprocessing}, so we omitted it from our final submission. +%\subsubsection{Language Detection} +% %For language detection, we used the $n$-gram based categorization \cite{ngram}. -%We have computed the language profiles from the source documents of the +%We computed the language profiles from the source documents of the %training corpus (using the annotations from the corpus itself). The result %of this approach was better than using the stopwords-based detection we have %used in PAN 2010. However, there were still mis-detected documents, -%mainly the long lists of surnames and other tabular data. We have added +%mainly the long lists of surnames and other tabular data. We added %an ad-hoc fix, where for documents having their profile too distant from all of -%English, German, and Spanish profiles, we have declared them to be in English. +%English, German, and Spanish profiles, we declared them to be in English. 
+ +\subsubsection{Cross-lingual Plagiarism Detection} For cross-lingual plagiarism detection, our aim was to use the public -interface to Google translate if possible, and use the resulting document +interface to Google Translate\footnote{\url{http://translate.google.com/}} if possible, and use the resulting document as the source for standard intra-lingual detector. Should the translation service not be available, we wanted to use the fall-back strategy of translating isolated words only, with the additional exact matching of longer words (we have used words with 5 characters or longer). -We have supposed these longer words can be names or specialized terms, +We have supposed that these longer words can be names or specialized terms, present in both languages. -We have used dictionaries from several sources, like -{\tt dicts.info\footnote{\url{http://www.dicts.info/}}}, -{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}}, -and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source -and translated document were aligned on a line-by-line basis. - -In the final form of the detailed comparison sub-task, the results of machine -translation of the source documents were provided to the detector programs -by the surrounding environment, so we have discarded the language detection -and machine translation from our submission altogether, and used only -line-by-line alignment of the source and translated document for calculating -the offsets of text features in the source document. - -\subsection{Multi-feature Plagiarism Detection} - -Our pair-wise plagiarism detection is based on finding common passages -of text, present both in the source and suspicious document. We call them -{\it features}. In PAN 2010, we have used sorted word 5-grams, formed from -words of three or more characters, as features to compare. 
-Recently, other means of plagiarism detection have been explored: -Stop-word $n$-gram detection is one of them -\cite{stamatatos2011plagiarism}. +We used dictionaries from several sources, for example +{\it dicts.info}\footnote{\url{http://www.dicts.info/}}, +{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}}, +and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. -We propose the plagiarism detection system based on detecting common -features of various type, like word $n$-grams, stopword $n$-grams, -translated words or word bigrams, exact common longer words from document -pairs having each document in a different language, etc. The system -has to be to the great extent independent of the specialities of various -feature types. It cannot, for example, use the order of given features -as a measure of distance between the features, as for example, several -word 5-grams can be fully contained inside one stopword 8-gram. - -We thus define {\it common feature} of two documents (susp and src) -as the following tuple: -$$\langle -\hbox{offset}_{\hbox{susp}}, -\hbox{length}_{\hbox{susp}}, -\hbox{offset}_{\hbox{src}}, -\hbox{length}_{\hbox{src}} \rangle$$ - -In our final submission, we have used only the following two types -of common features: +In the final submission, we simply used the machine translated texts, +which were provided to the running program from the surrounding environment. -\begin{itemize} -\item word 5-grams, from words of three or more characters, sorted, lowercased -\item stop-word 8-grams, from 50 most-frequent English words (including - the possessive suffix 's), unsorted, lowercased, with 8-grams formed - only from the seven most-frequent words ({\it the, of, a, in, to, 's}) - removed -\end{itemize} -We have gathered all the common features for a given document pair, and formed -{\it valid intervals} from them, as described in \cite{Kasprzak2009a} -(a similar approach is used also in \cite{stamatatos2011plagiarism}). 
-The algorithm is modified for multi-feature detection to use character offsets -only instead of feature order numbers. We have used valid intervals -consisting of at least 5 common features, with the maximum allowed gap -inside the interval (characters not belonging to any common feature -of a given valid interval) set to 3,500 characters. +\subsection{Further discussion} -We have also experimented with modifying the allowed gap size using the -intrinsic plagiarism detection: to allow only shorter gap if the common -features around the gap belong to different passages, detected as plagiarized -in the suspicious document by the intrinsic detector, and allow larger gap, -if both the surrounding common features belong to the same passage, -detected by the intrinsic detector. This approach, however, did not show -any improvement against allowed gap of a static size, so it was omitted -from the final submission. +From our previous PAN submissions, we knew that the precision of our +system was good, and because of the way the final score is computed, we +wanted to trade slightly worse precision for better recall and granularity. +So we pushed the parameters towards detecting more plagiarized passages, +even when the number of common features was not especially high. -\subsection{Postprocessing} +\subsubsection{Plagdet score} +Our results from tuning the parameters show that the plagdet score~\cite{potthastframework} +is not a good measure for comparing plagiarism detection systems: +for example, the gap of 30,000 characters, described in Section \ref{postprocessing}, +can easily mean several pages of text. Still, the system with this +parameter set so high achieved a better plagdet score. -\subsection{Further discussion} +Another problem with plagdet can be +seen in the 01-no-plagiarism part of the training corpus: the border +between the perfect score 1 and the score 0 is a single false-positive +detection. 
Plagdet does not distinguish between a system reporting this +single false positive and a system reporting the whole data set as plagiarized. +Both get the score 0. However, our experience from real-world plagiarism detection systems shows that +the plagiarized documents are in a clear minority, so the performance of +the detection system on non-plagiarized documents is very important. -In the full paper, we will also discuss the following topics: +\subsubsection{Performance Notes} + +We consider comparing the CPU-time performance of PAN 2012 submissions almost +meaningless, because any sane system would precompute features for all +documents in a given set of suspicious and source documents, and use the +results for pair-wise comparison, expecting that any document will take +part in more than one pair. + +Also, the pair-wise comparison without caching any intermediate results +leads to worse overall performance: in our PAN 2010 submission, one of the +post-processing steps was to remove all the overlapping detections +from a given suspicious document, when these detections were from different +source documents, and were short enough. This removed many false positives +and improved the precision of our system. This kind of heuristic was +not possible in PAN 2012. + +As for the performance of our system, we split the task into two parts: +1. finding the common features, and 2. computing valid intervals and +postprocessing. The first part is more CPU intensive, and the results +can be cached. The second part is fast enough to allow us to evaluate +many combinations of parameters. + +We did our development on a machine with four six-core AMD 8139 CPUs +(2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds +on this host, and the second phase took 14 seconds. Computing the +plagdet score using the official script in Python took between 120 and +180 seconds, as there is no parallelism in this script. 
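The first of the two phases named above (finding the common features) can be illustrated for one feature type, the sorted word 5-grams. This is a hedged Python sketch, not our Perl implementation: the function names are hypothetical, lowercasing is an assumption, the regular expression only approximates the Unicode {\it Letter} class, and the inter-word gap heuristics from the tokenization step are left out. The per-document feature tables are what a larger system would precompute and cache.

```python
import hashlib
import re
from typing import Dict, List, Tuple

# Runs of letters; approximates the Unicode Letter class (assumption).
WORD_RE = re.compile(r"[^\W\d_]+")

Occurrences = Dict[int, List[Tuple[int, int]]]


def feature_hash(text: str) -> int:
    """Return the 32 highest-order bits of the MD5 digest of text."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def word_5gram_features(text: str) -> Occurrences:
    """Map each hashed, sorted word 5-gram to its (offset, length) pairs."""
    # Keep words of at least three characters; lowercasing is an assumption.
    words = [(m.group().lower(), m.start(), m.end())
             for m in WORD_RE.finditer(text) if len(m.group()) >= 3]
    features: Occurrences = {}
    for i in range(len(words) - 4):
        window = words[i:i + 5]
        key = feature_hash(" ".join(sorted(w for w, _, _ in window)))
        offset = window[0][1]
        # Offset of the last character minus the offset of the feature.
        length = (window[-1][2] - 1) - offset
        features.setdefault(key, []).append((offset, length))
    return features


def common_features(susp: Occurrences, src: Occurrences):
    """Keep only the features present in both documents."""
    return {h: (susp[h], src[h]) for h in susp.keys() & src.keys()}


susp = word_5gram_features("The quick brown fox jumps over the lazy dog")
src = word_5gram_features("A lazy dog: over the quick brown fox jumps!")
print(common_features(susp, src))
```

Because each feature table depends on a single document only, it can be computed once per document and reused for every pair in which that document takes part; only the inexpensive intersection has to run per pair.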
+
+When we tried to use intrinsic plagiarism detection and language
+detection, the first phase took about 12500 seconds. Omitting these
+features thus clearly provided a huge performance improvement.
+
+The code was written in Perl, and had about 669 lines of code,
+not counting comments and blank lines.
+
+\endinput
+
+- the boundary between passages sometimes included whitespace and sometimes did not.
+
+Plagdet discussion:
+- users want "odevzdej to show 0\% similarity"; they do not care
+  what the number means
+- the boundaries of the detected passage do not matter
+- false positives are far worse
+- granularity is evil
+
+Final results on the test corpus:
+
+0.7288 0.5994 0.9306 1.0007 2012-06-16 02:23 plagdt recall precis granul
+ 01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
+ 02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
+ 03-artificial-low    0.8726 0.8099 0.9477 1.0013
+ 04-artificial-high   0.3649 0.2255 0.9562 1.0000
+ 05-translation       0.7610 0.6662 0.8884 1.0008
+ 06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
+
+Results on the competition data:
+plagdet precision recall granularity
+0.6826726 0.8931670 0.5524708 1.0000000
+
+Run-time:
+12500 seconds: tokenization including sc and language detection
+2500 seconds: without sc and language detection
+14 seconds: evaluation of valid intervals and postprocessing
+
+
+TODO:
+- boundary according to the matching density
+- sort the xml by this_offset
+
+Here is the content of the JOURNAL file - how I measured some of the improvements:
+=================================================================
+baseline.py
+0.1250 0.1259 0.9783 2.4460 2012-05-03 06:02 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
+ 03_artificial_low    0.1006 0.1118 0.9979 2.9974
+ 04_artificial_high   0.0054 0.0029 0.9991 1.0778
+ 05_translation       0.0003 0.0002 1.0000 1.2143
+ 06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
+
+valid_intervals without postprocessing (this is how I first submitted it)
+0.3183 0.2034 0.9883 1.0850 2012-05-25 15:25 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
+ 03_artificial_low    0.4127 0.3006 0.9975 1.1724
+ 04_artificial_high   0.0008 0.0004 1.0000 1.0000
+ 05_translation       0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
+
+postprocessed (merging of nearby intervals)
+0.3350 0.2051 0.9863 1.0188 2012-05-25 15:27 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
+ 03_artificial_low    0.4541 0.3057 0.9942 1.0417
+ 04_artificial_high   0.0008 0.0004 1.0000 1.0000
+ 05_translation       0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
+
+whitespace (whitespace adjustment)
+0.3353 0.2053 0.9858 1.0188 2012-05-31 17:57 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
+ 03_artificial_low    0.4546 0.3061 0.9940 1.0417
+ 04_artificial_high   0.0008 0.0004 1.0000 1.0000
+ 05_translation       0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
+
+gap_100: whitespace, + in a valid interval I allow a gap of 100 5-grams instead of 50
+0.3696 0.2305 0.9838 1.0148 2012-05-31 18:07 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
+ 03_artificial_low    0.5423 0.3846 0.9922 1.0310
+ 04_artificial_high   0.0058 0.0029 0.9151 1.0000
+ 05_translation       0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
+
+gap_200: whitespace, + in a valid interval I allow a gap of 200 5-grams instead of 50
+0.3906 0.2456 0.9769 1.0070 2012-05-31 18:09 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
+ 03_artificial_low    0.5976 0.4346 0.9875 1.0139
+ 04_artificial_high   0.0087 0.0044 0.9374 1.0000
+ 05_translation       0.0001 0.0001 1.0000 1.0000
+ 06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
+
+gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
+0.4436 0.2962 0.9660 1.0308 2012-05-31 18:11 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
+ 03_artificial_low    0.7048 0.5808 0.9873 1.0530
+ 04_artificial_high   0.0457 0.0242 0.9762 1.0465
+ 05_translation       0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+no_trans: gap_200_int_10, + do not detect translations at all, to avoid F-P
+0.4432 0.2959 0.9658 1.0310 2012-06-01 16:41 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
+ 03_artificial_low    0.7045 0.5806 0.9872 1.0530
+ 04_artificial_high   0.0457 0.0242 0.9762 1.0465
+ 05_translation       0.0000 0.0000 0.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+
+swng_unsorted with the same postprocessing as "whitespace" above
+0.2673 0.1584 0.9281 1.0174 2012-05-31 14:20 plagdt recall precis granul
+ 01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+ 02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
+ 03_artificial_low    0.3178 0.1952 0.9954 1.0377
+ 04_artificial_high   0.0169 0.0095 0.9581 1.1707
+ 05_translation       0.0042 0.0028 0.0080 1.0000
+ 06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
+
+swng_sorted
+0.2550 0.1906 0.4067 1.0253 2012-05-30 16:07 plagdt recall precis granul
+ 01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+ 02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
+ 03_artificial_low    0.4093 0.2867 0.8093 1.0483
+ 04_artificial_high   0.0454 0.0253 0.4371 1.0755
+ 05_translation       0.0030 0.0019 0.0064 1.0000
+ 06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
+
+sort_susp: gap_200_int_10 + in postprocessing I sort the intervals by offset in susp, not in src
+0.4437 0.2962 0.9676 1.0308 2012-06-01 18:06 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
+ 03_artificial_low    0.7048 0.5809 0.9871 1.0530
+ 04_artificial_high   0.0457 0.0242 0.9762 1.0465
+ 05_translation       0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+post_gap2_16000: sort_susp, + merge two intervals if it is < 16000 characters and the gap is only half the size of those intervals (was 4000)
+0.4539 0.2983 0.9642 1.0054 2012-06-01 18:09 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
+ 03_artificial_low    0.7307 0.5883 0.9814 1.0094
+ 04_artificial_high   0.0480 0.0247 0.9816 1.0078
+ 05_translation       0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
+
+post_gap2_32000: sort_susp, + merge intervals < 32000 characters and the gap at least half the size
+0.4543 0.2986 0.9638 1.0050 2012-06-01 18:12 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+ 03_artificial_low    0.7315 0.5893 0.9798 1.0085
+ 04_artificial_high   0.0480 0.0247 0.9816 1.0078
+ 05_translation       0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap2_64000: sort_susp, + merge intervals < 32000 characters and the gap at least half the size
+0.4543 0.2988 0.9616 1.0050 2012-06-01 18:21 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
+ 03_artificial_low    0.7316 0.5901 0.9782 1.0085
+ 04_artificial_high   0.0480 0.0247 0.9816 1.0078
+ 05_translation       0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap1_2000: post_gap2_32000, + join unconditionally things that have a gap below 2000 (was 600)
+0.4543 0.2986 0.9635 1.0050 2012-06-01 18:29 plagdt recall precis granul
+ 01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+ 03_artificial_low    0.7315 0.5895 0.9794 1.0085
+ 04_artificial_high   0.0480 0.0247 0.9816 1.0078
+ 05_translation       0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
-\begin{itemize}
-\item language detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item other possible features to use, especially for cross-lingual detection
-\item discussion of parameter settings
-\end{itemize}