-\section{Detailed Document Comparison}
+\section{Detailed Document Comparison}
-\subsection{General Approach}
+\label{detailed}
-The approach Masaryk University team has used in PAN 2012 Plagiarism
-detection---detailed comparison sub-task is based on the same approach
-that we have used in PAN 2010 \cite{Kasprzak2010}. This time, we have
-used a similar approach, enhanced by several means
+The detailed comparison task of PAN 2012 consisted in a comparison
+of given document pairs, with the expected output being the annotation of
+similarities found between these documents.
+The submitted program has been run in a controlled environment
+separately for each document pair, without the possibility of keeping any
+data between runs.
+
+In this section, we describe our approach to the detailed comparison
+task. The rest of this section is organized as follows: in the next
+subsection, we summarise the differences from our previous approach.
+In subsection \ref{sec-alg-overview}, we give an overview of our approach,
+which we then describe in detail, together with the post-processing
+stage and the other approaches we have evaluated.
+
+\subsection{Differences Against PAN 2010}
+
+Our approach in this task
+is loosely based on the approach we have used in PAN 2010 \cite{Kasprzak2010}.
+The main difference is that instead of looking for similarities of
+a single type (in PAN 2010, we have used word 5-grams),
+we have developed a method of evaluating multiple types of similarities
+(we call them {\it common features}) with different properties, such as
+density and length.
+
+As a proof of concept, we have used two types of common features: word
+5-grams and stop-word 8-grams, the latter being based on the method described in
+\cite{stamatatos2011plagiarism}.
+
+In addition to the above, we have made several minor improvements to the
+algorithm, such as parameter tuning and improving the merging of
+detections in the post-processing stage.
+
+\subsection{Algorithm Overview}
+\label{sec-alg-overview}
The algorithm evaluates the document pair in several stages:
\begin{itemize}
\item tokenizing both the suspicious and the source document
\item computing the {\it common features} of the document pair
\item forming {\it valid intervals} from the common features
\item post-processing phase, which mainly serves for merging the nearby
valid intervals
\end{itemize}
-\subsection{Intrinsic plagiarism detection}
+\subsection{Multi-feature Plagiarism Detection}
+
+Our pair-wise plagiarism detection is based on finding common passages
+of text, present both in the source and in the suspicious document. We call them
+{\it common features}. In PAN 2010, we have used sorted word 5-grams, formed from
+words of three or more characters, as features to compare.
+Recently, other means of plagiarism detection have been explored:
+stopword $n$-gram detection is one of them
+\cite{stamatatos2011plagiarism}.
+
+We propose a plagiarism detection system based on detecting common
+features of various types, for example word $n$-grams, stopword $n$-grams,
+translated single words, translated word bigrams,
+exact common longer words from document pairs having each document
+in a different language, etc. The system
+has to be, to a great extent, independent of the particularities of the various
+feature types. It cannot, for example, use the order of the features
+as a measure of distance between them, since several
+word 5-grams can be fully contained inside one stopword 8-gram.
+
+We therefore propose to describe the {\it common feature} of two documents
+(susp and src) with the following tuple:
+$\langle
+\hbox{offset}_{\hbox{susp}},
+\hbox{length}_{\hbox{susp}},
+\hbox{offset}_{\hbox{src}},
+\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
+described purely in terms of the character offsets it occupies
+in both documents. In our final submission, we have used the following two types
+of common features:
+
+\begin{itemize}
+\item word 5-grams, from words of three or more characters, sorted, lowercased
+\item stopword 8-grams, from the 50 most-frequent English words (including
+ the possessive suffix 's), unsorted, lowercased, with 8-grams formed
+ only from the six most-frequent words ({\it the, of, a, in, to, 's})
+ removed
+\end{itemize}
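The two feature types above can be sketched as follows. This is only an illustration under stated assumptions: the tokenization regex, the truncated stop-word sets in the test, and the function names are hypothetical, not the submitted implementation. Each feature carries the character offset and length it covers, as required by the tuple representation.

```python
import re

def word_5grams(text):
    # Sorted, lowercased word 5-grams from words of 3+ characters,
    # each with the character offset and length it covers.
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"[\w']+", text) if len(m.group()) >= 3]
    feats = []
    for i in range(len(words) - 4):
        grp = words[i:i + 5]
        key = tuple(sorted(w for w, _, _ in grp))
        feats.append((key, grp[0][1], grp[-1][2] - grp[0][1]))
    return feats

def stopword_8grams(text, stopwords, most_frequent):
    # Unsorted, lowercased stop-word 8-grams; 8-grams consisting only
    # of the most frequent stop-words are dropped.
    hits = [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"[\w']+", text)
            if m.group().lower() in stopwords]
    feats = []
    for i in range(len(hits) - 7):
        grp = hits[i:i + 8]
        if all(w in most_frequent for w, _, _ in grp):
            continue  # formed only from the most frequent words
        feats.append((tuple(w for w, _, _ in grp),
                      grp[0][1], grp[-1][2] - grp[0][1]))
    return feats
```

Because both feature types reduce to (key, offset, length), the later stages never need to know which type a common feature came from.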
+
+We have gathered all the common features of both types for a given document
+pair, and formed {\it valid intervals} from them, as described
+in \cite{Kasprzak2009a}. A similar approach is used also in
+\cite{stamatatos2011plagiarism}.
+The algorithm is modified for multi-feature detection to use character offsets
+only instead of feature order numbers. We have used valid intervals
+consisting of at least 5 common features, with the maximum allowed gap
+inside the interval (characters not belonging to any common feature
+of a given valid interval) set to 3,500 characters.
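A greatly simplified, greedy sketch of the valid-interval construction described above (the actual algorithm of \cite{Kasprzak2009a} is more involved; this illustration assumes common features given as (offset susp, length susp, offset src, length src) tuples and chains them by suspicious-document order):

```python
def valid_intervals(features, min_features=5, max_gap=3500):
    # Chain common features sorted by suspicious-document offset;
    # break the chain when the character gap in either document
    # exceeds max_gap, and keep chains with enough features.
    intervals, chain = [], []
    for f in sorted(features):
        if chain:
            ps, pl, qs, ql = chain[-1]
            if f[0] - (ps + pl) > max_gap or f[2] - (qs + ql) > max_gap:
                if len(chain) >= min_features:
                    intervals.append(chain)
                chain = []
        chain.append(f)
    if len(chain) >= min_features:
        intervals.append(chain)
    return intervals
```

Working with character offsets only, rather than feature order numbers, is what allows features of different densities and lengths to be mixed in one interval.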
+
+We have also experimented with modifying the allowed gap size using the
+intrinsic plagiarism detection: allowing only a shorter gap if the common
+features around the gap belong to different passages, detected as plagiarized
+in the suspicious document by the intrinsic detector, and a larger gap
+if both the surrounding common features belong to the same passage
+detected by the intrinsic detector. This approach, however, did not show
+any improvement over an allowed gap of a static size, so it was omitted
+from the final submission.
+
+\subsection{Postprocessing}
+
+In the postprocessing phase, we took the resulting valid intervals
+and attempted to further improve the results. We first
+removed overlaps: if both overlapping intervals were
+shorter than 300 characters, we removed both of them. Otherwise, we
+kept the longer detection (longer in terms of length in the suspicious document).
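The overlap rule above can be sketched as follows (a simplified illustration, not the submitted code; detections are represented only by their offset and length in the suspicious document, and only adjacent pairs are compared):

```python
def resolve_overlaps(dets):
    # dets: (susp_offset, susp_length) pairs.
    dets = sorted(dets)
    keep = [True] * len(dets)
    for i in range(len(dets) - 1):
        (o1, l1), (o2, l2) = dets[i], dets[i + 1]
        if o2 < o1 + l1:  # overlap in the suspicious document
            if l1 < 300 and l2 < 300:
                keep[i] = keep[i + 1] = False  # both too short: drop both
            elif l1 >= l2:
                keep[i + 1] = False  # keep the longer detection
            else:
                keep[i] = False
    return [d for d, k in zip(dets, keep) if k]
```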
+
+We have then joined adjacent valid intervals into one detection
+if at least one of the following criteria was met:
+\begin{itemize}
+\item the gap between the intervals contained at least 4 common features,
+and it contained at least one feature per 10,000
+characters\footnote{We have computed the length of the gap as the number
+of characters between the detections in the source document, plus the
+number of characters between the detections in the suspicious document.}, or
+\item the gap was smaller than 30,000 characters and the size of the adjacent
+valid intervals was at least twice as big as the gap between them, or
+\item the gap was smaller than 30,000 characters, and the number of common
+features per character in the adjacent intervals was not more than three times
+bigger than the number of features per character in the possible joined interval.
+\end{itemize}
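The three joining criteria can be written down as a single decision function. This is a sketch under stated assumptions: the function name is hypothetical, `left` and `right` describe the adjacent valid intervals as (length in characters, number of common features), and `gap_chars` sums the gap over both documents as in the footnote.

```python
def should_join(gap_chars, gap_features, left, right):
    # criterion 1: the gap itself contains enough common features
    # (at least 4, and at least one per 10,000 characters)
    if gap_features >= 4 and gap_features * 10000 >= gap_chars:
        return True
    if gap_chars < 30000:
        # criterion 2: the adjacent intervals are at least
        # twice as big as the gap between them
        if left[0] + right[0] >= 2 * gap_chars:
            return True
        # criterion 3: joining does not dilute the feature
        # density by more than a factor of three
        joined_len = left[0] + gap_chars + right[0]
        joined_density = (left[1] + right[1]) / joined_len
        if all(n / length <= 3 * joined_density
               for length, n in (left, right)):
            return True
    return False
```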
+
+These parameters were fine-tuned to achieve the best results on the
+training corpus. With these parameters, our algorithm achieved a total
+plagdet score of 0.73 on the training corpus.
+
+\subsection{Other Approaches Tried}
+
+We have evaluated several other approaches, which were
+omitted from our final submission for various reasons. We nevertheless
+think they are worth mentioning here.
+
+\subsubsection{Intrinsic Plagiarism Detection}
Our approach is based on character $n$-gram profiles of intervals of
a fixed size (in terms of $n$-grams), and their differences to the
$n$-gram profile of the whole document. We have used an approach similar to \cite{ngram},
where we have computed the profile as an ordered set of the 400 most-frequent
$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set we have ignored the actual number of occurrences
+from ordering the set, we have ignored the actual number of occurrences
of a given $n$-gram altogether, and used a value inversely
proportional to the $n$-gram order in the profile, in accordance with
Zipf's law \cite{zipf1935psycho}.
This approach has provided a more stable style-change function than
the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, so we wanted to use them
+of the intrinsic detection immediately, therefore we wanted to use them
as hints to the external detection.
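A minimal sketch of the rank-weighted profile and a profile comparison. The choice of character trigrams and of the L1 distance between weighted profiles are illustrative assumptions; the text above fixes only the profile size (400) and the rank-based, Zipf-like weighting.

```python
from collections import Counter

def profile(text, n=3, size=400):
    # Rank-weighted character n-gram profile: keep the `size` most
    # frequent n-grams, weight each inversely to its rank, and discard
    # the actual occurrence counts (Zipf-like weighting).
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: 1.0 / rank
            for rank, (g, _) in enumerate(grams.most_common(size), start=1)}

def dissimilarity(p, q):
    # L1 distance between two profiles; a style-change function would
    # compare a sliding window's profile against the whole document's.
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))
```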
-\subsection{Cross-lingual detection}
+\subsubsection{Language Detection}
+
+For language detection, we used the $n$-gram based categorization \cite{ngram}.
+We have computed the language profiles from the source documents of the
+training corpus (using the annotations from the corpus itself). The result
+of this approach was better than using the stopwords-based detection we have
+used in PAN 2010. However, there were still mis-detected documents,
+mainly long lists of surnames and other tabular data. We have added
+an ad-hoc fix: documents whose profile was too distant from all of the
+English, German, and Spanish profiles were declared to be in English.
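The categorization scheme, including the ad-hoc English fallback, can be sketched as follows. The out-of-place measure between ranked profiles follows the $n$-gram categorization method of \cite{ngram}; the trigram size, profile size, and function names are illustrative assumptions.

```python
from collections import Counter

def ranked_profile(text, n=3, size=400):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    # Out-of-place measure between two ranked profiles: sum of rank
    # differences, with a fixed penalty for missing n-grams.
    pos = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - pos.get(g, penalty))
               for i, g in enumerate(doc_profile))

def detect_language(text, lang_profiles, max_distance):
    doc = ranked_profile(text)
    lang, dist = min(((l, out_of_place(doc, p))
                      for l, p in lang_profiles.items()),
                     key=lambda pair: pair[1])
    # ad-hoc fix: too distant from every known profile => English
    return lang if dist <= max_distance else "en"
```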
-%For language detection, we used the $n$-gram based categorization \cite{ngram}.
-%We have computed the language profiles from the source documents of the
-%training corpus (using the annotations from the corpus itself). The result
-%of this approach was better than using the stopwords-based detection we have
-%used in PAN 2010. However, there were still mis-detected documents,
-%mainly the long lists of surnames and other tabular data. We have added
-%an ad-hoc fix, where for documents having their profile too distant from all of
-%English, German, and Spanish profiles, we have declared them to be in English.
+\subsubsection{Cross-lingual Plagiarism Detection}
For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate if possible, and to use the resulting translated
document for the detection. When the translation service was not available,
we planned to use the fall-back strategy of translating isolated words only,
with the additional exact matching of longer words (we have used words with
5 characters or more).
-We have supposed these longer words can be names or specialized terms,
+We have supposed that these longer words can be names or specialized terms,
present in both languages.
We have used dictionaries from several sources, like
-{\tt dicts.info\footnote{\url{http://www.dicts.info/}}},
-{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}},
-and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source
+{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
+{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
and translated document were aligned on a line-by-line basis.
In the final form of the detailed comparison sub-task, the results of machine
translation were provided by the surrounding environment, so we have discarded the language detection
and machine translation from our submission altogether, and used only
line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document.
-
-\subsection{Multi-feature Plagiarism Detection}
+the offsets of text features in the source document. We have then treated
+the translated documents the same way as the source documents in English.
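The line-by-line alignment used for offset calculation can be sketched as follows (a hypothetical helper, not the submitted code: it maps a character offset in the translated document back to the start of the corresponding line of the source document):

```python
def line_offset_map(src_text, translated_text):
    def line_starts(text):
        # character offset of the start of each line
        starts, pos = [], 0
        for line in text.splitlines(keepends=True):
            starts.append(pos)
            pos += len(line)
        return starts
    src_starts = line_starts(src_text)
    trans_starts = line_starts(translated_text)

    def to_src_offset(offset):
        # find the translated line containing the offset, then map it
        # to the start of the aligned source line
        line_no = 0
        for i, start in enumerate(trans_starts):
            if start <= offset:
                line_no = i
        return src_starts[min(line_no, len(src_starts) - 1)]
    return to_src_offset
```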
-Our pair-wise plagiarism detection is based on finding common passages
-of text, present both in the source and suspicious document. We call them
-{\it features}. In PAN 2010, we have used sorted word 5-grams, formed from
-words of three or more characters, as features to compare.
-Recently, other means of plagiarism detection have been explored:
-Stop-word $n$-gram detection is one of them
-\cite{stamatatos2011plagiarism}.
-
-We propose the plagiarism detection system based on detecting common
-features of various type, like word $n$-grams, stopword $n$-grams,
-translated words or word bigrams, exact common longer words from document
-pairs having each document in a different language, etc. The system
-has to be to the great extent independent of the specialities of various
-feature types. It cannot, for example, use the order of given features
-as a measure of distance between the features, as for example, several
-word 5-grams can be fully contained inside one stopword 8-gram.
+\subsection{Further Discussion}
-We thus define {\it common feature} of two documents (susp and src)
-as the following tuple:
-$$\langle
-\hbox{offset}_{\hbox{susp}},
-\hbox{length}_{\hbox{susp}},
-\hbox{offset}_{\hbox{src}},
-\hbox{length}_{\hbox{src}} \rangle$$
+As in our PAN 2010 submission, we tried to make use of the intrinsic
+plagiarism detection, but despite further improvements to the intrinsic
+detector, we have again failed to reach any significant improvement
+when using it as a hint for the external plagiarism detection.
-In our final submission, we have used only the following two types
-of common features:
+In the full paper, we will also discuss the following topics:
\begin{itemize}
-\item word 5-grams, from words of three or more characters, sorted, lowercased
-\item stop-word 8-grams, from 50 most-frequent English words (including
- the possessive suffix 's), unsorted, lowercased, with 8-grams formed
- only from the seven most-frequent words ({\it the, of, a, in, to, 's})
- removed
+\item language detection and cross-language common features
+\item intrinsic plagiarism detection
+\item suitability of the plagdet score~\cite{potthastframework} for performance measurement
+\item feasibility of our approach in large-scale systems
+\item discussion of parameter settings
\end{itemize}
-We have gathered all the common features for a given document pair, and formed
-{\it valid intervals} from them, as described in \cite{Kasprzak2009a}
-(a similar approach is used also in \cite{stamatatos2011plagiarism}).
-The algorithm is modified for multi-feature detection to use character offsets
-only instead of feature order numbers. We have used valid intervals
-consisting of at least 5 common features, with the maximum allowed gap
-inside the interval (characters not belonging to any common feature
-of a given valid interval) set to 3,500 characters.
+\nocite{pan09stamatatos}
+\nocite{ngram}
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
+\endinput
-\subsection{Postprocessing}
+What I want to discuss in the conclusion:
+- it was not possible to cache data between runs
+- it was not possible to exclude overlapping similarities
+- so the run-time figures are completely useless
+- 669 lines of code, not counting comments and blank lines
+- the boundary between passages sometimes included whitespace and sometimes not.
+Discussion of plagdet:
+- users want "the submission to show 0\% similarity"; they do not care
+ what the number means
+- the exact boundaries of the detected passage do not matter
+- false positives are far worse
+- granularity is evil
-\subsection{Further discussion}
+Final results on the test corpus:
-In the full paper, we will also discuss the following topics:
+0.7288 0.5994 0.9306 1.0007 2012-06-16 02:23 plagdt recall precis granul
+ 01-no-plagiarism 0.0000 0.0000 0.0000 1.0000
+ 02-no-obfuscation 0.9476 0.9627 0.9330 1.0000
+ 03-artificial-low 0.8726 0.8099 0.9477 1.0013
+ 04-artificial-high 0.3649 0.2255 0.9562 1.0000
+ 05-translation 0.7610 0.6662 0.8884 1.0008
+ 06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
+
+Results on the competition data:
+plagdet precision recall granularity
+0.6826726 0.8931670 0.5524708 1.0000000
+
+Run-time:
+12500 seconds: tokenization including sc and language detection
+2500 seconds: without sc and language detection
+14 seconds: evaluation of valid intervals and postprocessing
+
+
+TODO:
+- interval boundary according to matching density
+- sort the xml by this_offset
+
+Here is the content of the JOURNAL file - how I measured some of the improvements:
+=================================================================
+baseline.py
+0.1250 0.1259 0.9783 2.4460 2012-05-03 06:02 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.8608 0.8609 0.8618 1.0009
+ 03_artificial_low 0.1006 0.1118 0.9979 2.9974
+ 04_artificial_high 0.0054 0.0029 0.9991 1.0778
+ 05_translation 0.0003 0.0002 1.0000 1.2143
+ 06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
+
+valid_intervals without postprocessing (this is how I submitted it the first time)
+0.3183 0.2034 0.9883 1.0850 2012-05-25 15:25 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9861 0.9973 0.9752 1.0000
+ 03_artificial_low 0.4127 0.3006 0.9975 1.1724
+ 04_artificial_high 0.0008 0.0004 1.0000 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
+
+postprocessed (merging of nearby intervals)
+0.3350 0.2051 0.9863 1.0188 2012-05-25 15:27 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9863 0.9973 0.9755 1.0000
+ 03_artificial_low 0.4541 0.3057 0.9942 1.0417
+ 04_artificial_high 0.0008 0.0004 1.0000 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
+
+whitespace (whitespace handling adjustments)
+0.3353 0.2053 0.9858 1.0188 2012-05-31 17:57 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9865 0.9987 0.9745 1.0000
+ 03_artificial_low 0.4546 0.3061 0.9940 1.0417
+ 04_artificial_high 0.0008 0.0004 1.0000 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
+
+gap_100: whitespace, + allowing a gap of 100 5-grams instead of 50 within a valid interval
+0.3696 0.2305 0.9838 1.0148 2012-05-31 18:07 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9850 0.9987 0.9717 1.0000
+ 03_artificial_low 0.5423 0.3846 0.9922 1.0310
+ 04_artificial_high 0.0058 0.0029 0.9151 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
+
+gap_200: whitespace, + allowing a gap of 200 5-grams instead of 50 within a valid interval
+0.3906 0.2456 0.9769 1.0070 2012-05-31 18:09 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9820 0.9987 0.9659 1.0000
+ 03_artificial_low 0.5976 0.4346 0.9875 1.0139
+ 04_artificial_high 0.0087 0.0044 0.9374 1.0000
+ 05_translation 0.0001 0.0001 1.0000 1.0000
+ 06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
+
+gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
+0.4436 0.2962 0.9660 1.0308 2012-05-31 18:11 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9612 0.9987 0.9264 1.0000
+ 03_artificial_low 0.7048 0.5808 0.9873 1.0530
+ 04_artificial_high 0.0457 0.0242 0.9762 1.0465
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
+0.4432 0.2959 0.9658 1.0310 2012-06-01 16:41 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9608 0.9980 0.9263 1.0000
+ 03_artificial_low 0.7045 0.5806 0.9872 1.0530
+ 04_artificial_high 0.0457 0.0242 0.9762 1.0465
+ 05_translation 0.0000 0.0000 0.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+
+swng_unsorted with the same postprocessing as "whitespace" above
+0.2673 0.1584 0.9281 1.0174 2012-05-31 14:20 plagdt recall precis granul
+ 01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
+ 02_no_obfuscation 0.9439 0.9059 0.9851 1.0000
+ 03_artificial_low 0.3178 0.1952 0.9954 1.0377
+ 04_artificial_high 0.0169 0.0095 0.9581 1.1707
+ 05_translation 0.0042 0.0028 0.0080 1.0000
+ 06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
+
+swng_sorted
+0.2550 0.1906 0.4067 1.0253 2012-05-30 16:07 plagdt recall precis granul
+ 01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
+ 02_no_obfuscation 0.6648 0.9146 0.5222 1.0000
+ 03_artificial_low 0.4093 0.2867 0.8093 1.0483
+ 04_artificial_high 0.0454 0.0253 0.4371 1.0755
+ 05_translation 0.0030 0.0019 0.0064 1.0000
+ 06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
+
+sort_susp: gap_200_int_10 + in postprocessing, sorting the intervals by offset in susp instead of src
+0.4437 0.2962 0.9676 1.0308 2012-06-01 18:06 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9641 0.9987 0.9317 1.0000
+ 03_artificial_low 0.7048 0.5809 0.9871 1.0530
+ 04_artificial_high 0.0457 0.0242 0.9762 1.0465
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+post_gap2_16000: sort_susp, + merge two intervals if < 16000 characters and the gap is only half the size of those intervals (was 4000)
+0.4539 0.2983 0.9642 1.0054 2012-06-01 18:09 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9631 0.9987 0.9300 1.0000
+ 03_artificial_low 0.7307 0.5883 0.9814 1.0094
+ 04_artificial_high 0.0480 0.0247 0.9816 1.0078
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
+
+post_gap2_32000: sort_susp, + merge intervals < 32000 characters with a gap of at most half their size
+0.4543 0.2986 0.9638 1.0050 2012-06-01 18:12 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9628 0.9987 0.9294 1.0000
+ 03_artificial_low 0.7315 0.5893 0.9798 1.0085
+ 04_artificial_high 0.0480 0.0247 0.9816 1.0078
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap2_64000: sort_susp, + merge intervals < 32000 characters with a gap of at most half their size
+0.4543 0.2988 0.9616 1.0050 2012-06-01 18:21 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9603 0.9987 0.9248 1.0000
+ 03_artificial_low 0.7316 0.5901 0.9782 1.0085
+ 04_artificial_high 0.0480 0.0247 0.9816 1.0078
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap1_2000: post_gap2_32000, + unconditionally join detections with a gap below 2000 (was 600)
+0.4543 0.2986 0.9635 1.0050 2012-06-01 18:29 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9628 0.9987 0.9294 1.0000
+ 03_artificial_low 0.7315 0.5895 0.9794 1.0085
+ 04_artificial_high 0.0480 0.0247 0.9816 1.0078
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
-\begin{itemize}
-\item language detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item other possible features to use, especially for cross-lingual detection
-\item discussion of parameter settings
-\end{itemize}