yenya: applied Simon's comments

diff --git a/yenya-detailed.tex b/yenya-detailed.tex

old mode 100644 (file)

new mode 100755 (executable)

index f2dd93f..9493f75
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -1,36 +1,36 @@
-\section{Detailed Document Comparison}
+\section{Detailed Document Comparison}~\label{yenya}
  
  \label{detailed}
  
  The detailed comparison task of PAN 2012 consisted in a comparison
  of given document pairs, with the expected output being the annotation of
  similarities found between these documents.
-The submitted program has been run in a controlled environment
+The submitted program was running in a controlled environment
  separately for each document pair, without the possibility of keeping any
-data between runs.
+cached data between runs.
  
  
-In this section, we describe our approach in the detailed comparison
-task. The rest of this section is organized as follows: in the next
-subsection, we summarise the differences from our previous approach.
-In subsection \ref{sec-alg-overview}, we give an overview of our approach.
-TODO write up how it will turn out in the end.
+%In this section, we describe our approach in the detailed comparison
+%task. The rest of this section is organized as follows: in the next
+%subsection, we summarise the differences from our previous approach.
+%In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+%TODO write up how it will turn out in the end.
  
  \subsection{Differences Against PAN 2010}
  
  Our approach in this task
-is loosely based on the approach we have used in PAN 2010 \cite{Kasprzak2010}.
+is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
  The main difference is that instead of looking for similarities of
  one type (for PAN 2010, we have used word 5-grams),
-we have developed a method of evaluating multiple types of similarities
+we developed a method of evaluating multiple types of similarities
  (we call them {\it common features}) of different properties, such as
  density and length.
  
-As a proof of concept, we have used two types of common features: word
-5-grams and stop-word 8-grams, the later being based on the method described in
+As a proof of concept, we used two types of common features: word
+5-grams and stop word 8-grams, the latter being based on the method described in
  \cite{stamatatos2011plagiarism}.
  
-In addition to the above, we have made several minor improvements to the
-algorithm, such as parameter tuning and improving the detections
+In addition to the above, we made several minor improvements to the
+algorithm such as parameter tuning and improving the detections
  merging in the post-processing stage.
  
  \subsection{Algorithm Overview}
@@ -39,143 +39,201 @@ merging in the post-processing stage.
  The algorithm evaluates the document pair in several stages:
  
  \begin{itemize}
  \begin{itemize}
-\item intrinsic plagiarism detection
-\item language detection of the source document
+\item tokenizing both the suspicious and source documents
+\item forming {\it features} from some tokens
+\item discovering {\it common features}
+\item making {\it valid intervals} from common features
+\item postprocessing
+\end{itemize}
+
+\subsection{Tokenization}
+
+We tokenize the document into words, where a word is a sequence of one
+or more characters of the {\it Letter} Unicode class.
+Each word carries two additional attributes needed for further processing:
+the offset where the word begins, and the word length.
+
+The offset where the word begins is not necessarily the first letter character
+of the word itself. We discovered that in the training corpus
+some plagiarized passages were annotated including the preceding
+non-letter characters. We used the following heuristics to add 
+parts of the inter-word gap to the previous or the next adjacent word:
+
  \begin{itemize}
  \begin{itemize}
-\item cross-lingual plagiarism detection, if the source document is not in English
+\item When the inter-word gap contains punctuation (any of the dot,
+semicolon, colon, comma, exclamation mark, question mark, or quotes):
+\begin{itemize}
+\item add the characters up to and including the punctuation character
+to the previous word,
+\item ignore the space character(s) after the punctuation
+character,
+\item add the rest to the next word.
+\end{itemize}
+\item Otherwise, when the inter-word gap contains a newline:
+\begin{itemize}
+\item  add the character before the first newline to the previous word,
+\item ignore the first newline character,
+\item add the rest to the next word.
  \end{itemize}
  \end{itemize}
-\item detecting intervals with common features
-\item post-processing phase, mainly serves for merging the nearby common intervals
+\item Otherwise: ignore the inter-word gap characters altogether.
  \end{itemize}
  
  \end{itemize}
  
-\subsection{Multi-feature Plagiarism Detection}
+When the detection program was given three different
+files instead of two (meaning the third one is a machine-translated
+version of the second one), we tokenized the translated document instead
+of the source one. We used the line-by-line alignment of the
+source and machine-translated documents to map the word offsets
+and lengths in the translated document onto the source document.
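
The tokenization and gap heuristics described above can be sketched in Python as follows. This is a minimal sketch, not the submitted Perl code. The names `split_gap` and `tokenize` are hypothetical, and `str.isalpha()` stands in for the Letter Unicode class. We also assume that "quotes" covers both ' and ", that the first punctuation character in a gap is the one that decides the split, that everything before the first newline goes to the previous word, and that text before the first word and after the last word is ignored.

```python
PUNCTUATION = set(".;:,!?\"'")  # assumption: "quotes" covers both ' and "


def split_gap(gap):
    """Split an inter-word gap into (suffix of previous word, prefix of next word)."""
    for i, ch in enumerate(gap):
        if ch in PUNCTUATION:
            # Everything up to and including the punctuation goes to the previous
            # word; spaces right after it are dropped; the rest goes to the next word.
            return gap[:i + 1], gap[i + 1:].lstrip(" ")
    if "\n" in gap:
        i = gap.index("\n")
        # Text before the first newline goes to the previous word,
        # the newline itself is dropped, the rest goes to the next word.
        return gap[:i], gap[i + 1:]
    return "", ""  # plain whitespace gap: ignored altogether


def tokenize(text):
    """Return a list of (word, offset, length); offset and length include gap characters."""
    spans = []
    i = 0
    while i < len(text):
        if text[i].isalpha():
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            spans.append([i, j])
            i = j
        else:
            i += 1
    words = [text[a:b] for a, b in spans]
    for k in range(len(spans) - 1):
        gap = text[spans[k][1]:spans[k + 1][0]]
        to_prev, to_next = split_gap(gap)
        spans[k][1] += len(to_prev)
        spans[k + 1][0] -= len(to_next)
    return [(w, a, b - a) for w, (a, b) in zip(words, spans)]
```

Keeping the bare word separate from its extended offset and length lets later stages compare words exactly while still reporting passage boundaries that include adjacent punctuation.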
  
  
-Our pair-wise plagiarism detection is based on finding common passages
-of text, present both in the source and in the suspicious document. We call them
-{\it common features}. In PAN 2010, we have used sorted word 5-grams, formed from
-words of three or more characters, as features to compare.
-Recently, other means of plagiarism detection have been explored:
-stopword $n$-gram detection is one of them
-\cite{stamatatos2011plagiarism}.
+\subsection{Features}
  
  
-We propose the plagiarism detection system based on detecting common
-features of various types, for example word $n$-grams, stopword $n$-grams,
-translated single words, translated word bigrams,
-exact common longer words from document pairs having each document
-in a different language, etc. The system
-has to be to the great extent independent of the specialities of various
-feature types. It cannot, for example, use the order of given features
-as a measure of distance between the features, as for example, several
-word 5-grams can be fully contained inside one stopword 8-gram.
-
-We therefore propose to describe the {\it common feature} of two documents
-(susp and src) with the following tuple:
-$\langle
-\hbox{offset}_{\hbox{susp}},
-\hbox{length}_{\hbox{susp}},
-\hbox{offset}_{\hbox{src}},
-\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
-described purely in terms of character offsets, belonging to the feature
-in both documents. In our final submission, we have used the following two types
-of common features:
+We used features of two types:
  
  \begin{itemize}
  
  \begin{itemize}
-\item word 5-grams, from words of three or more characters, sorted, lowercased
-\item stopword 8-grams, from 50 most-frequent English words (including
-       the possessive suffix 's), unsorted, lowercased, with 8-grams formed
-       only from the seven most-frequent words ({\it the, of, a, in, to, 's})
-       removed
+\item Lexicographically sorted word 5-grams, formed of words at least
+three characters long.
+\item Unsorted stop word 8-grams, formed from the 50 most frequent words in English,
+as described in \cite{stamatatos2011plagiarism}. We further ignored
+8-grams formed solely from the six most frequent English words
+({\it the}, {\it of}, {\it and}, {\it a}, {\it in}, {\it to}), or the possessive {\it'{}s}.
  \end{itemize}
  
  \end{itemize}
  
-We have gathered all the common features of both types for a given document
-pair, and formed {\it valid intervals} from them, as described
-in \cite{Kasprzak2009a}. A similar approach is used also in
-\cite{stamatatos2011plagiarism}.
-The algorithm is modified for multi-feature detection to use character offsets
-only instead of feature order numbers. We have used valid intervals
-consisting of at least 5 common features, with the maximum allowed gap
+We represented each feature with the 32 highest-order bits of its
+MD5 digest. This is only a performance optimization targeted at
+larger systems. The number of features in a document pair is several orders
+of magnitude lower than $2^{32}$, thus the probability of hash function
+collision is low. For pair-wise comparison, it would be feasible to compare
+the features directly instead of their MD5 sums.
+
+Each feature also has two attributes: offset and length.
+Offset is taken as the offset of the first word in a given feature,
+and length is the offset of the last character in a given feature
+minus the offset of the feature itself.
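
As an illustration of the feature representation, the following Python sketch builds sorted word 5-grams and reduces each to the top 32 bits of its MD5 digest. It is a sketch under stated assumptions, not the submitted code. The function names are hypothetical. Joining the words with a single space before hashing, lowercasing them, and taking the first four digest bytes in big-endian order as the "32 highest-order bits" are our assumptions. The input is a list of (word, offset, length) tuples. Stop word 8-grams are left out because the stop word list is not reproduced here.

```python
import hashlib


def feature_hash(tokens):
    """Top 32 bits of the MD5 digest of the space-joined tokens (assumed encoding)."""
    digest = hashlib.md5(" ".join(tokens).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def word_5grams(words):
    """Yield (hash, offset, length) for sorted 5-grams of words with >= 3 letters."""
    long_words = [w for w in words if len(w[0]) >= 3]
    for i in range(len(long_words) - 4):
        gram = long_words[i:i + 5]
        # Sorting makes the feature insensitive to word order inside the 5-gram.
        key = sorted(w[0].lower() for w in gram)
        first, last = gram[0], gram[-1]
        # Length as defined in the text: offset of the last character of the
        # feature minus the offset of the feature itself.
        last_char = last[1] + last[2] - 1
        yield feature_hash(key), first[1], last_char - first[1]
```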
+
+\subsection{Common Features}
+
+For further processing, we took into account only the features
+present in both the source and the suspicious document. For each such
+{\it common feature}, we created the list of
+$(\makebox{offset}, \makebox{length})$ pairs for the source document,
+and a similar list for the suspicious document. Note that a given feature
+can occur multiple times in both the source and the suspicious document.
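
A minimal Python sketch of this step, assuming each document's features are available as (hash, offset, length) tuples. The names `index_features` and `common_features` are hypothetical.

```python
from collections import defaultdict


def index_features(features):
    """Map each feature hash to the list of its (offset, length) occurrences."""
    index = defaultdict(list)
    for feature_hash, offset, length in features:
        index[feature_hash].append((offset, length))
    return index


def common_features(susp_features, src_features):
    """Return {hash: (suspicious occurrences, source occurrences)} for shared hashes."""
    susp_index = index_features(susp_features)
    src_index = index_features(src_features)
    shared = susp_index.keys() & src_index.keys()
    return {h: (susp_index[h], src_index[h]) for h in shared}
```

Keeping a list per hash preserves repeated occurrences of the same feature in either document.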
+
+\subsection{Valid Intervals}
+
+To detect a plagiarized passage, we need to find a set of common features,
+which map to a dense-enough interval both in the source and suspicious
+document. In our previous work, we described the algorithm
+for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
+A similar approach is used also in \cite{stamatatos2011plagiarism}.
+Both of these algorithms use features of a single type, which
+allows the ordering of features to be used as a measure of distance.
+
+When we use features of different types, there is no natural ordering
+of them: for example, a stop word 8-gram can span multiple sentences,
+which can contain several word 5-grams. The assumption of both of the
+above algorithms, that the last character of the previous feature
+is before the last character of the current feature, is broken.
+
+We modified the algorithm for computing valid intervals with
+multi-feature detection to use character offsets
+only instead of feature order numbers. We used valid intervals
+consisting of at least 4 common features, with the maximum allowed gap
  inside the interval (characters not belonging to any common feature
-of a given valid interval) set to 3,500 characters.
-
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
+of a given valid interval) set to 4000 characters.
  
  \subsection{Postprocessing}
+\label{postprocessing}
  
  
-In the postprocessing phase, we took the resulting valid intervals,
-and made attempt to further improve the results. We have firstly
+In the postprocessing phase we took the resulting valid intervals
+and attempted to improve the results further. We first
  removed overlaps: if both overlapping intervals were
  shorter than 300 characters, we have removed both of them. Otherwise, we
  kept the longer detection (longer in terms of length in the suspicious document).
  
-We have then joined the adjacent valid intervals into one detection,
-if at least one of the following criteria has been met:
+We then joined the adjacent valid intervals into one detection,
+if at least one of the following criteria was met:
  \begin{itemize}
  \item the gap between the intervals contained at least 4 common features,
  and it contained at least one feature per 10,000
  characters\footnote{we have computed the length of the gap as the number
  of characters between the detections in the source document, plus the
  \begin{itemize}
  \item the gap between the intervals contained at least 4 common features,
  and it contained at least one feature per 10,000
  characters\footnote{we have computed the length of the gap as the number
  of characters between the detections in the source document, plus the
-number of charaters between the detections in the suspicious document.}, or
+number of characters between the detections in the suspicious document.}
  \item the gap was smaller than 30,000 characters and the size of the adjacent
-valid intervals was at least twice as big as the gap between them, or
+valid intervals was at least twice as big as the gap between them
  \item the gap was smaller than 30,000 characters and the number of common
  features per character in the adjacent interval was not more than three times
  bigger than number of features per character in the possible joined interval.
  \end{itemize}
  
  \end{itemize}
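
The three joining criteria can be written down as a single predicate. The Python sketch below is one possible reading, not the submitted code. The `Interval` class and `should_join` are hypothetical names. The gap is computed as in the footnote (source gap plus suspicious gap). Measuring interval size the same way, over both documents, and taking "the adjacent intervals" to mean the two intervals combined are our assumptions.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    susp_start: int
    susp_end: int
    src_start: int
    src_end: int
    features: int  # number of common features inside the interval

    @property
    def size(self):
        # Assumption: size is measured over both documents, like the gap.
        return (self.susp_end - self.susp_start) + (self.src_end - self.src_start)


def should_join(a, b, gap_features):
    """Decide whether adjacent intervals a and b (a before b) should be merged."""
    gap = max(0, b.susp_start - a.susp_end) + max(0, b.src_start - a.src_end)
    # Criterion 1: at least 4 features in the gap, at least one per 10,000 characters.
    if gap_features >= 4 and gap_features * 10_000 >= gap:
        return True
    if gap >= 30_000:
        return False
    adjacent_size = a.size + b.size
    # Criterion 2: the adjacent intervals are at least twice as big as the gap.
    if adjacent_size >= 2 * gap:
        return True
    if adjacent_size == 0:
        return False
    # Criterion 3: feature density drops at most threefold after joining.
    adjacent_density = (a.features + b.features) / adjacent_size
    joined_features = a.features + b.features + gap_features
    joined_density = joined_features / (adjacent_size + gap)
    return adjacent_density <= 3 * joined_density
```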
  
-These parameters were fine-tuned to achieve the best results on the training corpus. With these parameters, our algorithm got the total plagdet score of 0.73 on the training corpus.
-
-\subsection{Other Approaches Tried}
-
-There are several other approaches we have evaluated, but which were
+\subsection{Results}
+
+These parameters were fine-tuned to achieve the best results on the training
+corpus. With these parameters, our algorithm got the total plagdet score
+of 0.7288 on the training corpus. The details of the performance of
+our algorithm are presented in Table \ref{table-final}.
+In the PAN 2012 competition, we achieved the plagdet score
+of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+&plagdet&recall&precision&granularity\\
+\hline
+whole corpus&0.7288&0.5994&0.9306&1.0007\\
+\hline
+01-no-plagiarism    &0.0000&0.0000&0.0000&1.0000\\
+02-no-obfuscation   &0.9476&0.9627&0.9330&1.0000\\
+03-artificial-low   &0.8726&0.8099&0.9477&1.0013\\
+04-artificial-high  &0.3649&0.2255&0.9562&1.0000\\
+05-translation      &0.7610&0.6662&0.8884&1.0008\\
+06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
+\hline
+\end{tabular}
+\end{center}
+\caption{Performance on the training corpus}
+\label{table-final}
+\end{table}
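
For reference, the plagdet score in the PAN evaluation framework is the harmonic mean of precision and recall, divided by the base-2 logarithm of one plus the granularity. The short Python sketch below recomputes the reported scores from the other three columns. The function name `plagdet` is ours, and returning 0 when both precision and recall are 0 is an assumption about the degenerate case.

```python
import math


def plagdet(precision, recall, granularity):
    """F1 of precision and recall, penalized by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


print(round(plagdet(0.8932, 0.5525, 1.0000), 4))  # competition figures
print(round(plagdet(0.9306, 0.5994, 1.0007), 4))  # whole training corpus
```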
+
+\subsection{Other Approaches Explored}
+
+There are several other approaches we evaluated, but which were
  omitted from our final submission for various reasons. We think mentioning
-them here is worthwhile nevertheless.
+them here is worthwhile nevertheless:
  
  \subsubsection{Intrinsic Plagiarism Detection}
  
-Our approach is based on character $n$-gram profiles of the interval of
+We tested the approach based on character $n$-gram profiles of the interval of
  the fixed size (in terms of $n$-grams), and their differences to the
  profile of the whole document \cite{pan09stamatatos}. We have further
  enhanced the approach with using gaussian smoothing of the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and using the different measure of the difference between
-the n-gram profiles. We have used an approach similar to \cite{ngram},
-where we have compute the profile as an ordered set of 400 most-frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set, we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used the value inveresly
-proportional to the $n$-gram order in the profile, in accordance with
-the Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided more stable style-change function than
-than the one proposed in \cite{pan09stamatatos}. Because of pair-wise
-nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, therefore we wanted to use them
-as hints to the external detection.
-
-\subsubsection{Language Detection}
-
-For language detection, we used the $n$-gram based categorization \cite{ngram}.
-We have computed the language profiles from the source documents of the
-training corpus (using the annotations from the corpus itself). The result
-of this approach was better than using the stopwords-based detection we have
-used in PAN 2010. However, there were still mis-detected documents,
-mainly the long lists of surnames and other tabular data. We have added
-an ad-hoc fix, where for documents having their profile too distant from all of
-English, German, and Spanish profiles, we have declared them to be in English.
+function \cite{Kasprzak2010}. For PAN 2012, we made further improvements
+to the algorithm, resulting in a more stable style-change function in
+both short and long documents.
+
+We tried to use the results of the intrinsic plagiarism detection
+as a hint for the post-processing phase, allowing larger intervals
+to be merged if they both belong to the same passage detected by
+the intrinsic detector. This approach did not provide any improvement
+when compared to the static gap limits described in Section
+\ref{postprocessing}, so we omitted it from our final submission.
+
+%\subsubsection{Language Detection}
+%
+%For language detection, we used the $n$-gram based categorization \cite{ngram}.
+%We computed the language profiles from the source documents of the
+%training corpus (using the annotations from the corpus itself). The result
+%of this approach was better than using the stopwords-based detection we have
+%used in PAN 2010. However, there were still mis-detected documents,
+%mainly the long lists of surnames and other tabular data. We added
+%an ad-hoc fix, where for documents having their profile too distant from all of
+%English, German, and Spanish profiles, we declared them to be in English.
  
  \subsubsection{Cross-lingual Plagiarism Detection}
  
  For cross-lingual plagiarism detection, our aim was to use the public
-interface to Google translate if possible, and use the resulting document
+interface to Google Translate\footnote{\url{http://translate.google.com/}} if possible, and use the resulting document
  as the source for standard intra-lingual detector.
  Should the translation service not be available, we wanted
  to use the fall-back strategy of translating isolated words only,
@@ -184,46 +242,77 @@ with the additional exact matching of longer words (we have used words with
  We have supposed that these longer words can be names or specialized terms,
  present in both languages.
  
-We have used dictionaries from several sources, like
+We used dictionaries from several sources, for example
  {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
  {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
-and translated document were aligned on a line-by-line basis.
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
  
  
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document. We have then treated
-the translated documents the same way as the source documents in English.
-
-\subsection{Further discussion}
+In the final submission, we simply used the machine-translated texts,
+which were provided to the running program from the surrounding environment.
  
  
-As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
-detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
-when using it as a hint for the external plagiarism detection.
  
  
-In the full paper, we will also discuss the following topics:
-
-\begin{itemize}
-\item language detection and cross-language common features
-\item intrinsic plagiarism detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item discussion of parameter settings
-\end{itemize}
+\subsection{Further Discussion}
  
  
-\nocite{pan09stamatatos}
-\nocite{ngram}
+From our previous PAN submissions, we knew that the precision of our
+system was good, and because of the way the final score is computed, we
+wanted to trade slightly worse precision for better recall and granularity.
+So we pushed the parameters towards detecting more plagiarized passages,
+even when the number of common features was not especially high.
+
+\subsubsection{Plagdet Score}
+
+Our results from tuning the parameters show that the plagdet score~\cite{potthastframework}
+is not a good measure for comparing plagiarism detection systems:
+for example, the gap of 30,000 characters, described in Section \ref{postprocessing},
+can easily mean several pages of text. Yet the system with this
+parameter set so high achieved a better plagdet score.
+
+Another problem of plagdet can be
+seen in the 01-no-plagiarism part of the training corpus: the border
+between the perfect score 1 and the score 0 is a single false-positive
+detection. Plagdet does not distinguish between the system reporting this
+single false-positive, and the system reporting the whole data as plagiarized.
+Both get the score 0. However, our experience from real-world plagiarism detection systems shows that
+the plagiarized documents are in a clear minority, so the performance of
+the detection system on non-plagiarized documents is very important.
+
+\subsubsection{Performance Notes}
+
+We consider comparing the CPU-time performance of PAN 2012 submissions almost
+meaningless, because any sane system would precompute features for all
+documents in a given set of suspicious and source documents, and use the
+results for pair-wise comparison, expecting that any document will take
+part in more than one pair.
+
+Also, the pair-wise comparison without caching any intermediate results
+leads to worse overall performance: in our PAN 2010 submission, one of the
+post-processing steps was to remove all the overlapping detections
+from a given suspicious document, when these detections were from different
+source documents, and were short enough. This removed many false positives
+and improved the precision of our system. This kind of heuristic was
+not possible in PAN 2012.
+
+As for the performance of our system, we split the task into two parts:
+1. finding the common features, and 2. computing valid intervals and
+postprocessing. The first part is more CPU intensive, and the results
+can be cached. The second part is fast enough to allow us to evaluate
+many combinations of parameters.
+
+We did our development on a machine with four six-core AMD 8139 CPUs
+(2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
+on this host, and the second phase took 14 seconds. Computing the
+plagdet score using the official script in Python took between 120 and
+180 seconds, as there is no parallelism in this script.
+
+When we tried to use intrinsic plagiarism detection and language
+detection, the first phase took about 12500 seconds. Thus omitting these
+features clearly provided a huge performance improvement.
+
+The code was written in Perl, and had about 669 lines of code,
+not counting comments and blank lines.
  
  \endinput
  
-What I want to discuss in the conclusion:
-- it was not possible to cache data
-- it was not possible to exclude overlapping similarities
-- so the run-time figures are completely useless
-- 669 lines of code, not counting comments and blank lines
  - the boundary between passages sometimes included whitespace and sometimes did not.
  
  Plagdet discussion: