Prvni velky proofreading + dodelavky

[pan12-paper.git] / yenya-detailed.tex
diff --git a/yenya-detailed.tex b/yenya-detailed.tex

index 9f4034655fbc22a8eaddbddff11839aa70f50541..492b38acb356eb7a2f7b5f88bd9d0756e5c2b7e5 100755 (executable)
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -5,31 +5,31 @@
  The detailed comparison task of PAN 2012 consisted in a comparison
  of given document pairs, with the expected output being the annotation of
  similarities found between these documents.
-The submitted program has been run in a controlled environment
+The submitted program was running in a controlled environment
  separately for each document pair, without the possibility of keeping any
-data between runs.
+cached data between runs.
  
-In this section, we describe our approach in the detailed comparison
-task. The rest of this section is organized as follows: in the next
-subsection, we summarise the differences from our previous approach.
-In subsection \ref{sec-alg-overview}, we give an overview of our approach.
-TODO napsat jak to nakonec bude.
+%In this section, we describe our approach in the detailed comparison
+%task. The rest of this section is organized as follows: in the next
+%subsection, we summarise the differences from our previous approach.
+%In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+%TODO napsat jak to nakonec bude.
  
  \subsection{Differences Against PAN 2010}
  
  Our approach in this task
-is loosely based on the approach we have used in PAN 2010 \cite{Kasprzak2010}.
+is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
  The main difference is that instead of looking for similarities of
  one type (for PAN 2010, we have used word 5-grams),
-we have developed a method of evaluating multiple types of similarities
+we developed a method of evaluating multiple types of similarities
  (we call them {\it common features}) of different properties, such as
  density and length.
  
-As a proof of concept, we have used two types of common features: word
+As a proof of concept, we used two types of common features: word
  5-grams and stop-word 8-grams, the later being based on the method described in
  \cite{stamatatos2011plagiarism}.
  
-In addition to the above, we have made several minor improvements to the
+In addition to the above, we made several minor improvements to the
  algorithm, such as parameter tuning and improving the detections
  merging in the post-processing stage.
  
@@ -39,83 +39,113 @@ merging in the post-processing stage.
  The algorithm evaluates the document pair in several stages:
  
  \begin{itemize}
-\item intrinsic plagiarism detection
-\item language detection of the source document
-\begin{itemize}
-\item cross-lingual plagiarism detection, if the source document is not in English
+\item tokenizing both the suspicious and source documents
+\item forming {\it features} from some tokens
+\item discovering {\it common features}
+\item making {\it valid intervals} from common features
+\item postprocessing
  \end{itemize}
-\item detecting intervals with common features
-\item post-processing phase, mainly serves for merging the nearby common intervals
+
+\subsection{Tokenization}
+
+We tokenize the document into words, where word is a sequence of one
+or more characters of the {\it Letter} Unicode class.
+With each word, two additional attributes, needed for further processing,
+are associated: the offset where the word begins, and the word length.
+
+The offset where the word begins is not necessarily the first letter character
+of the word itself. We discovered that in the training corpus,
+some plagiarized passages were annotated including the preceding
+non-letter characters. We used the following heuristics to add 
+parts of the inter-word gap to the previous, or the next adjacent word:
+
+\begin{itemize}
+\item When the inter-word gap contains interpunction (any of the dot,
+semicolon, colon, comma, exclamation mark, question mark, or quotes),
+add the characters up to, and including the interpunction, to the previous
+word, ignore the space character(s) after the interpunction, and add
+the rest to the next word.
+\item Otherwise, when the inter-word gap contains newline, add the character
+before the first newline to the previous word, ignore the first newline
+character, and add the rest to the next word.
+\item otherwise, ignore the inter-word gap characters altogether.
  \end{itemize}
  
-\subsection{Multi-feature Plagiarism Detection}
+When the detection program was given three different
+files instead of two (meaning the third one is machine-translated
+version of the second one), we tokenized the translated document instead
+of the source one. We used the line-by-line alignment of the
+source and machine-translated documents to transform the word offsets
+and lengths in the translated document to the terms of the source document.
  
-Our pair-wise plagiarism detection is based on finding common passages
-of text, present both in the source and in the suspicious document. We call them
-{\it common features}. In PAN 2010, we have used sorted word 5-grams, formed from
-words of three or more characters, as features to compare.
-Recently, other means of plagiarism detection have been explored:
-stopword $n$-gram detection is one of them
-\cite{stamatatos2011plagiarism}.
+\subsection{Features}
  
-We propose the plagiarism detection system based on detecting common
-features of various types, for example word $n$-grams, stopword $n$-grams,
-translated single words, translated word bigrams,
-exact common longer words from document pairs having each document
-in a different language, etc. The system
-has to be to the great extent independent of the specialities of various
-feature types. It cannot, for example, use the order of given features
-as a measure of distance between the features, as for example, several
-word 5-grams can be fully contained inside one stopword 8-gram.
-
-We therefore propose to describe the {\it common feature} of two documents
-(susp and src) with the following tuple:
-$\langle
-\hbox{offset}_{\hbox{susp}},
-\hbox{length}_{\hbox{susp}},
-\hbox{offset}_{\hbox{src}},
-\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
-described purely in terms of character offsets, belonging to the feature
-in both documents. In our final submission, we have used the following two types
-of common features:
+We have used features of two types:
  
  \begin{itemize}
-\item word 5-grams, from words of three or more characters, sorted, lowercased
-\item stopword 8-grams, from 50 most-frequent English words (including
-       the possessive suffix 's), unsorted, lowercased, with 8-grams formed
-       only from the seven most-frequent words ({\it the, of, a, in, to, 's})
-       removed
+\item Lexicographically sorted word 5-grams, formed of words at least
+three characters long, and
+\item unsorted stop-word 8-grams, formed from 50 most frequent words in English,
+as described in \cite{stamatatos2011plagiarism}. We have further ignored
+the 8-grams, formed solely from the six most frequent English words
+(the, of, and, a, in, to), or the possessive {\it'{}s}.
  \end{itemize}
  
-We have gathered all the common features of both types for a given document
-pair, and formed {\it valid intervals} from them, as described
-in \cite{Kasprzak2009a}. A similar approach is used also in
-\cite{stamatatos2011plagiarism}.
-The algorithm is modified for multi-feature detection to use character offsets
-only instead of feature order numbers. We have used valid intervals
-consisting of at least 5 common features, with the maximum allowed gap
+We represented each feature with the 32 highest-order bits of its
+MD5 digest. This is only a performance optimization, targeted for
+larger systems. The number of features in a document pair is several orders
+of magnitude lower than $2^{32}$, so the probability of hash function
+collision is low. For pair-wise comparison, it would be feasible to compare
+the features directly instead of their MD5 sums.
+
+Each feature has also the offset and length attributes.
+Offset is taken as the offset of the first word in a given feature,
+and length is the offset of the last character in a given feature
+minus the offset of the feature itself.
+
+\subsection{Common Features}
+
+For further processing, we took into account only the features
+present both in source and suspicious document. For each such
+{\it common feature}, we created the list of
+$(\makebox{offset}, \makebox{length})$ pairs for the source document,
+and a similar list for the suspicious document. Note that a given feature
+can occur multiple times both in source and suspicious document.
+
+\subsection{Valid Intervals}
+
+To detect a plagiarized passage, we need to find a set of common features,
+which map to a dense-enough interval both in the source and suspicious
+document. In our previous work, we described the algorithm
+for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
+A similar approach is used also in \cite{stamatatos2011plagiarism}.
+Both of these algorithms use features of a single type, which 
+allows to use the ordering of features as a measure of distance.
+
+When we use features of different types, there is no natural ordering
+of them: for example, a stop-word 8-gram can span multiple sentences,
+which can contain several word 5-grams. The assumption of both of the
+above algorithms, that the last character of the previous feature
+is before the last character of the current feature, is broken.
+
+We modified the algorithm for computing valid intervals with
+multi-feature detection to use character offsets
+only instead of feature order numbers. We used valid intervals
+consisting of at least 4 common features, with the maximum allowed gap
  inside the interval (characters not belonging to any common feature
-of a given valid interval) set to 3,500 characters.
-
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
+of a given valid interval) set to 4000 characters.
  
  \subsection{Postprocessing}
+\label{postprocessing}
  
  In the postprocessing phase, we took the resulting valid intervals,
-and made attempt to further improve the results. We have firstly
+and made attempt to further improve the results. We firstly
  removed overlaps: if both overlapping intervals were
  shorter than 300 characters, we have removed both of them. Otherwise, we
  kept the longer detection (longer in terms of length in the suspicious document).
  
-We have then joined the adjacent valid intervals into one detection,
-if at least one of the following criteria has been met:
+We then joined the adjacent valid intervals into one detection,
+if at least one of the following criteria were met:
  \begin{itemize}
  \item the gap between the intervals contained at least 4 common features,
  and it contained at least one feature per 10,000
@@ -129,48 +159,69 @@ features per character in the adjacent interval was not more than three times
  bigger than number of features per character in the possible joined interval.
  \end{itemize}
  
-These parameters were fine-tuned to achieve the best results on the training corpus. With these parameters, our algorithm got the total plagdet score of 0.73 on the training corpus.
-
-\subsection{Other Approaches Tried}
-
-There are several other approaches we have evaluated, but which were
+\subsection{Results}
+
+These parameters were fine-tuned to achieve the best results on the training
+corpus. With these parameters, our algorithm got the total plagdet score
+of 0.7288 on the training corpus. The details of the performance of
+our algorithm are presented in Table \ref{table-final}.
+In the PAN 2012 competition, we have acchieved the plagdet score
+of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+&plagdet&recall&precision&granularity\\
+\hline
+whole corpus&0.7288&0.5994&0.9306&1.0007\\
+\hline
+01-no-plagiarism    &0.0000&0.0000&0.0000&1.0000\\
+02-no-obfuscation   &0.9476&0.9627&0.9330&1.0000\\
+03-artificial-low   &0.8726&0.8099&0.9477&1.0013\\
+04-artificial-high  &0.3649&0.2255&0.9562&1.0000\\
+05-translation      &0.7610&0.6662&0.8884&1.0008\\
+06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
+\hline
+\end{tabular}
+\end{center}
+\caption{Performance on the training corpus}
+\label{table-final}
+\end{table}
+
+\subsection{Other Approaches Explored}
+
+There are several other approaches we evaluated, but which were
  omitted from our final submission for various reasons. We think mentioning
-them here is worthwhile nevertheless.
+them here is worthwhile nevertheless:
  
  \subsubsection{Intrinsic Plagiarism Detection}
  
-Our approach is based on character $n$-gram profiles of the interval of
+We tested the approach based on character $n$-gram profiles of the interval of
  the fixed size (in terms of $n$-grams), and their differences to the
  profile of the whole document \cite{pan09stamatatos}. We have further
  enhanced the approach with using gaussian smoothing of the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and using the different measure of the difference between
-the n-gram profiles. We have used an approach similar to \cite{ngram},
-where we have compute the profile as an ordered set of 400 most-frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set, we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used the value inveresly
-proportional to the $n$-gram order in the profile, in accordance with
-the Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided more stable style-change function than
-than the one proposed in \cite{pan09stamatatos}. Because of pair-wise
-nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, therefore we wanted to use them
-as hints to the external detection.
-
-\subsubsection{Language Detection}
-
-For language detection, we used the $n$-gram based categorization \cite{ngram}.
-We have computed the language profiles from the source documents of the
-training corpus (using the annotations from the corpus itself). The result
-of this approach was better than using the stopwords-based detection we have
-used in PAN 2010. However, there were still mis-detected documents,
-mainly the long lists of surnames and other tabular data. We have added
-an ad-hoc fix, where for documents having their profile too distant from all of
-English, German, and Spanish profiles, we have declared them to be in English.
+function \cite{Kasprzak2010}. For PAN 2012, we made further improvements
+to the algorithm, resulting in more stable style change function in
+both short and long documents.
+
+We tried to use the results of the intrinsic plagiarism detection
+as hint for the post-processing phase, allowing to merge larger
+intervals, if they both belong to the same passage, detected by
+the intrinsic detector. This approach did not provide improvement
+when compared to the static gap limits, as described in Section
+\ref{postprocessing}, so we have omitted it from our final submission.
+
+%\subsubsection{Language Detection}
+%
+%For language detection, we used the $n$-gram based categorization \cite{ngram}.
+%We computed the language profiles from the source documents of the
+%training corpus (using the annotations from the corpus itself). The result
+%of this approach was better than using the stopwords-based detection we have
+%used in PAN 2010. However, there were still mis-detected documents,
+%mainly the long lists of surnames and other tabular data. We added
+%an ad-hoc fix, where for documents having their profile too distant from all of
+%English, German, and Spanish profiles, we declared them to be in English.
  
  \subsubsection{Cross-lingual Plagiarism Detection}
  
@@ -184,46 +235,77 @@ with the additional exact matching of longer words (we have used words with
  We have supposed that these longer words can be names or specialized terms,
  present in both languages.
  
-We have used dictionaries from several sources, like
+We used dictionaries from several sources, for example
  {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
  {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
-and translated document were aligned on a line-by-line basis.
-
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document. We have then treated
-the translated documents the same way as the source documents in English.
-
-\subsection{Further discussion}
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
  
-As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
-detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
-when using it as a hint for the external plagiarism detection.
+In the final submission, we simply used the machine translated texts,
+which were provided to the running program from the surrounding environment.
  
-In the full paper, we will also discuss the following topics:
  
-\begin{itemize}
-\item language detection and cross-language common features
-\item intrinsic plagiarism detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item discussion of parameter settings
-\end{itemize}
+\subsection{Further discussion}
  
-\nocite{pan09stamatatos}
-\nocite{ngram}
+From our previous PAN submissions, we knew that the precision of our
+system was good, and because of the way how the final score is computed, we
+wanted to exchange a bit worse precision for better recall and granularity.
+So we pushed the parameters towards detecting more plagiarized passages,
+even when the number of common features was not especially high.
+
+\subsubsection{Plagdet score}
+
+Our results from tuning the parameters show that the plagdet score\cite{potthastframework}
+is not a good measure for comparing the plagiarism detection systems:
+for example, the gap of 30,000 characters, described in Section \ref{postprocessing},
+can easily mean several pages of text. And still the system with this
+parameter set so high resulted in better plagdet score.
+
+Another problem of plagdet can be
+seen in the 01-no-plagiarism part of the training corpus: the border
+between the perfect score 1 and the score 0 is a single false-positive
+detection. Plagdet does not distinguish between the system reporting this
+single false-positive, and the system reporting the whole data as plagiarized.
+Both get the score 0. However, our experience from real-world plagiarism detection systems show that
+the plagiarized documents are in a clear minority, so the performance of
+the detection system on non-plagiarized documents is very important.
+
+\subsubsection{Performance Notes}
+
+We consider comparing the CPU-time performance of PAN 2012 submissions almost
+meaningless, because any sane system would precompute features for all
+documents in a given set of suspicious and source documents, and use the
+results for pair-wise comparison, expecting that any document will take
+part in more than one pair.
+
+Also, the pair-wise comparison without caching any intermediate results
+lead to worse overall performance: in our PAN 2010 submission, one of the
+post-processing steps was to remove all the overlapping detections
+from a given suspicious documents, when these detections were from different
+source doucments, and were short enough. This removed many false-positives
+and improved the precision of our system. This kind of heuristics was
+not possible in PAN 2012.
+
+As for the performance of our system, we split the task into two parts:
+1. finding the common features, and 2. computing valid intervals and
+postprocessing. The first part is more CPU intensive, and the results
+can be cached. The second part is fast enough to allow us to evaluate
+many combinations of parameters.
+
+We did our development on a machine with four six-core AMD 8139 CPUs
+(2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
+on this host, and the second phase took 14 seconds. Computing the
+plagdet score using the official script in Python took between 120 and
+180 seconds, as there is no parallelism in this script.
+
+When we tried to use intrinsic plagiarism detection and language
+detection, the first phase took about 12500 seconds. Thus omitting these
+featurs clearly provided huge performance improvement.
+
+The code was written in Perl, and had about 669 lines of code,
+not counting comments and blank lines.
  
  \endinput
  
-Co chci diskutovat v zaveru:
-- nebylo mozno cachovat data
-- nebylo mozno vylucovat prekryvajici se podobnosti
-- cili udaje o run-time jsou uplne nahouby
-- 669 radku kodu bez komentaru a prazdnych radku
  - hranice mezi pasazema nekdy zahrnovala whitespace a nekdy ne.
  
  Diskuse plagdet: