From 38a297b9d74461a2c2396d8c020bb81c7bd54273 Mon Sep 17 00:00:00 2001
From: "Jan \"Yenya\" Kasprzak" <kas@fi.muni.cz>
Date: Tue, 14 Aug 2012 23:54:25 +0200
Subject: [PATCH 1/1] yenya: dalsi prace na clanku

TODO: veci co jsem nepouzil trochu zkratit
pridat diskusi plagdet jak pisu v poznamkach.
---
 yenya-detailed.tex | 210 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 152 insertions(+), 58 deletions(-)

diff --git a/yenya-detailed.tex b/yenya-detailed.tex
index 9f40346..46a2cd5 100755
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -39,72 +39,101 @@ merging in the post-processing stage.
 The algorithm evaluates the document pair in several stages:
 
 \begin{itemize}
-\item intrinsic plagiarism detection
-\item language detection of the source document
-\begin{itemize}
-\item cross-lingual plagiarism detection, if the source document is not in English
+\item tokenizing both the suspicious and source documents
+\item forming {\it features} from some tokens
+\item discovering {\it common features}
+\item making {\it valid intervals} from common features
+\item postprocessing
 \end{itemize}
-\item detecting intervals with common features
-\item post-processing phase, mainly serves for merging the nearby common intervals
+
+\subsection{Tokenization}
+
+We tokenize the document into words, where word is a sequence of one
+or more characters of the {\it Letter} Unicode class.
+With each word, two additional attributes, needed for further processing,
+are associated: the offset where the word begins, and the word length.
+
+The offset where the word begins is not necessarily the first letter character
+of the word itself. We have discovered that in the training corpus,
+some plagiarized passages were annotated including the preceding
+non-letter characters. We have used the following heuristics to add 
+parts of the inter-word gap to the previous, or the next adjacent word:
+
+\begin{itemize}
+\item when the inter-word gap contains interpunction (any of the dot,
+semicolon, colon, comma, exclamation mark, question mark, or quotes),
+add the characters up to, and including the interpunction, to the previous
+word, ignore the space character(s) after the interpunction, and add
+the rest to tne next word.
+\item otherwise, when the inter-word gap contains newline, add the character
+before the first newline to the previous word, ignore the first newline
+character, and add the rest to the next word.
+\item otherwise, ignore the inter-word gap characters altogether.
 \end{itemize}
 
-\subsection{Multi-feature Plagiarism Detection}
+When the detection program has been presented three different
+files instead of two (meaning the third one is machine-translated
+version of the second one), we have tokenized the translated document instead
+of the source one. We have make use of the line-by-line alignment of the
+source and machine-translated documents to transform the word offsets
+and lengths in the translated document to the terms of the source document.
 
-Our pair-wise plagiarism detection is based on finding common passages
-of text, present both in the source and in the suspicious document. We call them
-{\it common features}. In PAN 2010, we have used sorted word 5-grams, formed from
-words of three or more characters, as features to compare.
-Recently, other means of plagiarism detection have been explored:
-stopword $n$-gram detection is one of them
-\cite{stamatatos2011plagiarism}.
+\subsection{Features}
 
-We propose the plagiarism detection system based on detecting common
-features of various types, for example word $n$-grams, stopword $n$-grams,
-translated single words, translated word bigrams,
-exact common longer words from document pairs having each document
-in a different language, etc. The system
-has to be to the great extent independent of the specialities of various
-feature types. It cannot, for example, use the order of given features
-as a measure of distance between the features, as for example, several
-word 5-grams can be fully contained inside one stopword 8-gram.
-
-We therefore propose to describe the {\it common feature} of two documents
-(susp and src) with the following tuple:
-$\langle
-\hbox{offset}_{\hbox{susp}},
-\hbox{length}_{\hbox{susp}},
-\hbox{offset}_{\hbox{src}},
-\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
-described purely in terms of character offsets, belonging to the feature
-in both documents. In our final submission, we have used the following two types
-of common features:
+We have used features of two types:
 
 \begin{itemize}
-\item word 5-grams, from words of three or more characters, sorted, lowercased
-\item stopword 8-grams, from 50 most-frequent English words (including
-	the possessive suffix 's), unsorted, lowercased, with 8-grams formed
-	only from the seven most-frequent words ({\it the, of, a, in, to, 's})
-	removed
+\item Lexicographically sorted word 5-grams, formed of words at least
+three characters long, and
+\item unsorted stop-word 8-grams, formed from 50 most frequent words in English,
+as described in \cite{stamatatos2011plagiarism}. We have further ignored
+the 8-grams, formed solely from the six most frequent English words
+(the, of, and, a, in, to), or the possessive {\it'{}s}.
 \end{itemize}
 
-We have gathered all the common features of both types for a given document
-pair, and formed {\it valid intervals} from them, as described
-in \cite{Kasprzak2009a}. A similar approach is used also in
-\cite{stamatatos2011plagiarism}.
-The algorithm is modified for multi-feature detection to use character offsets
+We have represented each feature with the 32 highest-order bits of its
+MD5 digest. This is only a performance optimization, targeted for
+larger systems. The number of features in a document pair is several orders
+of magnitude lower than $2^{32}$, so the probability of hash function
+collision is low. For pair-wise comparison, it would be feasible to compare
+the features directly instead of their MD5 sums.
+
+Each feature has also the offset and length attributes.
+Offset is taken as the offset of the first word in a given feature,
+and length is the offset of the last character in a given feature
+minus the offset of the feature itself.
+
+\subsection{Common Features}
+
+For further processing, we have taken into account only the features
+present both in source and suspicious document. For each such
+{\it common feature}, we have created the list of
+$(\makebox{offset}, \makebox{length})$ pairs for the source document,
+and a similar list for the suspicious document. Note that a given feature
+can occur multiple times both in source and suspicious document.
+
+\subsection{Valid Intervals}
+
+To detect a plagiarized passage, we need to find a set of common features,
+which map to a dense-enough interval both in the source and suspicious
+document. In our previous work, we have presented the algorithm
+for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
+A similar approach is used also in \cite{stamatatos2011plagiarism}.
+Both of these algorithms use the features of a single type, which 
+allows to use the ordering of features as a measure of distance.
+
+When we use features of different types, there is no natural ordering
+of them: for example, a stop-word 8-gram can span multiple sentences,
+which can contain several word 5-grams. The assumption of both of the
+above algorithms, that the last character of the previous feature
+is before the last character of the current feature, is broken.
+
+We have modified the algorithm for computing valid intervals with
+multi-feature detection to use character offsets
 only instead of feature order numbers. We have used valid intervals
-consisting of at least 5 common features, with the maximum allowed gap
+consisting of at least 4 common features, with the maximum allowed gap
 inside the interval (characters not belonging to any common feature
-of a given valid interval) set to 3,500 characters.
-
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
+of a given valid interval) set to 4000 characters.
 
 \subsection{Postprocessing}
 
@@ -129,9 +158,37 @@ features per character in the adjacent interval was not more than three times
 bigger than number of features per character in the possible joined interval.
 \end{itemize}
 
-These parameters were fine-tuned to achieve the best results on the training corpus. With these parameters, our algorithm got the total plagdet score of 0.73 on the training corpus.
-
-\subsection{Other Approaches Tried}
+\subsection{Results}
+
+These parameters were fine-tuned to achieve the best results on the training
+corpus. With these parameters, our algorithm got the total plagdet score
+of 0.7288 on the training corpus. The details of the performance of
+our algorithm are presented in Table \ref{table-final}.
+In the PAN 2012 competition, we have acchieved the plagdet score
+of 0.6827, precision 0.8932, recall 0.5525, and granularity 1.0000.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+&plagdet&recall&precision&granularity\\
+\hline
+whole corpus&0.7288&0.5994&0.9306&1.0007\\
+\hline
+01-no-plagiarism    &0.0000&0.0000&0.0000&1.0000\\
+02-no-obfuscation   &0.9476&0.9627&0.9330&1.0000\\
+03-artificial-low   &0.8726&0.8099&0.9477&1.0013\\
+04-artificial-high  &0.3649&0.2255&0.9562&1.0000\\
+05-translation      &0.7610&0.6662&0.8884&1.0008\\
+06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
+\hline
+\end{tabular}
+\end{center}
+\caption{Performance on the training corpus}
+\label{table-final}
+\end{table}
+
+\subsection{Other Approaches Explored}
 
 There are several other approaches we have evaluated, but which were
 omitted from our final submission for various reasons. We think mentioning
@@ -161,6 +218,15 @@ nature of the detailed comparison sub-task, we couldn't use the results
 of the intrinsic detection immediately, therefore we wanted to use them
 as hints to the external detection.
 
+We have also experimented with modifying the allowed gap size using the
+intrinsic plagiarism detection: to allow only shorter gap if the common
+features around the gap belong to different passages, detected as plagiarized
+in the suspicious document by the intrinsic detector, and allow larger gap,
+if both the surrounding common features belong to the same passage,
+detected by the intrinsic detector. This approach, however, did not show
+any improvement against allowed gap of a static size, so it was omitted
+from the final submission.
+
 \subsubsection{Language Detection}
 
 For language detection, we used the $n$-gram based categorization \cite{ngram}.
@@ -198,6 +264,34 @@ line-by-line alignment of the source and translated document for calculating
 the offsets of text features in the source document. We have then treated
 the translated documents the same way as the source documents in English.
 
+\subsection{Performance Notes}
+
+We consider comparing the performance of PAN 2012 submissions almost
+meaningless, because any sane system would precompute features for all
+documents in a given set of suspicious and source documents, and use the
+results for pair-wise comparison, expecting that any document will take
+part in more than one pair.
+
+We did not use this exact split in our submission, but in order to be able
+to evaluate various approaches faster, we have split our computation into
+the following two parts: in the first part, common features have been
+computed, and the results stored into a file\footnote{We have use the
+{\tt Storable.pm} storage available in Perl.}. The second part
+then used this data to compute valid intervals and do post-processing.
+
+We did our development on a machine with four six-core AMD 8139 CPUs
+(2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
+on this host, and the second phase took 14 seconds. Computing the
+plagdet score using the official script in Python took between 120 and
+180 seconds, as there is no parallelism in this script.
+
+When we have tried to use intrinsic plagiarism detection and language
+detection, the first phase took about 12500 seconds. Thus omitting these
+featurs clearly provided huge performance improvement.
+
+The code has been written in Perl, and had about 669 lines of code,
+not counting comments and blank lines.
+
 \subsection{Further discussion}
 
 As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
-- 
2.43.5