\section{Detailed Document Comparison}

\subsection{General Approach}

The approach the Masaryk University team used for the detailed
comparison sub-task of the PAN 2012 Plagiarism Detection task is based
on the approach we used in PAN 2010 \cite{Kasprzak2010}, enhanced in
several ways.

The algorithm evaluates the document pair in several stages:

\begin{itemize}
\item intrinsic plagiarism detection
\item language detection of the source document
\begin{itemize}
\item cross-lingual plagiarism detection, if the source document is not in English
\end{itemize}
\item detecting intervals with common features
\item a post-processing phase, which mainly serves to merge nearby common intervals
\end{itemize}

\subsection{Intrinsic plagiarism detection}

Our approach is based on character $n$-gram profiles of windows of a
fixed size (in terms of $n$-grams), and their differences from the
profile of the whole document \cite{pan09stamatatos}. We have further
enhanced the approach by applying Gaussian smoothing to the
style-change function \cite{Kasprzak2010}.

For PAN 2012, we have experimented with using 1-, 2-, and 3-grams
instead of only 3-grams, and with a different measure of the difference
between the $n$-gram profiles. We have used an approach similar to
\cite{ngram}, where we compute the profile as an ordered set of the 400
most frequent $n$-grams in a given text (the whole document or a
partial window). Apart from ordering the set, we have ignored the
actual number of occurrences of a given $n$-gram altogether, and used a
value inversely proportional to the position of the $n$-gram in the
profile, in accordance with Zipf's law \cite{zipf1935psycho}.

This approach has provided a more stable style-change function than
the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
nature of the detailed comparison sub-task, we could not use the
results of the intrinsic detection directly, so we intended to use them
as hints for the external detection.
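The following sketch illustrates the profile computation and the
rank-based difference measure described above. It is a minimal
illustration, not our exact implementation: the choice of $n=3$, the
window and step sizes (here in characters), and the precise
dissimilarity formula are assumptions.

\begin{verbatim}
# A minimal sketch of the rank-based n-gram profile difference.
# Window size, step, n = 3, and the dissimilarity formula are
# illustrative assumptions, not the exact submission parameters.
from collections import Counter

PROFILE_SIZE = 400

def profile(text, n=3):
    # Top PROFILE_SIZE character n-grams; occurrence counts are
    # then discarded and only the rank matters (Zipf's law).
    counts = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    top = [g for g, _ in counts.most_common(PROFILE_SIZE)]
    return {g: 1.0 / (rank + 1) for rank, g in enumerate(top)}

def dissimilarity(win_prof, doc_prof):
    # Sum of weight differences over the union of both profiles.
    grams = set(win_prof) | set(doc_prof)
    return sum(abs(win_prof.get(g, 0.0) - doc_prof.get(g, 0.0))
               for g in grams)

def style_change(text, window=5000, step=1000):
    # Raw style-change curve over sliding windows; Gaussian
    # smoothing is applied to this curve afterwards.
    doc_prof = profile(text)
    return [dissimilarity(profile(text[i:i+window]), doc_prof)
            for i in range(0, max(1, len(text) - window), step)]
\end{verbatim}

Windows where the smoothed curve is high indicate a possible style
change; as described above, we treat these only as hints for the
external detection.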
\subsection{Cross-lingual detection}

%For language detection, we used the $n$-gram based categorization \cite{ngram}.
%We have computed the language profiles from the source documents of the
%training corpus (using the annotations from the corpus itself). The result
%of this approach was better than using the stopwords-based detection we had
%used in PAN 2010. However, there were still mis-detected documents,
%mainly the long lists of surnames and other tabular data. We have added
%an ad-hoc fix: documents whose profile was too distant from all of the
%English, German, and Spanish profiles were declared to be in English.

For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate if possible, and to use the resulting
document as the source for the standard intra-lingual detector.
Should the translation service be unavailable, we wanted to fall back
to translating isolated words only, with additional exact matching of
longer words (we have used words of five or more characters). We have
assumed that these longer words can be names or specialized terms,
present in both languages.

We have used dictionaries from several sources, such as
{\tt dicts.info\footnote{\url{http://www.dicts.info/}}},
{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}},
and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source
and the translated document were aligned on a line-by-line basis.

In the final form of the detailed comparison sub-task, the results of
machine translation of the source documents were provided to the
detector programs by the surrounding environment, so we have discarded
the language detection and machine translation from our submission
altogether, and used only the line-by-line alignment of the source and
the translated document for calculating the offsets of text features
in the source document.

\subsection{Multi-feature Plagiarism Detection}

Our pair-wise plagiarism detection is based on finding common passages
of text, present in both the source and the suspicious document. We
call them {\it features}. In PAN 2010, we have used sorted word
5-grams, formed from words of three or more characters, as the
features to compare. Recently, other means of plagiarism detection
have been explored, stop-word $n$-gram detection being one of them
\cite{stamatatos2011plagiarism}.

We propose a plagiarism detection system based on detecting common
features of various types, such as word $n$-grams, stop-word
$n$-grams, translated words or word bigrams, or exact matches of
longer words in document pairs where each document is in a different
language. The system has to be largely independent of the specifics of
the individual feature types. It cannot, for example, use the order of
the features as a measure of distance between them, because several
word 5-grams can be fully contained inside one stop-word 8-gram.

We thus define a {\it common feature} of two documents (susp and src)
as the following tuple:
$$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$$

In our final submission, we have used only the following two types
of common features:

\begin{itemize}
\item word 5-grams, formed from words of three or more characters, sorted, lowercased
\item stop-word 8-grams, formed from the 50 most frequent English words
  (including the possessive suffix 's), unsorted, lowercased, with the
  8-grams formed only from the most frequent words
  ({\it the, of, a, in, to, 's}) removed
\end{itemize}

We have gathered all the common features for a given document pair,
and formed {\it valid intervals} from them, as described in
\cite{Kasprzak2009a} (a similar approach is also used in
\cite{stamatatos2011plagiarism}). The algorithm is modified for
multi-feature detection to use character offsets only, instead of
feature order numbers. We have used valid intervals consisting of at
least 5 common features, with the maximum allowed gap inside the
interval (characters not belonging to any common feature of the given
valid interval) set to 3,500 characters. The two sketches below
illustrate the feature extraction and the interval construction.
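As an illustration of the two feature types, the following sketch
extracts both kinds of features together with their character offsets.
It is a simplified illustration, not our exact implementation: the
tokenizer and the abbreviated stop-word list are assumptions, and the
handling of the possessive suffix is omitted.

\begin{verbatim}
# A simplified sketch of the two feature types described above.
# The tokenizer and the abbreviated stop-word list are assumptions;
# the real system uses the 50 most frequent English words.
import re

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "it", "that"}
VERY_FREQUENT = {"the", "of", "a", "in", "to"}

def words(text):
    # Lowercased tokens with their character offsets.
    return [(m.group(0).lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+", text)]

def word_5grams(text):
    # Sorted, lowercased word 5-grams from words of >= 3 characters.
    toks = [t for t in words(text) if len(t[0]) >= 3]
    return [(tuple(sorted(w for w, _, _ in toks[i:i+5])),
             toks[i][1], toks[i+4][2])     # (feature, start, end)
            for i in range(len(toks) - 4)]

def stopword_8grams(text):
    # Unsorted stop-word 8-grams; those formed only from the very
    # frequent words are removed.
    stops = [t for t in words(text) if t[0] in STOPWORDS]
    feats = []
    for i in range(len(stops) - 7):
        gram = tuple(w for w, _, _ in stops[i:i+8])
        if not all(w in VERY_FREQUENT for w in gram):
            feats.append((gram, stops[i][1], stops[i+7][2]))
    return feats
\end{verbatim}

Matching the resulting features between the source and the suspicious
document then yields the common-feature tuples defined above.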
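Valid intervals can then be formed from the common-feature tuples
roughly as follows. The greedy single-pass chaining shown here is a
simplification of the algorithm of \cite{Kasprzak2009a}; only the
parameter values follow the text above.

\begin{verbatim}
# A sketch of forming valid intervals from common features, where a
# feature is the tuple (susp_off, susp_len, src_off, src_len). The
# greedy single-pass chaining is a simplification of the cited
# algorithm; MIN_FEATURES and MAX_GAP follow the values in the text.
MIN_FEATURES = 5
MAX_GAP = 3500  # characters

def valid_intervals(features):
    intervals, chain = [], []
    for f in sorted(features):  # by offset in the suspicious document
        if chain:
            prev = chain[-1]
            gap_susp = f[0] - (prev[0] + prev[1])
            gap_src = abs(f[2] - (prev[2] + prev[3]))
            # A gap larger than MAX_GAP in either document ends the
            # current chain.
            if gap_susp > MAX_GAP or gap_src > MAX_GAP:
                if len(chain) >= MIN_FEATURES:
                    intervals.append(chain)
                chain = []
        chain.append(f)
    if len(chain) >= MIN_FEATURES:
        intervals.append(chain)
    # Report each interval as offsets and lengths in both documents.
    result = []
    for c in intervals:
        susp_from = c[0][0]
        susp_to = max(x[0] + x[1] for x in c)
        src_from = min(x[2] for x in c)
        src_to = max(x[2] + x[3] for x in c)
        result.append((susp_from, susp_to - susp_from,
                       src_from, src_to - src_from))
    return result
\end{verbatim}

Working with character offsets and gaps in both documents keeps the
procedure independent of the feature type, as required above.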

We have also experimented with modifying the allowed gap size using
the intrinsic plagiarism detection: allowing only a shorter gap if the
common features around the gap belong to different passages detected
as plagiarized in the suspicious document by the intrinsic detector,
and a larger gap if both the surrounding common features belong to the
same such passage. This approach, however, did not show any
improvement over an allowed gap of a static size, so it was omitted
from the final submission.

\subsection{Postprocessing}

As noted in the general approach above, the post-processing phase
mainly serves to merge nearby valid intervals; we will describe it in
detail in the full paper.

\subsection{Further discussion}

In the full paper, we will also discuss the following topics:

\begin{itemize}
\item language detection
\item suitability of the plagdet score \cite{potthastframework} for performance measurement
\item feasibility of our approach in large-scale systems
\item other possible features to use, especially for cross-lingual detection
\item discussion of parameter settings
\end{itemize}