yenya: applied comments from Simon

diff --git a/extended-abstract.tex b/extended-abstract.tex

index 301338d5982c9d3735f5d46e09d6c62c6a9e01da..675485234f1b11b54e4d86273cea7d2b8823948a 100644 (file)
--- a/extended-abstract.tex
+++ b/extended-abstract.tex
@@ -16,19 +16,187 @@
  
  \maketitle
  
-- reference to previous work
+\section{General Approach}
  
-- parameters we used
+Our approach to the detailed comparison sub-task of PAN 2012 plagiarism detection
+is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
  
-- multi-features
+%The algorithm evaluates the document pair in several stages:
+%
+%\begin{itemize}
+%\item intrinsic plagiarism detection
+%\item language detection of the source document
+%\begin{itemize}
+%\item cross-lingual plagiarism detection, if the source document is not in English
+%\end{itemize}
+%\item detecting intervals with common features
+%\item post-processing phase, mainly serves for merging the nearby common intervals
+%\end{itemize}
  
-- post-processing
+%\section{Intrinsic plagiarism detection}
+%
+%Our approach is based on character $n$-gram profiles of intervals of
+%a fixed size (in terms of $n$-grams), and their differences from the
+%profile of the whole document \cite{pan09stamatatos}. We have further
+%enhanced the approach by applying Gaussian smoothing to the style-change
+%function \cite{Kasprzak2010}.
+%
+%For PAN 2012, we experimented with using 1-, 2-, and 3-grams instead
+%of only 3-grams, and with a different measure of the difference between
+%the $n$-gram profiles. We used an approach similar to \cite{ngram},
+%where we computed the profile as an ordered set of the 400 most frequent
+%$n$-grams in a given text (the whole document or a partial window). Apart
+%from ordering the set, we ignored the actual number of occurrences
+%of a given $n$-gram altogether, and used a value inversely
+%proportional to the rank of the $n$-gram in the profile, in accordance with
+%Zipf's law \cite{zipf1935psycho}.
+%
+%This approach provided a more stable style-change function than
+%the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
+%nature of the detailed comparison sub-task, we could not use the results
+%of the intrinsic detection directly, so we wanted to use them
+%as hints for the external detection.
  
-- critique of plagdet?
+\section{Cross-lingual Plagiarism Detection}
+
+%For language detection, we used $n$-gram based categorization \cite{ngram}.
+%We computed the language profiles from the source documents of the
+%training corpus (using the annotations from the corpus itself). This
+%approach gave better results than the stopword-based detection we
+%used in PAN 2010. However, some documents were still misdetected,
+%mainly long lists of surnames and other tabular data. We added
+%an ad-hoc fix: documents whose profile was too distant from all of the
+%English, German, and Spanish profiles were declared to be in English.
+
+%For cross-lingual plagiarism detection, our aim was to use the public
+%interface to Google Translate if possible, and to use the resulting document
+%as the source for the standard intra-lingual detector.
+%Should the translation service not be available, we wanted
+%to fall back to translating isolated words only,
+%with additional exact matching of longer words (we used words of
+%5 characters or more).
+%We assumed that these longer words could be names or specialized terms
+%present in both languages.
+
+%We used dictionaries from several sources, such as
+%{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
+%{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+%and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
+%and translated documents were aligned on a line-by-line basis.
+
+In the final form of the detailed comparison sub-task, machine translations
+of the source documents were provided to the detector programs
+by the surrounding environment. We therefore dropped language detection
+and machine translation from our submission altogether, and used only
+line-by-line alignment of the source and translated documents to calculate
+the offsets of text features in the source document. We then treated
+the translated documents the same way as source documents in English.
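
The line-by-line offset mapping described above can be illustrated with a minimal Python sketch. It is not the submitted code: it assumes that the translated text has exactly one line per source line, and it maps a span only at line granularity. The function names `line_starts` and `map_span_to_source` are hypothetical.

```python
from bisect import bisect_right


def line_starts(text):
    """Character offsets at which each line of `text` begins."""
    return [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]


def map_span_to_source(translated, source, offset, length):
    """Map a span in the translated text to the source lines it covers.

    Assumption: line k of `translated` is the translation of line k
    of `source`. Returns (offset, length) in the source text.
    """
    t_starts = line_starts(translated)
    s_starts = line_starts(source)
    last_char = offset + max(length, 1) - 1
    first_line = bisect_right(t_starts, offset) - 1
    last_line = bisect_right(t_starts, last_char) - 1
    start = s_starts[first_line]
    # End of the last covered line, or end of text for the final line.
    if last_line + 1 < len(s_starts):
        end = s_starts[last_line + 1]
    else:
        end = len(source)
    return start, end - start
```

With this sketch, a feature found on the second line of the translated text is reported as the whole second line of the source text, which is the coarsest mapping consistent with a line-by-line alignment.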
+
+\section{Multi-feature Plagiarism Detection}
+
+Our pair-wise plagiarism detection is based on finding common passages
+of text, present in both the source and the suspicious document. We call them
+{\it common features}. In PAN 2010, we used sorted word 5-grams, formed from
+words of three or more characters, as features to compare.
+Recently, other means of plagiarism detection have been explored;
+stopword $n$-gram detection is one of them
+\cite{stamatatos2011plagiarism}.
+
+We propose a plagiarism detection system based on detecting common
+features of various types, for example word $n$-grams, stopword $n$-grams,
+translated single words, translated word bigrams, or
+exact common longer words from document pairs in which the two documents
+are in different languages. The system
+has to be largely independent of the specifics of the individual
+feature types. It cannot, for example, use the ordinal position of features
+as a measure of distance between them, because several
+word 5-grams can be fully contained inside one stopword 8-gram.
+
+We therefore propose to describe the {\it common feature} of two documents
+(susp and src) with the following tuple:
+$\langle
+\hbox{offset}_{\hbox{susp}},
+\hbox{length}_{\hbox{susp}},
+\hbox{offset}_{\hbox{src}},
+\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
+described purely in terms of the character offsets it occupies
+in both documents. In our final submission, we used the following two types
+of common features:
+
+\begin{itemize}
+\item word 5-grams, from words of three or more characters, sorted, lowercased
+\item stopword 8-grams, from the 50 most frequent English words (including
+       the possessive suffix 's), unsorted, lowercased, with 8-grams formed
+       only from the seven most-frequent words ({\it the, of, a, in, to, 's})
+       removed
+\end{itemize}
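
As an illustration of the four-element tuple and of the first feature type, here is a minimal Python sketch. It covers only sorted word 5-grams; the tokenizer (runs of letters) and the names `CommonFeature`, `word_ngram_spans`, and `common_features` are assumptions made for this example, not the submitted implementation.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class CommonFeature(NamedTuple):
    offset_susp: int
    length_susp: int
    offset_src: int
    length_src: int


WORD_RE = re.compile(r"[^\W\d_]+")  # assumed tokenizer: runs of letters


def word_ngram_spans(text, n=5, min_word_len=3):
    """Map each sorted, lowercased word n-gram to its (offset, length) spans."""
    words = [(m.group().lower(), m.start(), m.end())
             for m in WORD_RE.finditer(text)
             if len(m.group()) >= min_word_len]
    spans = defaultdict(list)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        key = tuple(sorted(w for w, _, _ in window))
        start, end = window[0][1], window[-1][2]
        spans[key].append((start, end - start))
    return spans


def common_features(susp_text, src_text):
    """All pairs of matching n-gram occurrences, as character-offset tuples."""
    susp, src = word_ngram_spans(susp_text), word_ngram_spans(src_text)
    found = []
    for key in susp.keys() & src.keys():
        for s_off, s_len in susp[key]:
            for r_off, r_len in src[key]:
                found.append(CommonFeature(s_off, s_len, r_off, r_len))
    return sorted(found)
```

Because the n-gram is sorted before comparison, the sketch matches a passage even when the five words appear in a different order in the two documents. A stopword 8-gram extractor would return spans in the same form, so the later stages need only the tuples.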
+
+We gathered all the common features of both types for a given document
+pair, and formed {\it valid intervals} from them, as described
+in \cite{Kasprzak2009a}. A similar approach is also used in
+\cite{stamatatos2011plagiarism}.
+For multi-feature detection, the algorithm is modified to use only character
+offsets instead of feature order numbers. We used valid intervals
+consisting of at least 5 common features, with the maximum allowed gap
+inside the interval (characters not belonging to any common feature
+of a given valid interval) set to 3,500 characters.
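
The following Python sketch shows one way to form valid intervals from character-offset tuples with the two parameters above. It is a simplified reading of the procedure, not the algorithm of \cite{Kasprzak2009a} itself: groups of features are split repeatedly wherever the uncovered gap exceeds the limit in either document, and groups with too few features are discarded. The function names are hypothetical.

```python
def split_on_gaps(features, start, end, max_gap):
    """Split features into runs with no uncovered gap longer than max_gap."""
    runs, reach = [], None
    for f in sorted(features, key=start):
        if reach is None or start(f) - reach > max_gap:
            runs.append([])
            reach = end(f)
        runs[-1].append(f)
        reach = max(reach, end(f))
    return runs


# Accessors for (offset_susp, length_susp, offset_src, length_src) tuples.
SUSP = (lambda f: f[0], lambda f: f[0] + f[1])
SRC = (lambda f: f[2], lambda f: f[2] + f[3])


def valid_intervals(features, min_features=5, max_gap=3500):
    """Return (susp_start, susp_end, src_start, src_end) for each interval."""
    pending, result = [list(features)], []
    while pending:
        group = pending.pop()
        for start, end in (SUSP, SRC):
            runs = split_on_gaps(group, start, end, max_gap)
            if len(runs) > 1:
                pending.extend(runs)  # re-check each part in both documents
                break
        else:
            if len(group) >= min_features:
                result.append((min(f[0] for f in group),
                               max(f[0] + f[1] for f in group),
                               min(f[2] for f in group),
                               max(f[2] + f[3] for f in group)))
    return sorted(result)
```

The sketch never refers to the ordinal position of a feature, only to offsets and lengths, so features of different types can be mixed freely in the input list.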
+
+%We also experimented with modifying the allowed gap size using the
+%intrinsic plagiarism detection: allowing only a shorter gap if the common
+%features around the gap belong to different passages detected as plagiarized
+%in the suspicious document by the intrinsic detector, and allowing a larger gap
+%if both surrounding common features belong to the same passage
+%detected by the intrinsic detector. This approach, however, did not show
+%any improvement over an allowed gap of a static size, so it was omitted
+%from the final submission.
+
+\section{Postprocessing}
+
+In the postprocessing phase, we took the resulting valid intervals
+and attempted to improve the results further. We first
+removed overlaps: if both overlapping intervals were
+shorter than 300 characters, we removed both of them. Otherwise, we
+kept the longer detection (longer in terms of length in the suspicious document).
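
A minimal Python sketch of this overlap rule follows. The text does not say in which document the overlap is tested; the sketch assumes overlap in the suspicious document and processes detections in order of their start offset. The name `remove_overlaps` is hypothetical.

```python
def remove_overlaps(detections, short=300):
    """detections: (susp_start, susp_end, src_start, src_end) tuples.

    Assumption: two detections overlap when their ranges in the
    suspicious document intersect.
    """
    def susp_len(d):
        return d[1] - d[0]

    kept = []
    for d in sorted(detections, key=lambda d: (d[0], d[1])):
        if kept and d[0] < kept[-1][1]:
            prev = kept[-1]
            if susp_len(prev) < short and susp_len(d) < short:
                kept.pop()        # both are short: remove both
            elif susp_len(d) > susp_len(prev):
                kept[-1] = d      # keep the longer detection
            # otherwise the earlier, longer detection stays
        else:
            kept.append(d)
    return kept
```

Since the kept detections never overlap each other and are ordered by start offset, a new detection can only overlap the most recently kept one, which is why a single comparison per detection is enough here.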
+
+We then joined adjacent valid intervals into one detection
+if at least one of the following criteria was met:
+\begin{itemize}
+\item the gap between the intervals contained at least 4 common features,
+and it contained at least one feature per 10,000
+characters\footnote{We computed the length of the gap as the number
+of characters between the detections in the source document, plus the
+number of characters between the detections in the suspicious document.}, or
+\item the gap was smaller than 30,000 characters and the size of the adjacent
+valid intervals was at least twice as big as the gap between them, or
+\item the gap was smaller than 30,000 characters and the number of common
+features per character in the adjacent interval was not more than three times
+the number of features per character in the possible joined interval.
+\end{itemize}
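
The three joining criteria can be written as a single predicate. The Python sketch below rests on several assumptions that the text leaves open: interval sizes are measured the same way as the gap (source plus suspicious characters), criterion two requires each neighbouring interval to be at least twice the gap, and criterion three compares the denser of the two neighbours with the joined interval. The name `should_join` and its parameters are hypothetical.

```python
def should_join(gap_chars, gap_features, size_a, size_b, feats_a, feats_b):
    """Decide whether two adjacent valid intervals should be merged.

    gap_chars: gap length, source plus suspicious characters.
    gap_features: number of common features inside the gap.
    size_a, size_b: sizes of the two intervals (assumed positive).
    feats_a, feats_b: number of common features in each interval.
    """
    # Criterion 1: enough features in the gap, at least one per 10,000 chars.
    if gap_features >= 4 and gap_features * 10_000 >= gap_chars:
        return True
    # Criteria 2 and 3 apply only to gaps below 30,000 characters.
    if gap_chars >= 30_000:
        return False
    # Criterion 2 (assumed reading): each neighbour is at least twice the gap.
    if min(size_a, size_b) >= 2 * gap_chars:
        return True
    # Criterion 3: joining must not dilute the feature density too much.
    joined_density = (feats_a + feats_b + gap_features) / (size_a + size_b + gap_chars)
    adjacent_density = max(feats_a / size_a, feats_b / size_b)
    return adjacent_density <= 3 * joined_density
```

For example, two 500-character intervals with 5 features each and a feature-free gap of 1,000 characters are joined by the third criterion, while the same intervals separated by 20,000 characters are not.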
+
+These parameters were fine-tuned to achieve the best results on the training corpus. With these parameters, our algorithm achieved an overall plagdet score of 0.73 on the training corpus.
+
+\section{Further discussion}
+
+As in our PAN 2010 submission, we tried to make use of intrinsic plagiarism
+detection. Despite further improvements to the intrinsic plagiarism detector, we again failed to achieve any significant improvement
+when using it as a hint for the external plagiarism detection.
+
+In the full paper, we will also discuss the following topics:
+
+\begin{itemize}
+\item language detection and cross-language common features
+\item intrinsic plagiarism detection
+\item suitability of the plagdet score~\cite{potthastframework} for performance measurement
+\item feasibility of our approach in large-scale systems
+\item discussion of parameter settings
+\end{itemize}
+
+\nocite{pan09stamatatos}
+%\nocite{ngram}
  
  \bibliographystyle{splncs03}
  \begin{raggedright}
-\bibliography{}
+\bibliography{paper}
  \end{raggedright}
  
  \end{document}