\maketitle

\section{General Approach}

Our approach to the PAN 2012 Plagiarism Detection---Detailed Comparison
sub-task is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.

%The algorithm evaluates the document pair in several stages:
%
%\begin{itemize}
%\item intrinsic plagiarism detection
%\item language detection of the source document
%\begin{itemize}
%\item cross-lingual plagiarism detection, if the source document is not in English
%\end{itemize}
%\item detecting intervals with common features
%\item post-processing phase, which mainly serves to merge the nearby common intervals
%\end{itemize}

%\section{Intrinsic plagiarism detection}
%
%Our approach is based on character $n$-gram profiles of intervals of
%a fixed size (in terms of $n$-grams), and their differences to the
%profile of the whole document \cite{pan09stamatatos}. We have further
%enhanced the approach by using Gaussian smoothing of the style-change
%function \cite{Kasprzak2010}.
%
%For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
%of only 3-grams, and with using a different measure of the difference between
%the $n$-gram profiles. We have used an approach similar to \cite{ngram},
%where we compute the profile as an ordered set of the 400 most frequent
%$n$-grams in a given text (the whole document or a partial window). Apart
%from ordering the set, we have ignored the actual number of occurrences
%of a given $n$-gram altogether, and used a value inversely
%proportional to the $n$-gram order in the profile, in accordance with
%Zipf's law \cite{zipf1935psycho}.
%
%This approach has provided a more stable style-change function than
%the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
%nature of the detailed comparison sub-task, we could not use the results
%of the intrinsic detection directly, so we wanted to use them
%as hints for the external detection.

\section{Cross-lingual Plagiarism Detection}

%For language detection, we used the $n$-gram based categorization \cite{ngram}.
%We have computed the language profiles from the source documents of the
%training corpus (using the annotations from the corpus itself). The results
%of this approach were better than those of the stopword-based detection we
%used in PAN 2010. However, there were still mis-detected documents,
%mainly long lists of surnames and other tabular data. We have added
%an ad-hoc fix: documents whose profiles were too distant from all of the
%English, German, and Spanish profiles were declared to be in English.

%For cross-lingual plagiarism detection, our aim was to use the public
%interface to Google Translate if possible, and to use the resulting document
%as the source for the standard intra-lingual detector.
%Should the translation service not be available, we wanted
%to use the fall-back strategy of translating isolated words only,
%with additional exact matching of longer words (we have used words with
%5 characters or more).
%We have supposed that these longer words can be names or specialized terms,
%present in both languages.

%We have used dictionaries from several sources, like
%{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
%{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
%and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
%and translated documents were aligned on a line-by-line basis.

In the final form of the detailed comparison sub-task, the results of machine
translation of the source documents were provided to the detector programs
by the surrounding environment, so we have discarded language detection
and machine translation from our submission altogether, and used only the
line-by-line alignment of the source and translated documents to calculate
the offsets of text features in the source document. We have then treated
the translated documents the same way as the source documents in English.

\section{Multi-feature Plagiarism Detection}

Our pair-wise plagiarism detection is based on finding common passages
of text, present both in the source and in the suspicious document. We call
them {\it common features}. In PAN 2010, we used sorted word 5-grams, formed
from words of three or more characters, as the features to compare.
Recently, other means of plagiarism detection have been explored:
stopword $n$-gram detection is one of them
\cite{stamatatos2011plagiarism}.

We propose a plagiarism detection system based on detecting common
features of various types, for example word $n$-grams, stopword $n$-grams,
translated single words, translated word bigrams,
exact common longer words from document pairs having each document
in a different language, etc. The system
has to be largely independent of the specifics of the various
feature types. It cannot, for example, use the ordinal positions of features
as a measure of distance between them, because several
word 5-grams can be fully contained inside one stopword 8-gram.

We therefore propose to describe a {\it common feature} of two documents
(susp and src) with the following tuple:
$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$. This way, a common feature is
described purely in terms of the character offsets and lengths of its
occurrences in both documents. In our final submission, we have used the
following two types of common features:

\begin{itemize}
\item word 5-grams, formed from words of three or more characters, sorted, lowercased
\item stopword 8-grams, formed from the 50 most frequent English words (including
 the possessive suffix 's), unsorted, lowercased, with 8-grams formed
 only from the six most frequent words ({\it the, of, a, in, to, 's})
 removed
\end{itemize}
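As an illustration, a minimal Python sketch of computing the first feature
type and matching it across a document pair might look as follows. The
function names, the regex-based tokenization, and the exact definition of a
feature's character span are our assumptions for this sketch, not the
implementation we have actually used:

\begin{verbatim}
import re

def word_5grams(text, n=5, min_len=3):
    # Sorted, lowercased word n-grams from words of min_len or more
    # characters, mapped to the (offset, length) spans they cover.
    words = [(m.group(0).lower(), m.start(), m.end())
             for m in re.finditer(r'\w+', text)
             if m.end() - m.start() >= min_len]
    spans = {}
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        key = tuple(sorted(w for w, _, _ in window))
        offset = window[0][1]
        spans.setdefault(key, []).append(
            (offset, window[-1][2] - offset))
    return spans

def common_features(susp, src):
    # Features present in both documents, expressed as
    # (offset_susp, length_susp, offset_src, length_src) tuples.
    f_susp, f_src = word_5grams(susp), word_5grams(src)
    return [(o1, l1, o2, l2)
            for key in f_susp.keys() & f_src.keys()
            for (o1, l1) in f_susp[key]
            for (o2, l2) in f_src[key]]
\end{verbatim}

Features of the other types (stopword 8-grams, translated word bigrams,
etc.)\ would produce tuples of the same shape, so the subsequent
interval-building stage does not need to know which feature type generated
a given tuple.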
We have gathered all the common features of both types for a given document
pair, and formed {\it valid intervals} from them, as described
in \cite{Kasprzak2009a}. A similar approach is also used in
\cite{stamatatos2011plagiarism}.
The algorithm is modified for multi-feature detection to use character offsets
only, instead of feature order numbers. We have used valid intervals
consisting of at least 5 common features, with the maximum allowed gap
inside the interval (characters not belonging to any common feature
of the given valid interval) set to 3,500 characters.
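A minimal sketch of this stage might look like the following, under the
simplifying assumption of a single greedy left-to-right pass over the
features sorted by their offset in the suspicious document (the actual
algorithm of \cite{Kasprzak2009a} is more involved):

\begin{verbatim}
def valid_intervals(features, min_features=5, max_gap=3500):
    # features: (offset_susp, length_susp, offset_src, length_src)
    # tuples. Split the feature stream wherever the gap to the
    # previous feature exceeds max_gap characters in either document;
    # keep only groups with at least min_features features.
    intervals, current = [], []
    for f in sorted(features):
        if current:
            prev = current[-1]
            if (f[0] - (prev[0] + prev[1]) > max_gap or
                    f[2] - (prev[2] + prev[3]) > max_gap):
                if len(current) >= min_features:
                    intervals.append(current)
                current = []
        current.append(f)
    if len(current) >= min_features:
        intervals.append(current)
    return intervals
\end{verbatim}

Each kept group then yields one detection, spanning from the start of its
first common feature to the end of its last one in both documents.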
%We have also experimented with modifying the allowed gap size using the
%intrinsic plagiarism detection: to allow only a shorter gap if the common
%features around the gap belong to different passages, detected as plagiarized
%in the suspicious document by the intrinsic detector, and to allow a larger
%gap if both the surrounding common features belong to the same passage
%detected by the intrinsic detector. This approach, however, did not show
%any improvement over an allowed gap of a static size, so it was omitted
%from the final submission.

\section{Postprocessing}

In the postprocessing phase, we took the resulting valid intervals
and attempted to further improve the results. We first
removed overlaps: if both overlapping intervals were
shorter than 300 characters, we removed both of them; otherwise, we
kept the longer detection (longer in terms of length in the suspicious
document).

We have then joined adjacent valid intervals into one detection
if at least one of the following criteria was met:
\begin{itemize}
\item the gap between the intervals contained at least 4 common features,
and it contained at least one feature per 10,000
characters\footnote{We have computed the length of the gap as the number
of characters between the detections in the source document, plus the
number of characters between the detections in the suspicious document.}, or
\item the gap was smaller than 30,000 characters and the size of the adjacent
valid intervals was at least twice as big as the gap between them, or
\item the gap was smaller than 30,000 characters and the number of common
features per character in the adjacent interval was not more than three times
bigger than the number of features per character in the possible joined
interval.
\end{itemize}

These parameters were fine-tuned to achieve the best results on the
training corpus. With these parameters, our algorithm achieved a total
plagdet score of 0.73 on the training corpus.

\section{Further Discussion}

As in our PAN 2010 submission, we tried to make use of intrinsic plagiarism
detection, but despite making further improvements to the intrinsic
plagiarism detector, we have again failed to reach any significant
improvement when using it as a hint for the external plagiarism detection.

In the full paper, we will also discuss the following topics:

\begin{itemize}
\item language detection and cross-language common features
\item intrinsic plagiarism detection
\item suitability of the plagdet score \cite{potthastframework} for
performance measurement
\item feasibility of our approach in large-scale systems
\item discussion of parameter settings
\end{itemize}

\nocite{pan09stamatatos}
%\nocite{ngram}

\bibliographystyle{splncs03}
\begin{raggedright}
\bibliography{paper}
\end{raggedright}

\end{document}