From addd34deb1859f19a1978e77878a4247868d6d67 Mon Sep 17 00:00:00 2001
From: "Jan \"Yenya\" Kasprzak"
Date: Fri, 10 Aug 2012 16:31:03 +0200
Subject: [PATCH] yenya: uncomment the paragraphs that were commented out to
 shorten the extended abstract

---
 yenya-detailed.tex | 136 ++++++++++++++++++++++-----------------------
 1 file changed, 68 insertions(+), 68 deletions(-)

diff --git a/yenya-detailed.tex b/yenya-detailed.tex
index 6e4ef03..525a1d3 100644
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -6,68 +6,68 @@
 Our approach in PAN 2012 Plagiarism detection---Detailed comparison
 sub-task is loosely based on the approach we have used in PAN 2010
 \cite{Kasprzak2010}.
 
-%The algorithm evaluates the document pair in several stages:
-%
-%\begin{itemize}
-%\item intrinsic plagiarism detection
-%\item language detection of the source document
-%\begin{itemize}
-%\item cross-lingual plagiarism detection, if the source document is not in English
-%\end{itemize}
-%\item detecting intervals with common features
-%\item a post-processing phase, which mainly serves to merge nearby common intervals
-%\end{itemize}
-
-%\subsection{Intrinsic plagiarism detection}
-%
-%Our approach is based on character $n$-gram profiles of intervals of
-%a fixed size (in terms of $n$-grams), and on their differences from the
-%profile of the whole document \cite{pan09stamatatos}. We have further
-%enhanced the approach by using Gaussian smoothing of the style-change
-%function \cite{Kasprzak2010}.
-%
-%For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-%of 3-grams only, and with a different measure of the difference between
-%the $n$-gram profiles. We have used an approach similar to \cite{ngram},
-%where we compute the profile as an ordered set of the 400 most frequent
-%$n$-grams in a given text (the whole document or a partial window). Apart
-%from ordering the set, we have ignored the actual number of occurrences
-%of a given $n$-gram altogether, and used a value inversely
-%proportional to the $n$-gram rank in the profile, in accordance with
-%Zipf's law \cite{zipf1935psycho}.
-%
-%This approach has provided a more stable style-change function than
-%the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
-%nature of the detailed comparison sub-task, we could not use the
-%results of the intrinsic detection directly; we therefore wanted to
-%use them as hints for the external detection.
+The algorithm evaluates the document pair in several stages:
+
+\begin{itemize}
+\item intrinsic plagiarism detection
+\item language detection of the source document
+\begin{itemize}
+\item cross-lingual plagiarism detection, if the source document is not in English
+\end{itemize}
+\item detecting intervals with common features
+\item a post-processing phase, which mainly serves to merge nearby common intervals
+\end{itemize}
+
+\subsection{Intrinsic plagiarism detection}
+
+Our approach is based on character $n$-gram profiles of intervals of
+a fixed size (in terms of $n$-grams), and on their differences from the
+profile of the whole document \cite{pan09stamatatos}. We have further
+enhanced the approach by using Gaussian smoothing of the style-change
+function \cite{Kasprzak2010}.
+
+For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
+of 3-grams only, and with a different measure of the difference between
+the $n$-gram profiles. We have used an approach similar to \cite{ngram},
+where we compute the profile as an ordered set of the 400 most frequent
+$n$-grams in a given text (the whole document or a partial window). Apart
+from ordering the set, we have ignored the actual number of occurrences
+of a given $n$-gram altogether, and used a value inversely
+proportional to the $n$-gram rank in the profile, in accordance with
+Zipf's law \cite{zipf1935psycho}.
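+
+To illustrate, the following simplified Python sketch shows one
+possible instantiation of the profile computation and of the
+rank-based profile difference (the sliding window and the Gaussian
+smoothing of the style-change function are omitted; the exact form
+of the difference measure and the function names are illustrative
+assumptions, not the literal implementation):
+
+\begin{verbatim}
+from collections import Counter
+
+def ngram_profile(text, ns=(1, 2, 3), size=400):
+    # Count character n-grams and keep only the `size` most
+    # frequent ones.  Only the ordering matters: an entry's weight
+    # is inversely proportional to its rank (Zipf's law), and the
+    # actual occurrence counts are discarded.
+    counts = Counter()
+    for n in ns:
+        for i in range(len(text) - n + 1):
+            counts[text[i:i + n]] += 1
+    ordered = [g for g, _ in counts.most_common(size)]
+    return {g: 1.0 / (rank + 1) for rank, g in enumerate(ordered)}
+
+def profile_difference(p1, p2):
+    # Sum of weight differences over the union of both profiles;
+    # an n-gram missing from a profile contributes zero weight.
+    grams = set(p1) | set(p2)
+    return sum(abs(p1.get(g, 0.0) - p2.get(g, 0.0)) for g in grams)
+\end{verbatim}
+
+The style-change function is then the difference between the profile
+of a sliding window and the profile of the whole document, smoothed
+with a Gaussian kernel.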
+
+This approach has provided a more stable style-change function than
+the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
+nature of the detailed comparison sub-task, we could not use the
+results of the intrinsic detection directly; we therefore wanted to
+use them as hints for the external detection.
 
 \subsection{Cross-lingual Plagiarism Detection}
 
-%For language detection, we used the $n$-gram based categorization \cite{ngram}.
-%We have computed the language profiles from the source documents of the
-%training corpus (using the annotations from the corpus itself). The results
-%of this approach were better than those of the stopword-based detection
-%we used in PAN 2010. However, there were still mis-detected documents,
-%mainly long lists of surnames and other tabular data. We have added an
-%ad-hoc fix: documents whose profile was too distant from all of the
-%English, German, and Spanish profiles were declared to be in English.
-
-%For cross-lingual plagiarism detection, our aim was to use the public
-%interface of Google Translate if possible, and to use the resulting
-%document as the source for the standard intra-lingual detector.
-%Should the translation service not be available, we wanted to use the
-%fall-back strategy of translating isolated words only, with additional
-%exact matching of longer words (we have used words with 5 characters
-%or more). We have supposed that these longer words may be names or
-%specialized terms present in both languages.
-
-%We have used dictionaries from several sources, such as
-%{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
-%{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-%and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The
-%source and the translated document were aligned on a line-by-line basis.
+For language detection, we used the $n$-gram based categorization \cite{ngram}.
+We have computed the language profiles from the source documents of the
+training corpus (using the annotations from the corpus itself). The results
+of this approach were better than those of the stopword-based detection
+we used in PAN 2010. However, there were still mis-detected documents,
+mainly long lists of surnames and other tabular data. We have added an
+ad-hoc fix: documents whose profile was too distant from all of the
+English, German, and Spanish profiles were declared to be in English.
+
+For cross-lingual plagiarism detection, our aim was to use the public
+interface of Google Translate if possible, and to use the resulting
+document as the source for the standard intra-lingual detector.
+Should the translation service not be available, we wanted to use the
+fall-back strategy of translating isolated words only, with additional
+exact matching of longer words (we have used words with 5 characters
+or more). We have supposed that these longer words may be names or
+specialized terms present in both languages.
+
+We have used dictionaries from several sources, such as
+{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
+{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The
+source and the translated document were aligned on a line-by-line basis.
 
 In the final form of the detailed comparison sub-task, the results of
 machine translation of the source documents were provided to the
 detector programs
@@ -126,14 +126,14 @@
 consisting of at least 5 common features, with the maximum allowed gap
 inside the interval (characters not belonging to any common feature of
 a given valid interval) set to 3,500 characters.
 
-%We have also experimented with modifying the allowed gap size using the
-%intrinsic plagiarism detection: allowing only a shorter gap if the
-%common features around the gap belong to different passages detected as
-%plagiarized in the suspicious document by the intrinsic detector, and a
-%larger gap if both surrounding common features belong to the same
-%intrinsically detected passage. This approach, however, did not show
-%any improvement over an allowed gap of a static size, so it was omitted
-%from the final submission.
+We have also experimented with modifying the allowed gap size using the
+intrinsic plagiarism detection: allowing only a shorter gap if the
+common features around the gap belong to different passages detected as
+plagiarized in the suspicious document by the intrinsic detector, and a
+larger gap if both surrounding common features belong to the same
+intrinsically detected passage. This approach, however, did not show
+any improvement over an allowed gap of a static size, so it was omitted
+from the final submission.
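+
+As an illustration, the following simplified Python sketch detects
+valid intervals with the static allowed gap used in the final
+submission (simplifying assumptions: the common features are given as
+character-offset pairs in one document of the pair, sorted by start,
+and the allowed gap is interpreted as the distance between two
+consecutive common features):
+
+\begin{verbatim}
+def valid_intervals(features, max_gap=3500, min_features=5):
+    # features: sorted list of (start, end) character offsets of
+    # the common features of the document pair in one document.
+    intervals, current = [], []
+    for start, end in features:
+        if current and start - current[-1][1] > max_gap:
+            # The gap is too large: close the running interval and
+            # keep it only if it has enough common features.
+            if len(current) >= min_features:
+                intervals.append((current[0][0], current[-1][1]))
+            current = []
+        current.append((start, end))
+    if len(current) >= min_features:
+        intervals.append((current[0][0], current[-1][1]))
+    return intervals
+\end{verbatim}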
 
 \subsection{Postprocessing}
 
@@ -177,6 +177,6 @@ In the full paper, we will also discuss the following topics:
 \end{itemize}
 
 \nocite{pan09stamatatos}
-%\nocite{ngram}
+\nocite{ngram}
-- 
2.43.0