yenya: next version

author Jan "Yenya" Kasprzak <kas@fi.muni.cz>

Wed, 15 Aug 2012 20:36:24 +0000 (22:36 +0200)

committer Jan "Yenya" Kasprzak <kas@fi.muni.cz>

Wed, 15 Aug 2012 20:36:24 +0000 (22:36 +0200)
diff --git a/yenya-detailed.tex b/yenya-detailed.tex

index 46a2cd5be7a6c041e974d1a71e1ed449972a7979..dd28b4dc2a1be5a038ba996ca4058aa7d07945d9 100755 (executable)
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -9,11 +9,11 @@ The submitted program has been run in a controlled environment
  separately for each document pair, without the possibility of keeping any
  data between runs.
  
-In this section, we describe our approach in the detailed comparison
-task. The rest of this section is organized as follows: in the next
-subsection, we summarise the differences from our previous approach.
-In subsection \ref{sec-alg-overview}, we give an overview of our approach.
-TODO describe what the final version looks like.
+%In this section, we describe our approach in the detailed comparison
+%task. The rest of this section is organized as follows: in the next
+%subsection, we summarise the differences from our previous approach.
+%In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+%TODO describe what the final version looks like.
  
  \subsection{Differences Against PAN 2010}
  
@@ -136,6 +136,7 @@ inside the interval (characters not belonging to any common feature
  of a given valid interval) set to 4000 characters.
  
  \subsection{Postprocessing}
+\label{postprocessing}
  
  In the postprocessing phase, we took the resulting valid intervals,
  and made an attempt to further improve the results. We have firstly
@@ -196,47 +197,31 @@ them here is worthwhile nevertheless.
  
  \subsubsection{Intrinsic Plagiarism Detection}
  
-Our approach is based on character $n$-gram profiles of the interval of
+We tested the approach based on character $n$-gram profiles of the interval of
  the fixed size (in terms of $n$-grams), and their differences to the
  profile of the whole document \cite{pan09stamatatos}. We have further
  enhanced the approach by using Gaussian smoothing of the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and using the different measure of the difference between
-the n-gram profiles. We have used an approach similar to \cite{ngram},
-where we have computed the profile as an ordered set of 400 most-frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set, we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used the value inversely
-proportional to the $n$-gram order in the profile, in accordance with
-the Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided a more stable style-change function than
-the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
-nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, therefore we wanted to use them
-as hints to the external detection.
-
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
-
-\subsubsection{Language Detection}
-
-For language detection, we used the $n$-gram based categorization \cite{ngram}.
-We have computed the language profiles from the source documents of the
-training corpus (using the annotations from the corpus itself). The result
-of this approach was better than using the stopwords-based detection we have
-used in PAN 2010. However, there were still mis-detected documents,
-mainly the long lists of surnames and other tabular data. We have added
-an ad-hoc fix, where for documents having their profile too distant from all of
-English, German, and Spanish profiles, we have declared them to be in English.
+function \cite{Kasprzak2010}. For PAN 2012, we made further improvements
+to the algorithm, resulting in a more stable style-change function in
+both short and long documents.
+
+We tried to use the results of the intrinsic plagiarism detection
+as a hint for the post-processing phase, allowing intervals separated
+by a larger gap to be merged if they both belong to the same passage detected by
+the intrinsic detector. This approach did not provide any improvement
+when compared to the static gap limits described in Section
+\ref{postprocessing}, so we omitted it from our final submission.
+
+%\subsubsection{Language Detection}
+%
+%For language detection, we used the $n$-gram based categorization \cite{ngram}.
+%We computed the language profiles from the source documents of the
+%training corpus (using the annotations from the corpus itself). The result
+%of this approach was better than using the stopwords-based detection we have
+%used in PAN 2010. However, there were still mis-detected documents,
+%mainly the long lists of surnames and other tabular data. We added
+%an ad-hoc fix, where for documents having their profile too distant from all of
+%English, German, and Spanish profiles, we declared them to be in English.
  
  \subsubsection{Cross-lingual Plagiarism Detection}
  
@@ -250,34 +235,61 @@ with the additional exact matching of longer words (we have used words with
  We have supposed that these longer words can be names or specialized terms,
  present in both languages.
  
-We have used dictionaries from several sources, like
+We used dictionaries from several sources, for example
  {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
  {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
-and translated document were aligned on a line-by-line basis.
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
  
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document. We have then treated
-the translated documents the same way as the source documents in English.
+In the final submission, we simply used the machine-translated texts,
+which were provided to the running program by the surrounding environment.
  
-\subsection{Performance Notes}
  
-We consider comparing the performance of PAN 2012 submissions almost
+\subsection{Further Discussion}
+
+From our previous PAN submissions, we knew that the precision of our
+system was good, and because of the way the final score is computed, we
+wanted to trade slightly worse precision for better recall and granularity.
+So we pushed the parameters towards detecting more plagiarized passages,
+even when the number of common features was not especially high.
+
+\subsubsection{Plagdet Score}
+
+Our results from tuning the parameters show that the plagdet score~\cite{potthastframework}
+is not a good measure for comparing plagiarism detection systems:
+for example, the gap of 30,000 characters, described in Section \ref{postprocessing},
+can easily mean several pages of text. And still, the system with this
+parameter set so high achieved a better plagdet score.
+
+Another problem of plagdet can be
+seen in the 01-no-plagiarism part of the training corpus: the border
+between the perfect score 1 and the score 0 is a single false-positive
+detection. Plagdet does not distinguish between a system reporting this
+single false positive and a system reporting the whole data set as plagiarized.
+Both get the score 0. However, our experience with real-world plagiarism detection systems shows that
+plagiarized documents are in a clear minority, so the performance of
+the detection system on non-plagiarized documents is very important.
+
+\subsubsection{Performance Notes}
+
+We consider comparing the CPU-time performance of PAN 2012 submissions almost
  meaningless, because any sane system would precompute features for all
  documents in a given set of suspicious and source documents, and use the
  results for pair-wise comparison, expecting that any document will take
  part in more than one pair.
  
-We did not use this exact split in our submission, but in order to be able
-to evaluate various approaches faster, we have split our computation into
-the following two parts: in the first part, common features have been
-computed, and the results stored into a file\footnote{We have used the
-{\tt Storable.pm} storage available in Perl.}. The second part
-then used this data to compute valid intervals and do post-processing.
+Also, pair-wise comparison without caching any intermediate results
+leads to worse overall performance: in our PAN 2010 submission, one of the
+post-processing steps was to remove all the overlapping detections
+from a given suspicious document, when these detections were from different
+source documents, and were short enough. This removed many false positives
+and improved the precision of our system. This kind of heuristic was
+not possible in PAN 2012.
+
+As for the performance of our system, we split the task into two parts:
+(1) finding the common features, and (2) computing valid intervals and
+postprocessing. The first part is more CPU-intensive, and its results
+can be cached. The second part is fast enough to allow us to evaluate
+many combinations of parameters.
  
  We did our development on a machine with four six-core AMD 8139 CPUs
  (2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
@@ -289,35 +301,11 @@ When we have tried to use intrinsic plagiarism detection and language
  detection, the first phase took about 12500 seconds. Thus omitting these
  features clearly provided a huge performance improvement.
  
-The code has been written in Perl, and had about 669 lines of code,
+The code was written in Perl, and had about 669 lines of code,
  not counting comments and blank lines.
  
-\subsection{Further discussion}
-
-As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
-detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
-when using it as a hint for the external plagiarism detection.
-
-In the full paper, we will also discuss the following topics:
-
-\begin{itemize}
-\item language detection and cross-language common features
-\item intrinsic plagiarism detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item discussion of parameter settings
-\end{itemize}
-
-\nocite{pan09stamatatos}
-\nocite{ngram}
-
  \endinput
  
-What I want to discuss in the conclusion:
-- it was not possible to cache data
-- it was not possible to exclude overlapping similarities
-- so the run-time figures are completely useless
-- 669 lines of code, not counting comments and blank lines
  - the boundary between passages sometimes included whitespace and sometimes did not.
  
  Discussion of plagdet:
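
The precision/recall trade-off mentioned in this commit can be illustrated with a minimal sketch. It is not part of the submitted Perl code. It assumes the usual definition of plagdet as the F1 measure of precision and recall divided by log2(1 + granularity). The function name and the sample figures are hypothetical and are not measured results; they only show how giving up a little precision for better recall and granularity can raise the score.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Return F1(precision, recall) / log2(1 + granularity).

    Granularity is expected to be at least 1 (one detection per case).
    """
    if granularity < 1.0:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0.0:
        return 0.0  # no correct detections at all
    f1 = 2.0 * precision * recall / (precision + recall)
    return f1 / math.log2(1.0 + granularity)


# Hypothetical parameter settings, for illustration only.
cautious = plagdet(precision=0.90, recall=0.60, granularity=1.20)
permissive = plagdet(precision=0.80, recall=0.70, granularity=1.00)

print(f"cautious:   {cautious:.3f}")
print(f"permissive: {permissive:.3f}")
```

With these assumed figures, the permissive setting scores higher although its precision is lower, which matches the direction in which the parameters were pushed.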