yenya: introduction and data that I want to include here

author Jan "Yenya" Kasprzak <kas@fi.muni.cz>

Fri, 10 Aug 2012 16:18:33 +0000 (18:18 +0200)

committer Jan "Yenya" Kasprzak <kas@fi.muni.cz>

Fri, 10 Aug 2012 16:18:33 +0000 (18:18 +0200)
diff --git a/yenya-detailed.tex b/yenya-detailed.tex

index 525a1d3e5fc9bf0e3a92b43da321a4bb58c933e1..f2dd93f257c33b7a5a9672999da5e930d97d3276 100644 (file)
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -1,10 +1,40 @@
  \section{Detailed Document Comparison}
  
+\label{detailed}
  
-\subsection{General Approach}
+The detailed comparison task of PAN 2012 consisted of comparing
+given document pairs, with the expected output being an annotation of
+the similarities found between these documents.
+The submitted program was run in a controlled environment
+separately for each document pair, without the possibility of keeping any
+data between runs.
  
-Our approach in PAN 2012 Plagiarism detection---Detailed comparison sub-task
+In this section, we describe our approach to the detailed comparison
+task. The rest of this section is organized as follows: in the next
+subsection, we summarise the differences from our previous approach.
+In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+TODO: describe how it ends up in the final version.
+
+\subsection{Differences from PAN 2010}
+
+Our approach in this task
  is loosely based on the approach we have used in PAN 2010 \cite{Kasprzak2010}.
+The main difference is that instead of looking for similarities of
+one type (for PAN 2010, we have used word 5-grams),
+we have developed a method of evaluating multiple types of similarities
+(we call them {\it common features}) of different properties, such as
+density and length.
+
+As a proof of concept, we have used two types of common features: word
+5-grams and stop-word 8-grams, the latter being based on the method described in
+\cite{stamatatos2011plagiarism}.
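The two feature types can be illustrated with a minimal Python sketch. It is not the submitted implementation: the stop-word list below is a short hypothetical placeholder, and the tokenizer and all function names are assumptions made for illustration only.

```python
import re

# Hypothetical, abbreviated stop-word list; the list actually used is not given here.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it", "for"}


def tokenize(text):
    """Return (word, start_offset, end_offset) for each lowercased word."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def ngram_features(tokens, n):
    """Map each n-gram of tokens to the character spans where it occurs."""
    features = {}
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        key = " ".join(word for word, _, _ in window)
        features.setdefault(key, []).append((window[0][1], window[-1][2]))
    return features


def word_5grams(text):
    """Word 5-grams over all words of the text."""
    return ngram_features(tokenize(text), 5)


def stopword_8grams(text):
    """8-grams over the stop words only, all other words being skipped."""
    stop_tokens = [t for t in tokenize(text) if t[0] in STOP_WORDS]
    return ngram_features(stop_tokens, 8)


def common_features(susp, src, extractor):
    """Features present in both documents, with their spans in each."""
    f_susp, f_src = extractor(susp), extractor(src)
    return {k: (f_susp[k], f_src[k]) for k in f_susp.keys() & f_src.keys()}
```

Both extractors return the same structure, so a later stage can treat every feature type uniformly and distinguish them only by properties such as density and length.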
+
+In addition to the above, we have made several minor improvements to the
+algorithm, such as parameter tuning and improving the merging of
+detections in the post-processing stage.
+
+\subsection{Algorithm Overview}
+\label{sec-alg-overview}
  
  The algorithm evaluates the document pair in several stages:
  
@@ -18,65 +48,6 @@ The algorithm evaluates the document pair in several stages:
  \item post-processing phase, which mainly serves to merge nearby common intervals
  \end{itemize}
  
-\subsection{Intrinsic plagiarism detection}
-
-Our approach is based on character $n$-gram profiles of the interval of
-the fixed size (in terms of $n$-grams), and their differences to the
-profile of the whole document \cite{pan09stamatatos}. We have further
-enhanced the approach with using gaussian smoothing of the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and using the different measure of the difference between
-the n-gram profiles. We have used an approach similar to \cite{ngram},
-where we have compute the profile as an ordered set of 400 most-frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set, we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used the value inveresly
-proportional to the $n$-gram order in the profile, in accordance with
-the Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided more stable style-change function than
-than the one proposed in \cite{pan09stamatatos}. Because of pair-wise
-nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, therefore we wanted to use them
-as hints to the external detection.
-
-\subsection{Cross-lingual Plagiarism Detection}
-
-For language detection, we used the $n$-gram based categorization \cite{ngram}.
-We have computed the language profiles from the source documents of the
-training corpus (using the annotations from the corpus itself). The result
-of this approach was better than using the stopwords-based detection we have
-used in PAN 2010. However, there were still mis-detected documents,
-mainly the long lists of surnames and other tabular data. We have added
-an ad-hoc fix, where for documents having their profile too distant from all of
-English, German, and Spanish profiles, we have declared them to be in English.
-
-For cross-lingual plagiarism detection, our aim was to use the public
-interface to Google translate if possible, and use the resulting document
-as the source for standard intra-lingual detector.
-Should the translation service not be available, we wanted
-to use the fall-back strategy of translating isolated words only,
-with the additional exact matching of longer words (we have used words with
-5 characters or longer).
-We have supposed that these longer words can be names or specialized terms,
-present in both languages.
-
-We have used dictionaries from several sources, like
-{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
-{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
-and translated document were aligned on a line-by-line basis.
-
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document. We have then treated
-the translated documents the same way as the source documents in English.
-
  \subsection{Multi-feature Plagiarism Detection}
  
  Our pair-wise plagiarism detection is based on finding common passages
@@ -160,6 +131,73 @@ bigger than number of features per character in the possible joined interval.
  
  These parameters were fine-tuned to achieve the best results on the training corpus. With these parameters, our algorithm got the total plagdet score of 0.73 on the training corpus.
  
+\subsection{Other Approaches Tried}
+
+There are several other approaches we have evaluated, but which were
+omitted from our final submission for various reasons. We think mentioning
+them here is worthwhile nevertheless.
+
+\subsubsection{Intrinsic Plagiarism Detection}
+
+Our approach is based on character $n$-gram profiles of intervals of
+a fixed size (in terms of $n$-grams), and their differences from the
+profile of the whole document \cite{pan09stamatatos}. We have further
+enhanced the approach by using Gaussian smoothing of the style-change
+function \cite{Kasprzak2010}.
+
+For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
+of only 3-grams, and with using a different measure of the difference between
+the $n$-gram profiles. We have used an approach similar to \cite{ngram},
+where we have computed the profile as an ordered set of the 400 most frequent
+$n$-grams in a given text (the whole document or a partial window). Apart
+from ordering the set, we have ignored the actual number of occurrences
+of a given $n$-gram altogether, and used a value inversely
+proportional to the $n$-gram order in the profile, in accordance with
+Zipf's law \cite{zipf1935psycho}.
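The rank-based profile described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code we ran: the function names are hypothetical, and the sum of absolute weight differences is only one possible difference measure, since the exact measure is not specified here.

```python
from collections import Counter


def ngram_profile(text, sizes=(1, 2, 3), top=400):
    """Profile of the `top` most frequent character n-grams.

    Occurrence counts are used only for ordering; the stored weight
    is 1/rank, so it is inversely proportional to the n-gram's order.
    """
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: 1.0 / rank for rank, gram in enumerate(ranked, start=1)}


def profile_difference(window_profile, doc_profile):
    """Assumed measure: sum of absolute weight differences over all n-grams."""
    grams = window_profile.keys() | doc_profile.keys()
    return sum(abs(window_profile.get(g, 0.0) - doc_profile.get(g, 0.0))
               for g in grams)
```

Evaluating `profile_difference` for a window sliding over the document against the profile of the whole document yields a style-change function of the kind discussed in this subsection.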
+
+This approach has provided a more stable style-change function
+than the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
+nature of the detailed comparison sub-task, we could not use the results
+of the intrinsic detection directly, so we wanted to use them
+as hints for the external detection.
+
+\subsubsection{Language Detection}
+
+For language detection, we used the $n$-gram based categorization \cite{ngram}.
+We have computed the language profiles from the source documents of the
+training corpus (using the annotations from the corpus itself). The result
+of this approach was better than using the stopwords-based detection we have
+used in PAN 2010. However, there were still mis-detected documents,
+mainly the long lists of surnames and other tabular data. We have added
+an ad-hoc fix, where for documents having their profile too distant from all of
+English, German, and Spanish profiles, we have declared them to be in English.
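The fallback rule can be sketched in Python as follows. The sketch assumes the out-of-place rank distance commonly used with $n$-gram based categorization. The threshold `max_distance`, the language labels, and all function names are hypothetical, as the actual values are not stated here.

```python
from collections import Counter


def rank_profile(text, n_max=3, top=400):
    """Map each of the `top` most frequent character n-grams to its rank."""
    counts = Counter(text[i:i + n]
                     for n in range(1, n_max + 1)
                     for i in range(len(text) - n + 1))
    return {gram: rank
            for rank, (gram, _) in enumerate(counts.most_common(top))}


def out_of_place(doc, lang, top=400):
    """Sum of rank differences; n-grams missing from `lang` cost `top`."""
    return sum(abs(rank - lang[gram]) if gram in lang else top
               for gram, rank in doc.items())


def detect_language(text, lang_profiles, max_distance, fallback="en"):
    """Pick the nearest language profile, or `fallback` if all are too distant."""
    doc = rank_profile(text)
    best, distance = min(
        ((name, out_of_place(doc, profile))
         for name, profile in lang_profiles.items()),
        key=lambda pair: pair[1])
    return best if distance <= max_distance else fallback
```

With this rule, a document such as a long list of surnames, whose profile is far from every language profile, is labeled as English instead of being assigned to an arbitrary nearest language.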
+
+\subsubsection{Cross-lingual Plagiarism Detection}
+
+For cross-lingual plagiarism detection, our aim was to use the public
+interface to Google Translate if possible, and to use the resulting document
+as the source for the standard intra-lingual detector.
+Should the translation service not be available, we wanted
+to use the fall-back strategy of translating isolated words only,
+with the additional exact matching of longer words (we have used words with
+5 characters or longer).
+We have supposed that these longer words can be names or specialized terms,
+present in both languages.
+
+We have used dictionaries from several sources, like
+{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
+{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
+and translated document were aligned on a line-by-line basis.
+
+In the final form of the detailed comparison sub-task, the results of machine
+translation of the source documents were provided to the detector programs
+by the surrounding environment, so we have discarded the language detection
+and machine translation from our submission altogether, and used only
+line-by-line alignment of the source and translated document for calculating
+the offsets of text features in the source document. We have then treated
+the translated documents the same way as the source documents in English.
+
  \subsection{Further discussion}
  
  As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
@@ -179,4 +217,183 @@ In the full paper, we will also discuss the following topics:
  \nocite{pan09stamatatos}
  \nocite{ngram}
  
+\endinput
+
+What I want to discuss in the conclusion:
+- it was not possible to cache data
+- it was not possible to exclude overlapping similarities
+- so the run-time figures are completely worthless
+- 669 lines of code, not counting comments and blank lines
+- the boundaries between passages sometimes included whitespace and sometimes did not.
+
+Discussion of plagdet:
+- users want "odevzdej to show 0\% similarity"; they do not care
+       what the number means
+- the exact boundaries of a detected passage do not matter
+- false positives are far worse
+- granularity is evil
+
+Final results on the test corpus:
+
+0.7288 0.5994 0.9306 1.0007   2012-06-16 02:23   plagdt recall precis granul
+                            01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
+                            03-artificial-low    0.8726 0.8099 0.9477 1.0013
+                            04-artificial-high   0.3649 0.2255 0.9562 1.0000
+                            05-translation       0.7610 0.6662 0.8884 1.0008
+                            06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
+
+Results on the competition data:
+plagdet         precision       recall          granularity
+0.6826726      0.8931670       0.5524708       1.0000000
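For reference, the plagdet score combines the other three columns: it is the harmonic mean of precision and recall divided by the binary logarithm of one plus the granularity. The Python sketch below recomputes it from the reported values. The function name is only illustrative, and the first results table lists recall before precision, while the second lists precision first.

```python
import math


def plagdet(precision, recall, granularity):
    """plagdet = F1(precision, recall) / log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Overall results on the test corpus (precision, recall, granularity).
print(round(plagdet(0.9306, 0.5994, 1.0007), 4))
# Results on the competition data.
print(round(plagdet(0.8931670, 0.5524708, 1.0), 4))
```

Since the denominator equals 1 only when the granularity is exactly 1, any fragmentation of a detection lowers the score even if precision and recall stay the same.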
+
+Run-time:
+12500 seconds: tokenization including sc and language detection
+2500 seconds: without sc and language detection
+14 seconds: evaluation of valid intervals and post-processing
+
+
+TODO:
+- set the boundary according to the matching density
+- sort the XML by this_offset
+
+Here is the content of the JOURNAL file - how I measured some of the improvements:
+==================================================================================
+baseline.py
+0.1250 0.1259 0.9783 2.4460   2012-05-03 06:02   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
+                            03_artificial_low    0.1006 0.1118 0.9979 2.9974
+                            04_artificial_high   0.0054 0.0029 0.9991 1.0778
+                            05_translation       0.0003 0.0002 1.0000 1.2143
+                            06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
+
+valid_intervals without post-processing (this is how I first submitted it)
+0.3183 0.2034 0.9883 1.0850   2012-05-25 15:25   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
+                            03_artificial_low    0.4127 0.3006 0.9975 1.1724
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
+
+postprocessed (merging of nearby intervals)
+0.3350 0.2051 0.9863 1.0188   2012-05-25 15:27   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
+                            03_artificial_low    0.4541 0.3057 0.9942 1.0417
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
+
+whitespace (whitespace handling adjusted)
+0.3353 0.2053 0.9858 1.0188   2012-05-31 17:57   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
+                            03_artificial_low    0.4546 0.3061 0.9940 1.0417
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
+
+gap_100: whitespace, + allow a gap of 100 5-grams in a valid interval instead of 50
+0.3696 0.2305 0.9838 1.0148   2012-05-31 18:07   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
+                            03_artificial_low    0.5423 0.3846 0.9922 1.0310
+                            04_artificial_high   0.0058 0.0029 0.9151 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
+
+gap_200: whitespace, + allow a gap of 200 5-grams in a valid interval instead of 50
+0.3906 0.2456 0.9769 1.0070   2012-05-31 18:09   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
+                            03_artificial_low    0.5976 0.4346 0.9875 1.0139
+                            04_artificial_high   0.0087 0.0044 0.9374 1.0000
+                            05_translation       0.0001 0.0001 1.0000 1.0000
+                            06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
+
+gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
+0.4436 0.2962 0.9660 1.0308   2012-05-31 18:11   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
+                            03_artificial_low    0.7048 0.5808 0.9873 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
+0.4432 0.2959 0.9658 1.0310   2012-06-01 16:41   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
+                            03_artificial_low    0.7045 0.5806 0.9872 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0000 0.0000 0.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+
+swng_unsorted with the same post-processing as "whitespace" above
+0.2673 0.1584 0.9281 1.0174   2012-05-31 14:20   plagdt recall precis granul
+                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
+                            03_artificial_low    0.3178 0.1952 0.9954 1.0377
+                            04_artificial_high   0.0169 0.0095 0.9581 1.1707
+                            05_translation       0.0042 0.0028 0.0080 1.0000
+                            06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
+
+swng_sorted
+0.2550 0.1906 0.4067 1.0253   2012-05-30 16:07   plagdt recall precis granul
+                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
+                            03_artificial_low    0.4093 0.2867 0.8093 1.0483
+                            04_artificial_high   0.0454 0.0253 0.4371 1.0755
+                            05_translation       0.0030 0.0019 0.0064 1.0000
+                            06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
+
+sort_susp: gap_200_int_10, + in post-processing, sort intervals by their offset in susp, not in src
+0.4437 0.2962 0.9676 1.0308   2012-06-01 18:06   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
+                            03_artificial_low    0.7048 0.5809 0.9871 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+post_gap2_16000: sort_susp, + merge two intervals if it is < 16000 characters and the gap is only half the size of those intervals (was 4000)
+0.4539 0.2983 0.9642 1.0054   2012-06-01 18:09   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
+                            03_artificial_low    0.7307 0.5883 0.9814 1.0094
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
+
+post_gap2_32000: sort_susp, + merge intervals < 32000 characters and the gap at least half the size
+0.4543 0.2986 0.9638 1.0050   2012-06-01 18:12   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+                            03_artificial_low    0.7315 0.5893 0.9798 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap2_64000: sort_susp, + merge intervals < 32000 characters and the gap at least half the size
+0.4543 0.2988 0.9616 1.0050   2012-06-01 18:21   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
+                            03_artificial_low    0.7316 0.5901 0.9782 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap1_2000: post_gap2_32000, + unconditionally join things that have a gap below 2000 (was 600)
+0.4543 0.2986 0.9635 1.0050   2012-06-01 18:29   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+                            03_artificial_low    0.7315 0.5895 0.9794 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+