From: Jan "Yenya" Kasprzak
Date: Fri, 10 Aug 2012 16:18:33 +0000 (+0200)
Subject: yenya: introduction and the data I want to include here
X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?p=pan12-paper.git;a=commitdiff_plain;h=156645b9e5fe38063870bb9e6447a7167ab44a28

yenya: introduction and the data I want to include here
---

diff --git a/yenya-detailed.tex b/yenya-detailed.tex
index 525a1d3..f2dd93f 100644
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -1,10 +1,40 @@
 \section{Detailed Document Comparison}
+\label{detailed}
 
-\subsection{General Approach}
+The detailed comparison task of PAN 2012 consisted in a comparison
+of given document pairs, with the expected output being the annotation of
+similarities found between these documents.
+The submitted program was run in a controlled environment
+separately for each document pair, without the possibility of keeping any
+data between runs.
 
-Our approach in PAN 2012 Plagiarism detection---Detailed comparison sub-task
+In this section, we describe our approach in the detailed comparison
+task. The rest of this section is organized as follows: in the next
+subsection, we summarise the differences from our previous approach.
+In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+TODO: describe the final organization once it is settled.
+
+\subsection{Differences from PAN 2010}
+
+Our approach in this task
 is loosely based on the approach we have used in PAN 2010 \cite{Kasprzak2010}.
+The main difference is that instead of looking for similarities of
+one type (for PAN 2010, we used word 5-grams),
+we have developed a method of evaluating multiple types of similarities
+(we call them {\it common features}) with different properties, such as
+density and length.
+
+As a proof of concept, we have used two types of common features: word
+5-grams and stop-word 8-grams, the latter being based on the method described in
+\cite{stamatatos2011plagiarism}.
+
+In addition to the above, we have made several minor improvements to the
+algorithm, such as parameter tuning and improving the merging of
+detections in the post-processing stage.
+
+\subsection{Algorithm Overview}
+\label{sec-alg-overview}
 
 The algorithm evaluates the document pair in several stages:
 
@@ -18,65 +48,6 @@ The algorithm evaluates the document pair in several stages:
 \item post-processing phase, mainly serves for merging the nearby common intervals
 \end{itemize}
 
-\subsection{Intrinsic plagiarism detection}
-
-Our approach is based on character $n$-gram profiles of intervals of
-a fixed size (in terms of $n$-grams), and their differences from the
-profile of the whole document \cite{pan09stamatatos}. We have further
-enhanced the approach by applying Gaussian smoothing to the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and with a different measure of the difference between
-the $n$-gram profiles. We have used an approach similar to \cite{ngram},
-where we compute the profile as an ordered set of the 400 most frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set, we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used a value inversely
-proportional to the rank of the $n$-gram in the profile, in accordance with
-Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided a more stable style-change function
-than the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
-nature of the detailed comparison sub-task, we could not use the results
-of the intrinsic detection directly; we therefore wanted to use them
-as hints for the external detection.
-
-\subsection{Cross-lingual Plagiarism Detection}
-
-For language detection, we used the $n$-gram-based categorization \cite{ngram}.
-We have computed the language profiles from the source documents of the
-training corpus (using the annotations from the corpus itself). The result
-of this approach was better than that of the stop-word-based detection we
-used in PAN 2010. However, there were still mis-detected documents,
-mainly long lists of surnames and other tabular data. We have added
-an ad-hoc fix: documents whose profile was too distant from all of the
-English, German, and Spanish profiles were declared to be in English.
-
-For cross-lingual plagiarism detection, our aim was to use the public
-interface to Google Translate if possible, and use the resulting document
-as the source for a standard intra-lingual detector.
-Should the translation service not be available, we wanted
-to use the fall-back strategy of translating isolated words only,
-with additional exact matching of longer words (we have used words of
-5 characters or more).
-We have supposed that these longer words can be names or specialized terms,
-present in both languages.
-
-We have used dictionaries from several sources, such as
-{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
-{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
-and translated documents were aligned on a line-by-line basis.
-
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated documents for calculating
-the offsets of text features in the source document. We have then treated
-the translated documents the same way as the source documents in English.
-
 \subsection{Multi-feature Plagiarism Detection}
 
 Our pair-wise plagiarism detection is based on finding common passages
@@ -160,6 +131,73 @@ bigger than number of features per character in the possible joined interval.
 These parameters were fine-tuned to achieve the best results on the training corpus.
 With these parameters, our algorithm got the total plagdet score of 0.73 on the training corpus.
 
+\subsection{Other Approaches Tried}
+
+There are several other approaches we have evaluated, but which were
+omitted from our final submission for various reasons. We think mentioning
+them here is worthwhile nevertheless.
+
+\subsubsection{Intrinsic Plagiarism Detection}
+
+Our approach is based on character $n$-gram profiles of intervals of
+a fixed size (in terms of $n$-grams), and their differences from the
+profile of the whole document \cite{pan09stamatatos}. We have further
+enhanced the approach by applying Gaussian smoothing to the style-change
+function \cite{Kasprzak2010}.
+
+For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
+of only 3-grams, and with a different measure of the difference between
+the $n$-gram profiles. We have used an approach similar to \cite{ngram},
+where we compute the profile as an ordered set of the 400 most frequent
+$n$-grams in a given text (the whole document or a partial window). Apart
+from ordering the set, we have ignored the actual number of occurrences
+of a given $n$-gram altogether, and used a value inversely
+proportional to the rank of the $n$-gram in the profile, in accordance with
+Zipf's law \cite{zipf1935psycho}.
+
+This approach has provided a more stable style-change function
+than the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
+nature of the detailed comparison sub-task, we could not use the results
+of the intrinsic detection directly; we therefore wanted to use them
+as hints for the external detection.
+
+\subsubsection{Language Detection}
+
+For language detection, we used the $n$-gram-based categorization \cite{ngram}.
+We have computed the language profiles from the source documents of the
+training corpus (using the annotations from the corpus itself). The result
+of this approach was better than that of the stop-word-based detection we
+used in PAN 2010. However, there were still mis-detected documents,
+mainly long lists of surnames and other tabular data. We have added
+an ad-hoc fix: documents whose profile was too distant from all of the
+English, German, and Spanish profiles were declared to be in English.
+
+\subsubsection{Cross-lingual Plagiarism Detection}
+
+For cross-lingual plagiarism detection, our aim was to use the public
+interface to Google Translate if possible, and use the resulting document
+as the source for a standard intra-lingual detector.
+Should the translation service not be available, we wanted
+to use the fall-back strategy of translating isolated words only,
+with additional exact matching of longer words (we have used words of
+5 characters or more).
+We have supposed that these longer words can be names or specialized terms,
+present in both languages.
+
+We have used dictionaries from several sources, such as
+{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
+{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
+and translated documents were aligned on a line-by-line basis.
+
+In the final form of the detailed comparison sub-task, the results of machine
+translation of the source documents were provided to the detector programs
+by the surrounding environment, so we have discarded the language detection
+and machine translation from our submission altogether, and used only
+line-by-line alignment of the source and translated documents for calculating
+the offsets of text features in the source document. We have then treated
+the translated documents the same way as the source documents in English.
+
 \subsection{Further discussion}
 
 As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
@@ -179,4 +217,183 @@ In the full paper, we will also discuss the following topics:
 \nocite{pan09stamatatos}
 \nocite{ngram}
 
+\endinput
+
+What I want to discuss in the conclusion:
+- it was not possible to cache data
+- it was not possible to exclude overlapping similarities
+- so the run-time figures are completely useless
+- 669 lines of code, not counting comments and blank lines
+- the boundary between passages sometimes included whitespace and sometimes did not.
+
+Discussion of plagdet:
+- users want "odevzdej to show 0\% similarity", they do not care
+  what the number means
+- the boundaries of the detected passage do not matter
+- false positives are far worse
+- granularity is evil
+
+Final results on the test corpus:
+
+0.7288 0.5994 0.9306 1.0007 2012-06-16 02:23 plagdt recall precis granul
+	01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
+	02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
+	03-artificial-low    0.8726 0.8099 0.9477 1.0013
+	04-artificial-high   0.3649 0.2255 0.9562 1.0000
+	05-translation       0.7610 0.6662 0.8884 1.0008
+	06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
+
+Results on the competition data:
+plagdet   precision recall    granularity
+0.6826726 0.8931670 0.5524708 1.0000000
+
+Run-time:
+12500 seconds: tokenization including sc and language detection
+2500 seconds: without sc and language detection
+14 seconds: evaluation of valid intervals and post-processing
+
+
+TODO:
+- boundary according to the matching density
+- sort the XML by this_offset
+
+Here is the content of the JOURNAL file - how I measured some of the improvements:
+=================================================================
+baseline.py
+0.1250 0.1259 0.9783 2.4460 2012-05-03 06:02 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
+	03_artificial_low    0.1006 0.1118 0.9979 2.9974
+	04_artificial_high   0.0054 0.0029 0.9991 1.0778
+	05_translation       0.0003 0.0002 1.0000 1.2143
+	06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
+
+valid_intervals without post-processing (this is how I first submitted it)
+0.3183 0.2034 0.9883 1.0850 2012-05-25 15:25 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
+	03_artificial_low    0.4127 0.3006 0.9975 1.1724
+	04_artificial_high   0.0008 0.0004 1.0000 1.0000
+	05_translation       0.0001 0.0000 1.0000 1.0000
+	06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
+
+postprocessed (merging of nearby intervals)
+0.3350 0.2051 0.9863 1.0188 2012-05-25 15:27 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
+	03_artificial_low    0.4541 0.3057 0.9942 1.0417
+	04_artificial_high   0.0008 0.0004 1.0000 1.0000
+	05_translation       0.0001 0.0000 1.0000 1.0000
+	06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
+
+whitespace (whitespace adjustment)
+0.3353 0.2053 0.9858 1.0188 2012-05-31 17:57 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
+	03_artificial_low    0.4546 0.3061 0.9940 1.0417
+	04_artificial_high   0.0008 0.0004 1.0000 1.0000
+	05_translation       0.0001 0.0000 1.0000 1.0000
+	06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
+
+gap_100: whitespace, + in a valid interval I allow a gap of 100 5-grams instead of 50
+0.3696 0.2305 0.9838 1.0148 2012-05-31 18:07 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
+	03_artificial_low    0.5423 0.3846 0.9922 1.0310
+	04_artificial_high   0.0058 0.0029 0.9151 1.0000
+	05_translation       0.0001 0.0000 1.0000 1.0000
+	06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
+
+gap_200: whitespace, + in a valid interval I allow a gap of 200 5-grams instead of 50
+0.3906 0.2456 0.9769 1.0070 2012-05-31 18:09 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
+	03_artificial_low    0.5976 0.4346 0.9875 1.0139
+	04_artificial_high   0.0087 0.0044 0.9374 1.0000
+	05_translation       0.0001 0.0001 1.0000 1.0000
+	06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
+
+gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
+0.4436 0.2962 0.9660 1.0308 2012-05-31 18:11 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
+	03_artificial_low    0.7048 0.5808 0.9873 1.0530
+	04_artificial_high   0.0457 0.0242 0.9762 1.0465
+	05_translation       0.0008 0.0004 1.0000 1.0000
+	06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
+0.4432 0.2959 0.9658 1.0310 2012-06-01 16:41 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
+	03_artificial_low    0.7045 0.5806 0.9872 1.0530
+	04_artificial_high   0.0457 0.0242 0.9762 1.0465
+	05_translation       0.0000 0.0000 0.0000 1.0000
+	06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+
+swng_unsorted with the same post-processing as "whitespace" above
+0.2673 0.1584 0.9281 1.0174 2012-05-31 14:20 plagdt recall precis granul
+	01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+	02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
+	03_artificial_low    0.3178 0.1952 0.9954 1.0377
+	04_artificial_high   0.0169 0.0095 0.9581 1.1707
+	05_translation       0.0042 0.0028 0.0080 1.0000
+	06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
+
+swng_sorted
+0.2550 0.1906 0.4067 1.0253 2012-05-30 16:07 plagdt recall precis granul
+	01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+	02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
+	03_artificial_low    0.4093 0.2867 0.8093 1.0483
+	04_artificial_high   0.0454 0.0253 0.4371 1.0755
+	05_translation       0.0030 0.0019 0.0064 1.0000
+	06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
+
+sort_susp: gap_200_int_10 + in post-processing I sort the intervals by their offset in susp, not in src
+0.4437 0.2962 0.9676 1.0308 2012-06-01 18:06 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
+	03_artificial_low    0.7048 0.5809 0.9871 1.0530
+	04_artificial_high   0.0457 0.0242 0.9762 1.0465
+	05_translation       0.0008 0.0004 1.0000 1.0000
+	06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+post_gap2_16000: sort_susp, + merge two intervals if it is < 16000 characters and the gap is only half the size of those intervals (was 4000)
+0.4539 0.2983 0.9642 1.0054 2012-06-01 18:09 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
+	03_artificial_low    0.7307 0.5883 0.9814 1.0094
+	04_artificial_high   0.0480 0.0247 0.9816 1.0078
+	05_translation       0.0008 0.0004 1.0000 1.0000
+	06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
+
+post_gap2_32000: sort_susp, + merge intervals < 32000 characters and gap at least half the size
+0.4543 0.2986 0.9638 1.0050 2012-06-01 18:12 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+	03_artificial_low    0.7315 0.5893 0.9798 1.0085
+	04_artificial_high   0.0480 0.0247 0.9816 1.0078
+	05_translation       0.0008 0.0004 1.0000 1.0000
+	06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap2_64000: sort_susp, + merge intervals < 32000 characters and gap at least half the size
+0.4543 0.2988 0.9616 1.0050 2012-06-01 18:21 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
+	03_artificial_low    0.7316 0.5901 0.9782 1.0085
+	04_artificial_high   0.0480 0.0247 0.9816 1.0078
+	05_translation       0.0008 0.0004 1.0000 1.0000
+	06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap1_2000: post_gap2_32000, + join unconditionally anything with a gap under 2000 (was 600)
+0.4543 0.2986 0.9635 1.0050 2012-06-01 18:29 plagdt recall precis granul
+	01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+	02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+	03_artificial_low    0.7315 0.5895 0.9794 1.0085
+	04_artificial_high   0.0480 0.0247 0.9816 1.0078
+	05_translation       0.0008 0.0004 1.0000 1.0000
+	06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
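Editorial note on reading the score listings in the notes above: each run's first row gives plagdet, recall, precision, and granularity in that order. The sketch below is not taken from the authors' code. It assumes the standard PAN definition of plagdet (the F1 measure of recall and precision, divided by the base-2 logarithm of one plus granularity), and the function name `plagdet` is a hypothetical helper chosen for illustration. Under that assumption, the overall rows of the listings appear to be reproducible from their own recall, precision, and granularity columns.

```python
import math


def plagdet(recall: float, precision: float, granularity: float) -> float:
    """Combine recall, precision and granularity into one score.

    Assumes the usual PAN definition: F1 / log2(1 + granularity).
    A granularity of 1.0 (each plagiarized passage reported once)
    leaves F1 unchanged, because log2(2) == 1.
    """
    if recall + precision == 0:
        # No true detections at all: define the score as zero.
        return 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return f1 / math.log2(1 + granularity)


# Overall row of the final test-corpus run (recall, precision, granularity).
final_score = plagdet(recall=0.5994, precision=0.9306, granularity=1.0007)
print(f"{final_score:.4f}")

# Overall row of the baseline run; its granularity of 2.4460 is what
# pulls the score far below the F1 value.
baseline_score = plagdet(recall=0.1259, precision=0.9783, granularity=2.4460)
print(f"{baseline_score:.4f}")
```

The granularity term explains why the merging steps recorded in the journal raise the score even where recall barely changes: bringing granularity closer to 1 shrinks the denominator.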