1 \section{Detailed Document Comparison}~\label{yenya}
The detailed comparison task of PAN 2012 consisted of comparing
given document pairs, with the expected output being an annotation of
the similarities found between these documents.
The submitted program was run in a controlled environment,
separately for each document pair, without the possibility of keeping any
cached data between the runs.
In this section, we describe our approach to the detailed comparison
13 task. The rest of this section is organized as follows: in the next
14 subsection, we summarise the differences from our previous approach.
15 In subsection \ref{sec-alg-overview}, we give an overview of our approach.
% TODO: describe how the rest of the section will finally be organized.
\subsection{Differences from PAN 2010}
Our approach to this task
is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
The main difference is that instead of looking for similarities of
one type (in PAN 2010, we used word 5-grams),
we have developed a method of evaluating multiple types of similarities
(we call them {\it common features}) with different properties.
As a proof of concept, we have used two types of common features: word
5-grams and stopword 8-grams, the latter being based on the method described in
\cite{stamatatos2011plagiarism}.
In addition to the above, we have made several minor improvements to the
algorithm, such as parameter tuning and improved merging of detections
in the post-processing stage.
36 \subsection{Algorithm Overview}
37 \label{sec-alg-overview}
39 The algorithm evaluates the document pair in several stages:
\begin{itemize}
\item intrinsic plagiarism detection
\item language detection of the source document
\item cross-lingual plagiarism detection, if the source document is not in English
\item detecting intervals with common features
\item post-processing phase, which mainly serves for merging nearby common intervals
\end{itemize}
51 \subsection{Multi-feature Plagiarism Detection}
53 Our pair-wise plagiarism detection is based on finding common passages
54 of text, present both in the source and in the suspicious document. We call them
{\it common features}. In PAN 2010, we used sorted word 5-grams, formed from
words of three or more characters, as features to compare.
Recently, other means of plagiarism detection have been explored;
stopword $n$-gram detection is one of them
\cite{stamatatos2011plagiarism}.
We propose a plagiarism detection system based on detecting common
features of various types, for example word $n$-grams, stopword $n$-grams,
translated single words, translated word bigrams,
or exact matches of longer words in document pairs where each document
is in a different language. The system
has to be largely independent of the specifics of the individual
feature types. It cannot, for example, use the order of given features
as a measure of distance between the features, because several
word 5-grams can be fully contained inside one stopword 8-gram.
We therefore propose to describe the {\it common feature} of two documents
(susp and src) with the following tuple:
$\langle
\hbox{offset}_{\hbox{susp}},
\hbox{length}_{\hbox{susp}},
\hbox{offset}_{\hbox{src}},
\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
described purely in terms of character offsets belonging to the feature
in both documents. In our final submission, we have used the following two types
of common features:
\begin{itemize}
\item word 5-grams, from words of three or more characters, sorted, lowercased
\item stopword 8-grams, from the 50 most-frequent English words (including
the possessive suffix 's), unsorted, lowercased, with 8-grams formed
only from the seven most-frequent words ({\it the, of, and, a, in, to, 's})
removed
\end{itemize}
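
To make the tuple representation concrete, the following is a minimal Python
sketch of how common features of one type could be collected. It covers only
sorted, lowercased word 5-grams; the function names \texttt{word\_ngrams} and
\texttt{common\_features} and the regular-expression tokenizer are illustrative
assumptions, not the code of the submitted system.

```python
import re
from collections import defaultdict


def word_ngrams(text, n=5, min_len=3):
    """Yield (key, offset, length) for sorted, lowercased word n-grams."""
    # Keep only words of at least min_len characters, with their offsets.
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)
             if len(m.group()) >= min_len]
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        key = tuple(sorted(w for w, _, _ in window))  # order-insensitive key
        start, end = window[0][1], window[-1][2]
        yield key, start, end - start


def common_features(susp, src, extract=word_ngrams):
    """Return tuples (offset_susp, length_susp, offset_src, length_src)."""
    index = defaultdict(list)
    for key, off, length in extract(src):
        index[key].append((off, length))
    return [(off, length, s_off, s_len)
            for key, off, length in extract(susp)
            for s_off, s_len in index.get(key, ())]
```

Because every feature type is reduced to the same four character offsets,
another extractor (for example one producing stopword 8-grams) could be passed
as \texttt{extract} and its output simply concatenated with the word 5-gram
features.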
90 We have gathered all the common features of both types for a given document
91 pair, and formed {\it valid intervals} from them, as described
in \cite{Kasprzak2009a}. A similar approach is also used in
\cite{stamatatos2011plagiarism}.
The algorithm was modified for multi-feature detection to use character offsets
only, instead of feature order numbers. We have used valid intervals
96 consisting of at least 5 common features, with the maximum allowed gap
97 inside the interval (characters not belonging to any common feature
98 of a given valid interval) set to 3,500 characters.
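
The offset-based construction can be sketched as follows. This is a simplified
illustration under our own assumptions, not the exact algorithm of
\cite{Kasprzak2009a}: a group of features is split repeatedly wherever the
uncovered gap in either document exceeds the limit, and groups with too few
features are dropped. The names \texttt{split\_by\_gap},
\texttt{valid\_intervals} and \texttt{span} are hypothetical.

```python
def split_by_gap(features, start_idx, len_idx, max_gap):
    """Split features into runs with at most max_gap uncovered characters between neighbours."""
    runs, current, covered_to = [], [], 0
    for f in sorted(features, key=lambda f: f[start_idx]):
        if current and f[start_idx] - covered_to > max_gap:
            runs.append(current)
            current = []
        end = f[start_idx] + f[len_idx]
        covered_to = end if not current else max(covered_to, end)
        current.append(f)
    if current:
        runs.append(current)
    return runs


def valid_intervals(features, min_features=5, max_gap=3500):
    """Group (off_susp, len_susp, off_src, len_src) tuples into valid intervals."""
    result, pending = [], [list(features)]
    while pending:
        group = pending.pop()
        if len(group) < min_features:
            continue
        # Try to split by gaps in the suspicious document, then in the source.
        for start_idx, len_idx in ((0, 1), (2, 3)):
            runs = split_by_gap(group, start_idx, len_idx, max_gap)
            if len(runs) > 1:
                pending.extend(runs)
                break
        else:
            result.append(group)  # no large gap in either document
    return result


def span(group):
    """Return (susp_start, susp_end, src_start, src_end) covering a group."""
    return (min(f[0] for f in group), max(f[0] + f[1] for f in group),
            min(f[2] for f in group), max(f[2] + f[3] for f in group))
```

The default values of \texttt{min\_features} and \texttt{max\_gap} are the ones
stated above (5 features and 3,500 characters).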
We have also experimented with adjusting the allowed gap size using the
intrinsic plagiarism detection: allowing only a shorter gap if the common
features around the gap belong to different passages detected as plagiarized
in the suspicious document by the intrinsic detector, and allowing a larger gap
if both surrounding common features belong to the same passage
detected by the intrinsic detector. This approach, however, did not show
any improvement over an allowed gap of a static size, so it was omitted
from the final submission.
109 \subsection{Postprocessing}
In the postprocessing phase, we took the resulting valid intervals
and attempted to further improve the results. We first
removed overlaps: if both overlapping intervals were
shorter than 300 characters, we removed both of them. Otherwise, we
kept the longer detection (longer in terms of length in the suspicious document).
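
A minimal sketch of this overlap rule is shown below. It assumes that a
detection is a tuple (susp\_start, susp\_end, src\_start, src\_end) and that
overlaps are tested in the suspicious document only; both are our assumptions,
as is the function name \texttt{remove\_overlaps}.

```python
def remove_overlaps(detections, min_len=300):
    """Resolve overlapping detections (susp_start, susp_end, src_start, src_end)."""
    kept = []
    for det in sorted(detections, key=lambda d: d[0]):
        if kept and det[0] < kept[-1][1]:  # overlaps the last kept detection
            prev = kept.pop()
            len_prev, len_det = prev[1] - prev[0], det[1] - det[0]
            if len_prev < min_len and len_det < min_len:
                continue  # both are short: drop both
            # Otherwise keep the one that is longer in the suspicious document.
            kept.append(prev if len_prev >= len_det else det)
        else:
            kept.append(det)
    return kept
```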
We then joined adjacent valid intervals into one detection
if at least one of the following criteria was met:
\begin{itemize}
\item the gap between the intervals contained at least 4 common features,
and it contained at least one feature per 10,000
characters\footnote{We have computed the length of the gap as the number
of characters between the detections in the source document, plus the
number of characters between the detections in the suspicious document.}, or
\item the gap was smaller than 30,000 characters and the size of the adjacent
valid intervals was at least twice as big as the gap between them, or
\item the gap was smaller than 30,000 characters and the number of common
features per character in the adjacent interval was not more than three times
bigger than the number of features per character in the possible joined interval.
\end{itemize}
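
The three joining criteria can be written as a single predicate. The sketch
below is one possible reading of them, with several assumptions of ours: the
size of an interval is taken as its length in the suspicious document plus its
length in the source document, the second criterion uses the combined size of
both intervals, and the third criterion is required to hold for both adjacent
intervals. The class \texttt{Interval} and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    susp_start: int
    susp_end: int
    src_start: int
    src_end: int
    n_features: int

    @property
    def size(self):
        # Assumption: size is measured in both documents, like the gap.
        return (self.susp_end - self.susp_start) + (self.src_end - self.src_start)


def _distance(s1, e1, s2, e2):
    """Number of characters between two ranges (0 if they touch or overlap)."""
    return max(0, max(s1, s2) - min(e1, e2))


def gap_size(a, b):
    """Gap in the suspicious document plus gap in the source document."""
    return (_distance(a.susp_start, a.susp_end, b.susp_start, b.susp_end)
            + _distance(a.src_start, a.src_end, b.src_start, b.src_end))


def should_join(a, b, gap_features, max_gap=30000,
                chars_per_feature=10000, density_ratio=3):
    gap = gap_size(a, b)
    # 1. enough features in the gap, at least one per 10,000 characters
    dense_gap = gap_features >= 4 and gap_features * chars_per_feature >= gap
    # 2. small gap relative to the size of the adjacent intervals
    small_gap = gap < max_gap and a.size + b.size >= 2 * gap
    # 3. feature density would not drop by more than a factor of three
    joined_n = a.n_features + gap_features + b.n_features
    joined_size = a.size + gap + b.size
    similar_density = gap < max_gap and all(
        x.n_features * joined_size <= density_ratio * joined_n * x.size
        for x in (a, b))
    return dense_gap or small_gap or similar_density
```

The density comparison is cross-multiplied to avoid dividing by the size of an
empty interval.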
These parameters were fine-tuned to achieve the best results on the training
corpus. With these parameters, our algorithm achieved a total plagdet score
of 0.73 on the training corpus.
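
For reference, the plagdet score \cite{potthastframework} combines the
$F_1$ measure of precision and recall with the granularity of the detections:
\[
F_1 = \frac{2 \cdot \hbox{precision} \cdot \hbox{recall}}
           {\hbox{precision} + \hbox{recall}},
\qquad
\hbox{plagdet} = \frac{F_1}{\log_2 (1 + \hbox{granularity})}.
\]
As an illustration, a precision of $0.9306$, a recall of $0.5994$ and
a granularity of $1.0007$ give $F_1 \approx 0.7292$ and
$\hbox{plagdet} \approx 0.7288$.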
134 \subsection{Other Approaches Tried}
136 There are several other approaches we have evaluated, but which were
137 omitted from our final submission for various reasons. We think mentioning
138 them here is worthwhile nevertheless.
140 \subsubsection{Intrinsic Plagiarism Detection}
Our approach is based on character $n$-gram profiles of an interval of
a fixed size (in terms of $n$-grams), and their differences from the
profile of the whole document \cite{pan09stamatatos}. We have further
enhanced the approach by applying Gaussian smoothing to the style-change
function \cite{Kasprzak2010}.
For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
of only 3-grams, and with a different measure of the difference between
the $n$-gram profiles. We have used an approach similar to \cite{ngram},
where we have computed the profile as an ordered set of the 400 most-frequent
$n$-grams in a given text (the whole document or a partial window). Apart
from ordering the set, we have ignored the actual number of occurrences
of a given $n$-gram altogether, and used a value inversely
proportional to the $n$-gram order in the profile, in accordance with
Zipf's law \cite{zipf1935psycho}.
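
A rank-based profile of this kind can be sketched in a few lines of Python.
The weight $1/\hbox{rank}$ follows the description above; the distance between
two profiles (a sum of absolute weight differences) and the window measured in
characters rather than in $n$-grams are simplifying assumptions of ours, and
all function names are hypothetical.

```python
from collections import Counter


def profile(text, sizes=(1, 2, 3), top=400):
    """Map each of the `top` most frequent character n-grams to 1/rank."""
    counts = Counter(text[i:i + n] for n in sizes
                     for i in range(len(text) - n + 1))
    # Sort by descending frequency; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {g: 1.0 / rank for rank, g in enumerate(ranked, start=1)}


def profile_distance(p, q):
    """Assumed dissimilarity: sum of absolute differences of rank weights."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in p.keys() | q.keys())


def style_change(text, window=1000, step=200):
    """Distance of each sliding window's profile from the whole-document profile."""
    doc = profile(text)
    last_start = max(1, len(text) - window + 1)
    return [(start, profile_distance(profile(text[start:start + window]), doc))
            for start in range(0, last_start, step)]
```

The occurrence counts are used only for ordering; the resulting weights depend
solely on the position of an $n$-gram in the profile.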
This approach has provided a more stable style-change function
than the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
nature of the detailed comparison sub-task, we could not use the results
of the intrinsic detection directly; we therefore wanted to use them
as hints for the external detection.
164 \subsubsection{Language Detection}
For language detection, we used the $n$-gram based categorization \cite{ngram}.
We have computed the language profiles from the source documents of the
training corpus (using the annotations from the corpus itself). The result
of this approach was better than that of the stopword-based detection we
used in PAN 2010. However, there were still mis-detected documents,
mainly long lists of surnames and other tabular data. We have added
an ad-hoc fix: documents whose profile was too distant from all of the
English, German, and Spanish profiles were declared to be in English.
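
The categorization with the English fall-back can be illustrated as follows.
The sketch uses the out-of-place rank distance commonly associated with
$n$-gram based text categorization; the threshold
\texttt{max\_avg\_distance}, the penalty for missing $n$-grams and the function
names are hypothetical values chosen for illustration, not the parameters of
our detector.

```python
from collections import Counter


def rank_profile(text, sizes=(1, 2, 3), top=400):
    """Map each of the `top` most frequent character n-grams to its rank."""
    counts = Counter(text[i:i + n] for n in sizes
                     for i in range(len(text) - n + 1))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {g: rank for rank, g in enumerate(ranked)}


def out_of_place(doc, lang, penalty=400):
    """Sum of rank differences; n-grams missing from `lang` cost `penalty`."""
    return sum(abs(rank - lang[g]) if g in lang else penalty
               for g, rank in doc.items())


def detect_language(text, lang_profiles, max_avg_distance=300.0, default="en"):
    """Return the closest language, or `default` if every profile is too distant."""
    doc = rank_profile(text.lower())
    if not doc or not lang_profiles:
        return default
    best = min(lang_profiles,
               key=lambda name: out_of_place(doc, lang_profiles[name]))
    avg = out_of_place(doc, lang_profiles[best]) / len(doc)
    return best if avg <= max_avg_distance else default
```

A document consisting mostly of digits or tabular data shares almost no
$n$-grams with any language profile, so its average distance approaches the
penalty and the fall-back applies.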
175 \subsubsection{Cross-lingual Plagiarism Detection}
For cross-lingual plagiarism detection, our aim was to use the public
interface to Google Translate if possible, and to use the resulting document
as the source for the standard intra-lingual detector.
Should the translation service not be available, we wanted
to use a fall-back strategy of translating isolated words only,
with additional exact matching of longer words (we have used words of
5 characters or longer).
We have assumed that these longer words can be names or specialized terms
present in both languages.
187 We have used dictionaries from several sources, like
188 {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
189 {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
190 and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
191 and translated document were aligned on a line-by-line basis.
193 In the final form of the detailed comparison sub-task, the results of machine
194 translation of the source documents were provided to the detector programs
195 by the surrounding environment, so we have discarded the language detection
196 and machine translation from our submission altogether, and used only
197 line-by-line alignment of the source and translated document for calculating
198 the offsets of text features in the source document. We have then treated
199 the translated documents the same way as the source documents in English.
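
The offset calculation based on line-by-line alignment can be sketched as
follows. The sketch assumes that the translated text has exactly one line per
line of the original source document and maps a feature found in the
translation to the whole corresponding line or lines of the source; the
function names are hypothetical.

```python
from bisect import bisect_right


def line_starts(text):
    """Return the character offset at which each line of `text` starts."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def map_to_source(offset, length, translated, source):
    """Map a span in the translated text to (offset, length) in the source."""
    t_starts, s_starts = line_starts(translated), line_starts(source)
    # Indices of the first and last translated line touched by the span.
    first = bisect_right(t_starts, offset) - 1
    last = bisect_right(t_starts, offset + max(length, 1) - 1) - 1
    if last >= len(s_starts):
        raise ValueError("documents are not aligned line by line")
    start = s_starts[first]
    # End of the last line, excluding its newline character.
    end = s_starts[last + 1] - 1 if last + 1 < len(s_starts) else len(source)
    return start, end - start
```

The mapping is only as precise as a line, which is acceptable when the lines of
the documents are short compared to the detected passages.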
\subsection{Further Discussion}
203 As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
204 detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
205 when using it as a hint for the external plagiarism detection.
In the full paper, we will also discuss the following topics:
\begin{itemize}
\item language detection and cross-language common features
\item intrinsic plagiarism detection
\item suitability of the plagdet score~\cite{potthastframework} for performance measurement
\item feasibility of our approach in large-scale systems
\item discussion of parameter settings
\end{itemize}
217 \nocite{pan09stamatatos}
What I want to discuss in the conclusion:
- it was not possible to cache data
- it was not possible to exclude overlapping similarities
- so the run-time figures are completely meaningless
- 669 lines of code, excluding comments and blank lines
- the boundary between passages sometimes included whitespace and sometimes not.
- users want ``odevzdej to show 0\% similarity'', they are not interested in
- the boundaries of the detected passage do not matter
- false positives are far worse
Final results on the test corpus:
238 0.7288 0.5994 0.9306 1.0007 2012-06-16 02:23 plagdt recall precis granul
239 01-no-plagiarism 0.0000 0.0000 0.0000 1.0000
240 02-no-obfuscation 0.9476 0.9627 0.9330 1.0000
241 03-artificial-low 0.8726 0.8099 0.9477 1.0013
242 04-artificial-high 0.3649 0.2255 0.9562 1.0000
243 05-translation 0.7610 0.6662 0.8884 1.0008
244 06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
Results on the competition data:
247 plagdet precision recall granularity
248 0.6826726 0.8931670 0.5524708 1.0000000
12500 seconds: tokenization including sc and language detection
2500 seconds: without sc and language detection
14 seconds: evaluation of valid intervals and postprocessing
- boundary according to the matching density
- sort the XML by this_offset
Here is the content of the JOURNAL file, showing how I measured some of the improvements:
261 =================================================================
263 0.1250 0.1259 0.9783 2.4460 2012-05-03 06:02 plagdt recall precis granul
264 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
265 02_no_obfuscation 0.8608 0.8609 0.8618 1.0009
266 03_artificial_low 0.1006 0.1118 0.9979 2.9974
267 04_artificial_high 0.0054 0.0029 0.9991 1.0778
268 05_translation 0.0003 0.0002 1.0000 1.2143
269 06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
valid_intervals without postprocessing (this is how I first submitted it)
272 0.3183 0.2034 0.9883 1.0850 2012-05-25 15:25 plagdt recall precis granul
273 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
274 02_no_obfuscation 0.9861 0.9973 0.9752 1.0000
275 03_artificial_low 0.4127 0.3006 0.9975 1.1724
276 04_artificial_high 0.0008 0.0004 1.0000 1.0000
277 05_translation 0.0001 0.0000 1.0000 1.0000
278 06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
postprocessed (merging of nearby intervals)
281 0.3350 0.2051 0.9863 1.0188 2012-05-25 15:27 plagdt recall precis granul
282 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
283 02_no_obfuscation 0.9863 0.9973 0.9755 1.0000
284 03_artificial_low 0.4541 0.3057 0.9942 1.0417
285 04_artificial_high 0.0008 0.0004 1.0000 1.0000
286 05_translation 0.0001 0.0000 1.0000 1.0000
287 06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
whitespace (adjusted whitespace handling)
290 0.3353 0.2053 0.9858 1.0188 2012-05-31 17:57 plagdt recall precis granul
291 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
292 02_no_obfuscation 0.9865 0.9987 0.9745 1.0000
293 03_artificial_low 0.4546 0.3061 0.9940 1.0417
294 04_artificial_high 0.0008 0.0004 1.0000 1.0000
295 05_translation 0.0001 0.0000 1.0000 1.0000
296 06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
gap_100: whitespace, + in a valid interval I allow a gap of 100 5-grams instead of 50
299 0.3696 0.2305 0.9838 1.0148 2012-05-31 18:07 plagdt recall precis granul
300 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
301 02_no_obfuscation 0.9850 0.9987 0.9717 1.0000
302 03_artificial_low 0.5423 0.3846 0.9922 1.0310
303 04_artificial_high 0.0058 0.0029 0.9151 1.0000
304 05_translation 0.0001 0.0000 1.0000 1.0000
305 06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
gap_200: whitespace, + in a valid interval I allow a gap of 200 5-grams instead of 50
308 0.3906 0.2456 0.9769 1.0070 2012-05-31 18:09 plagdt recall precis granul
309 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
310 02_no_obfuscation 0.9820 0.9987 0.9659 1.0000
311 03_artificial_low 0.5976 0.4346 0.9875 1.0139
312 04_artificial_high 0.0087 0.0044 0.9374 1.0000
313 05_translation 0.0001 0.0001 1.0000 1.0000
314 06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
317 0.4436 0.2962 0.9660 1.0308 2012-05-31 18:11 plagdt recall precis granul
318 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
319 02_no_obfuscation 0.9612 0.9987 0.9264 1.0000
320 03_artificial_low 0.7048 0.5808 0.9873 1.0530
321 04_artificial_high 0.0457 0.0242 0.9762 1.0465
322 05_translation 0.0008 0.0004 1.0000 1.0000
323 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
326 0.4432 0.2959 0.9658 1.0310 2012-06-01 16:41 plagdt recall precis granul
327 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
328 02_no_obfuscation 0.9608 0.9980 0.9263 1.0000
329 03_artificial_low 0.7045 0.5806 0.9872 1.0530
330 04_artificial_high 0.0457 0.0242 0.9762 1.0465
331 05_translation 0.0000 0.0000 0.0000 1.0000
332 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
swng_unsorted with the same postprocessing as "whitespace" above
336 0.2673 0.1584 0.9281 1.0174 2012-05-31 14:20 plagdt recall precis granul
337 01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
338 02_no_obfuscation 0.9439 0.9059 0.9851 1.0000
339 03_artificial_low 0.3178 0.1952 0.9954 1.0377
340 04_artificial_high 0.0169 0.0095 0.9581 1.1707
341 05_translation 0.0042 0.0028 0.0080 1.0000
342 06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
345 0.2550 0.1906 0.4067 1.0253 2012-05-30 16:07 plagdt recall precis granul
346 01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
347 02_no_obfuscation 0.6648 0.9146 0.5222 1.0000
348 03_artificial_low 0.4093 0.2867 0.8093 1.0483
349 04_artificial_high 0.0454 0.0253 0.4371 1.0755
350 05_translation 0.0030 0.0019 0.0064 1.0000
351 06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
sort_susp: gap_200_int_10 + in postprocessing I sort the intervals by offset in susp, not in src
354 0.4437 0.2962 0.9676 1.0308 2012-06-01 18:06 plagdt recall precis granul
355 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
356 02_no_obfuscation 0.9641 0.9987 0.9317 1.0000
357 03_artificial_low 0.7048 0.5809 0.9871 1.0530
358 04_artificial_high 0.0457 0.0242 0.9762 1.0465
359 05_translation 0.0008 0.0004 1.0000 1.0000
360 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
post_gap2_16000: sort_susp, + merge two intervals if the gap is < 16000 characters and the gap is only half the size of those intervals (was 4000)
363 0.4539 0.2983 0.9642 1.0054 2012-06-01 18:09 plagdt recall precis granul
364 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
365 02_no_obfuscation 0.9631 0.9987 0.9300 1.0000
366 03_artificial_low 0.7307 0.5883 0.9814 1.0094
367 04_artificial_high 0.0480 0.0247 0.9816 1.0078
368 05_translation 0.0008 0.0004 1.0000 1.0000
369 06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
post_gap2_32000: sort_susp, + merge intervals < 32000 characters and the gap at least half the size
372 0.4543 0.2986 0.9638 1.0050 2012-06-01 18:12 plagdt recall precis granul
373 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
374 02_no_obfuscation 0.9628 0.9987 0.9294 1.0000
375 03_artificial_low 0.7315 0.5893 0.9798 1.0085
376 04_artificial_high 0.0480 0.0247 0.9816 1.0078
377 05_translation 0.0008 0.0004 1.0000 1.0000
378 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
post_gap2_64000: sort_susp, + merge intervals < 32000 characters and the gap at least half
382 0.4543 0.2988 0.9616 1.0050 2012-06-01 18:21 plagdt recall precis granul
383 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
384 02_no_obfuscation 0.9603 0.9987 0.9248 1.0000
385 03_artificial_low 0.7316 0.5901 0.9782 1.0085
386 04_artificial_high 0.0480 0.0247 0.9816 1.0078
387 05_translation 0.0008 0.0004 1.0000 1.0000
388 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
post_gap1_2000: post_gap2_32000, + join unconditionally anything with a gap under 2000 (was 600)
391 0.4543 0.2986 0.9635 1.0050 2012-06-01 18:29 plagdt recall precis granul
392 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
393 02_no_obfuscation 0.9628 0.9987 0.9294 1.0000
394 03_artificial_low 0.7315 0.5895 0.9794 1.0085
395 04_artificial_high 0.0480 0.0247 0.9816 1.0078
396 05_translation 0.0008 0.0004 1.0000 1.0000
397 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000