yenya: introduction and the data I want to include here

diff --git a/yenya-detailed.tex b/yenya-detailed.tex

index 3615ab9b45e364f79d2562566d9488569701ab40..f2dd93f257c33b7a5a9672999da5e930d97d3276 100644
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -1,11 +1,40 @@
  \section{Detailed Document Comparison}
  
-\subsection{General Approach}
+\label{detailed}
  
-The approach Masaryk University team has used in PAN 2012 Plagiarism
-detection---detailed comparison sub-task is based on the same approach
-that we have used in PAN 2010 \cite{Kasprzak2010}.  This time, we have
-used a similar approach, enhanced by several means
+The detailed comparison task of PAN 2012 consisted in a comparison
+of given document pairs, with the expected output being the annotation of
+similarities found between these documents.
+The submitted program was run in a controlled environment
+separately for each document pair, with no possibility of keeping any
+data between runs.
+
+In this section, we describe our approach to the detailed comparison
+task. The rest of this section is organized as follows: in the next
+subsection, we summarise the differences from our previous approach.
+In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+TODO: describe how it will be organized in the end.
+
+\subsection{Differences from PAN 2010}
+
+Our approach in this task
+is loosely based on the approach we used in PAN 2010 \cite{Kasprzak2010}.
+The main difference is that instead of looking for similarities of
+one type (in PAN 2010, we used word 5-grams),
+we have developed a method of evaluating multiple types of similarities
+(we call them {\it common features}) with different properties, such as
+density and length.
+
+As a proof of concept, we have used two types of common features: word
+5-grams and stop-word 8-grams, the latter being based on the method described in
+\cite{stamatatos2011plagiarism}.
+
+In addition to the above, we have made several minor improvements to the
+algorithm, such as tuning the parameters and improving the merging of
+detections in the post-processing stage.
+
+\subsection{Algorithm Overview}
+\label{sec-alg-overview}
  
  The algorithm evaluates the document pair in several stages:
  
@@ -19,7 +48,96 @@ The algorithm evaluates the document pair in several stages:
  \item post-processing phase, which mainly serves to merge nearby common intervals
  \end{itemize}
  
-\subsection{Intrinsic plagiarism detection}
+\subsection{Multi-feature Plagiarism Detection}
+
+Our pair-wise plagiarism detection is based on finding common passages
+of text, present both in the source and in the suspicious document. We call them
+{\it common features}. In PAN 2010, we used sorted word 5-grams, formed from
+words of three or more characters, as features to compare.
+Recently, other means of plagiarism detection have been explored:
+stopword $n$-gram detection is one of them
+\cite{stamatatos2011plagiarism}.
+
+We propose a plagiarism detection system based on detecting common
+features of various types, for example word $n$-grams, stopword $n$-grams,
+translated single words, translated word bigrams,
+exact common longer words from document pairs having each document
+in a different language, etc. The system
+has to be to a great extent independent of the specifics of the various
+feature types. It cannot, for example, use the order of given features
+as a measure of distance between the features, because, for instance, several
+word 5-grams can be fully contained inside one stopword 8-gram.
+
+We therefore propose to describe a {\it common feature} of two documents
+(susp and src) with the following tuple:
+$\langle
+\hbox{offset}_{\hbox{susp}},
+\hbox{length}_{\hbox{susp}},
+\hbox{offset}_{\hbox{src}},
+\hbox{length}_{\hbox{src}} \rangle$. This way, a common feature is
+described purely in terms of the character offsets and lengths it occupies
+in both documents. In our final submission, we used the following two types
+of common features:
+
+\begin{itemize}
+\item word 5-grams, from words of three or more characters, sorted, lowercased
+\item stopword 8-grams, from the 50 most-frequent English words (including
+       the possessive suffix 's), unsorted, lowercased, with 8-grams formed
+       only from the seven most-frequent words ({\it the, of, a, in, to, 's})
+       removed
+\end{itemize}
+
+We gathered all the common features of both types for a given document
+pair, and formed {\it valid intervals} from them, as described
+in \cite{Kasprzak2009a}. A similar approach is also used in
+\cite{stamatatos2011plagiarism}.
+The algorithm is modified for multi-feature detection to use only character
+offsets instead of feature order numbers. We used valid intervals
+consisting of at least 5 common features, with the maximum allowed gap
+inside the interval (characters not belonging to any common feature
+of a given valid interval) set to 3,500 characters.
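
The grouping of common features into intervals can be illustrated with a minimal Python sketch. It is a simplified greedy grouping written for illustration only, not the algorithm of \cite{Kasprzak2009a}. The names {\tt Feature}, {\tt bounds}, {\tt gap} and {\tt valid\_intervals} are hypothetical, and the sketch assumes that the allowed gap is checked separately in the suspicious and in the source document. Only the two thresholds (at least 5 features, gap of at most 3,500 characters) come from the text.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    """A common feature described purely by character offsets and lengths."""
    susp_off: int
    susp_len: int
    src_off: int
    src_len: int


def gap(lo: int, hi: int, start: int, length: int) -> int:
    """Characters between [lo, hi) and [start, start+length); 0 if they touch or overlap."""
    return max(0, start - hi, lo - (start + length))


def bounds(group: List[Feature]) -> Tuple[int, int, int, int]:
    """Bounding character ranges of a group in the susp and src documents."""
    return (
        min(f.susp_off for f in group),
        max(f.susp_off + f.susp_len for f in group),
        min(f.src_off for f in group),
        max(f.src_off + f.src_len for f in group),
    )


def valid_intervals(features: List[Feature],
                    min_features: int = 5,
                    max_gap: int = 3500) -> List[Tuple[int, int, int, int]]:
    """Greedy grouping (an assumption of this sketch): walk the features in
    order of their susp offset and start a new group whenever the next
    feature is farther than max_gap from the current group in either document."""
    groups: List[List[Feature]] = []
    group: List[Feature] = []
    for f in sorted(features):
        if group:
            s_lo, s_hi, r_lo, r_hi = bounds(group)
            too_far = (gap(s_lo, s_hi, f.susp_off, f.susp_len) > max_gap
                       or gap(r_lo, r_hi, f.src_off, f.src_len) > max_gap)
            if too_far:
                groups.append(group)
                group = []
        group.append(f)
    if group:
        groups.append(group)
    # Keep only groups with enough common features.
    return [bounds(g) for g in groups if len(g) >= min_features]
```

Because a feature is only a tuple of offsets and lengths, word 5-grams and stopword 8-grams can be mixed in the same input list without the grouping code knowing which type produced which feature.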
+
+We also experimented with modifying the allowed gap size using
+intrinsic plagiarism detection: allowing only a shorter gap if the common
+features around the gap belong to different passages detected as plagiarized
+in the suspicious document by the intrinsic detector, and allowing a larger gap
+if both surrounding common features belong to the same passage
+detected by the intrinsic detector. This approach, however, did not show
+any improvement over an allowed gap of a static size, so it was omitted
+from the final submission.
+
+\subsection{Postprocessing}
+
+In the postprocessing phase, we took the resulting valid intervals
+and attempted to further improve the results. We first
+removed overlaps: if both overlapping intervals were
+shorter than 300 characters, we removed both of them. Otherwise, we
+kept the longer detection (longer in terms of length in the suspicious document).
+
+We then joined adjacent valid intervals into one detection
+if at least one of the following criteria was met:
+\begin{itemize}
+\item the gap between the intervals contained at least 4 common features,
+and it contained at least one feature per 10,000
+characters\footnote{We computed the length of the gap as the number
+of characters between the detections in the source document, plus the
+number of characters between the detections in the suspicious document.}, or
+\item the gap was smaller than 30,000 characters and the size of the adjacent
+valid intervals was at least twice as big as the gap between them, or
+\item the gap was smaller than 30,000 characters and the number of common
+features per character in the adjacent interval was not more than three times
+greater than the number of features per character in the possible joined interval.
+\end{itemize}
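
The three joining criteria can be written as a single predicate. The following Python sketch is one possible reading, not the submitted code. The names {\tt Interval}, {\tt gap\_length} and {\tt should\_join} are hypothetical. The sketch assumes that the ``size of the adjacent valid intervals'' is the combined size of both intervals over both documents, and that ``the adjacent interval'' in the third criterion is the denser of the two. The thresholds are those given in the list above.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    susp_lo: int
    susp_hi: int
    src_lo: int
    src_hi: int
    n_features: int

    def size(self) -> int:
        """Length in the suspicious document plus length in the source document."""
        return (self.susp_hi - self.susp_lo) + (self.src_hi - self.src_lo)


def gap_length(a: Interval, b: Interval) -> int:
    """Characters between the detections in src plus those in susp (see the footnote)."""
    susp_gap = max(0, b.susp_lo - a.susp_hi, a.susp_lo - b.susp_hi)
    src_gap = max(0, b.src_lo - a.src_hi, a.src_lo - b.src_hi)
    return susp_gap + src_gap


def should_join(a: Interval, b: Interval, gap_features: int) -> bool:
    """Return True if at least one of the three criteria is met.
    Assumes both intervals are non-empty (size() > 0)."""
    gap = gap_length(a, b)
    # Criterion 1: at least 4 features in the gap, at least one per 10,000 characters.
    if gap_features >= 4 and gap_features * 10_000 >= gap:
        return True
    # Criteria 2 and 3 both require a gap smaller than 30,000 characters.
    if gap >= 30_000:
        return False
    # Criterion 2 (assumed reading): combined interval size is at least twice the gap.
    if a.size() + b.size() >= 2 * gap:
        return True
    # Criterion 3 (assumed reading): the denser interval has at most three times
    # the feature density of the would-be joined interval.
    joined_size = a.size() + b.size() + gap
    joined_density = (a.n_features + b.n_features + gap_features) / joined_size
    densest = max(a.n_features / a.size(), b.n_features / b.size())
    return densest <= 3 * joined_density
```

For example, two intervals of 2,000 characters each (counted over both documents) separated by a gap of 1,000 characters are joined by the second criterion even if the gap contains no common feature.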
+
+These parameters were fine-tuned to achieve the best results on the training corpus. With these parameters, our algorithm achieved an overall plagdet score of 0.73 on the training corpus.
+
+\subsection{Other Approaches Tried}
+
+We evaluated several other approaches that were
+omitted from our final submission for various reasons. We think they are
+nevertheless worth mentioning here.
+
+\subsubsection{Intrinsic Plagiarism Detection}
  
  Our approach is based on character $n$-gram profiles of the interval of
  the fixed size (in terms of $n$-grams), and their differences to the
@@ -32,7 +150,7 @@ of only 3-grams, and using the different measure of the difference between
  the n-gram profiles. We have used an approach similar to \cite{ngram},
  where we have computed the profile as an ordered set of the 400 most-frequent
  $n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set we have ignored the actual number of occurrences
+from ordering the set, we have ignored the actual number of occurrences
  of a given $n$-gram altogether, and used the value inversely
  proportional to the $n$-gram order in the profile, in accordance with
  Zipf's law \cite{zipf1935psycho}.
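
A rank-based profile of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the submitted code. The function names are hypothetical, the default $n=3$ is an assumption, and ties between equally frequent $n$-grams are broken alphabetically only to make the result deterministic. The text does not specify the difference measure, so {\tt profile\_distance} is merely one plausible choice (the sum of absolute weight differences).

```python
from collections import Counter
from typing import Dict


def ngram_profile(text: str, n: int = 3, size: int = 400) -> Dict[str, float]:
    """Map each of the `size` most frequent character n-grams to 1/rank.

    The actual occurrence counts are used only for ordering and are then
    discarded; the weight is inversely proportional to the rank."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {g: 1.0 / rank for rank, g in enumerate(ranked, start=1)}


def profile_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Hypothetical difference measure: sum of absolute weight differences."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))
```

A style-change function could then be obtained by comparing the profile of a sliding window with the profile of the whole document.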
@@ -40,19 +158,21 @@ the Zipf's law \cite{zipf1935psycho}.
  This approach has provided a more stable style-change function than
  the one proposed in \cite{pan09stamatatos}. Because of the pair-wise
  nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, so we wanted to use them
+of the intrinsic detection immediately, therefore we wanted to use them
  as hints to the external detection.
  
-\subsection{Cross-lingual detection}
+\subsubsection{Language Detection}
+
+For language detection, we used $n$-gram based categorization \cite{ngram}.
+We computed the language profiles from the source documents of the
+training corpus (using the annotations from the corpus itself). The result
+of this approach was better than that of the stopword-based detection we
+used in PAN 2010. However, there were still misdetected documents,
+mainly long lists of surnames and other tabular data. We added
+an ad-hoc fix: documents whose profile was too distant from all of the
+English, German, and Spanish profiles were declared to be in English.
  
-%For language detection, we used the $n$-gram based categorization \cite{ngram}.
-%We have computed the language profiles from the source documents of the
-%training corpus (using the annotations from the corpus itself). The result
-%of this approach was better than using the stopwords-based detection we have
-%used in PAN 2010. However, there were still mis-detected documents,
-%mainly the long lists of surnames and other tabular data. We have added
-%an ad-hoc fix, where for documents having their profile too distant from all of
-%English, German, and Spanish profiles, we have declared them to be in English.
+\subsubsection{Cross-lingual Plagiarism Detection}
  
  For cross-lingual plagiarism detection, our aim was to use the public
  interface to Google translate if possible, and use the resulting document
@@ -61,13 +181,13 @@ Should the translation service not be available, we wanted
  to use the fall-back strategy of translating isolated words only,
  with the additional exact matching of longer words (we have used words with
  5 characters or longer).
-We have supposed these longer words can be names or specialized terms,
+We have supposed that these longer words can be names or specialized terms,
  present in both languages.
  
  We have used dictionaries from several sources, like
-{\tt dicts.info\footnote{\url{http://www.dicts.info/}}},
-{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}},
-and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source
+{\it dicts.info}\footnote{\url{http://www.dicts.info/}},
+{\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
  and translated document were aligned on a line-by-line basis.
  
  In the final form of the detailed comparison sub-task, the results of machine
@@ -75,76 +195,205 @@ translation of the source documents were provided to the detector programs
  by the surrounding environment, so we have discarded the language detection
  and machine translation from our submission altogether, and used only
  line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document.
-
-\subsection{Multi-feature Plagiarism Detection}
+the offsets of text features in the source document. We have then treated
+the translated documents the same way as the source documents in English.
  
-Our pair-wise plagiarism detection is based on finding common passages
-of text, present both in the source and suspicious document. We call them
-{\it features}. In PAN 2010, we have used sorted word 5-grams, formed from
-words of three or more characters, as features to compare.
-Recently, other means of plagiarism detection have been explored:
-Stop-word $n$-gram detection is one of them
-\cite{stamatatos2011plagiarism}.
-
-We propose the plagiarism detection system based on detecting common
-features of various type, like word $n$-grams, stopword $n$-grams,
-translated words or word bigrams, exact common longer words from document
-pairs having each document in a different language, etc. The system
-has to be to the great extent independent of the specialities of various
-feature types. It cannot, for example, use the order of given features
-as a measure of distance between the features, as for example, several
-word 5-grams can be fully contained inside one stopword 8-gram.
+\subsection{Further Discussion}
  
-We thus define {\it common feature} of two documents (susp and src)
-as the following tuple:
-$$\langle
-\hbox{offset}_{\hbox{susp}},
-\hbox{length}_{\hbox{susp}},
-\hbox{offset}_{\hbox{src}},
-\hbox{length}_{\hbox{src}} \rangle$$
+As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
+detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
+when using it as a hint for the external plagiarism detection.
  
-In our final submission, we have used only the following two types
-of common features:
+In the full paper, we will also discuss the following topics:
  
  \begin{itemize}
-\item word 5-grams, from words of three or more characters, sorted, lowercased
-\item stop-word 8-grams, from 50 most-frequent English words (including
-       the possessive suffix 's), unsorted, lowercased, with 8-grams formed
-       only from the seven most-frequent words ({\it the, of, a, in, to, 's})
-       removed
+\item language detection and cross-language common features
+\item intrinsic plagiarism detection
+\item suitability of the plagdet score~\cite{potthastframework} for performance measurement
+\item feasibility of our approach in large-scale systems
+\item discussion of parameter settings
  \end{itemize}
  
-We have gathered all the common features for a given document pair, and formed
-{\it valid intervals} from them, as described in \cite{Kasprzak2009a}
-(a similar approach is used also in \cite{stamatatos2011plagiarism}).
-The algorithm is modified for multi-feature detection to use character offsets
-only instead of feature order numbers. We have used valid intervals
-consisting of at least 5 common features, with the maximum allowed gap
-inside the interval (characters not belonging to any common feature
-of a given valid interval) set to 3,500 characters.
+\nocite{pan09stamatatos}
+\nocite{ngram}
  
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
+\endinput
  
-\subsection{Postprocessing}
+What I want to discuss in the conclusion:
+- it was not possible to cache data
+- it was not possible to exclude overlapping similarities
+- so the run-time figures are completely useless
+- 669 lines of code, not counting comments and blank lines
+- the boundary between passages sometimes included whitespace and sometimes did not.
  
+Discussion of plagdet:
+- users want "odevzdej to show a 0\% match"; they do not care
+       what the number means
+- the boundaries of the detected passage do not matter
+- false positives are far worse
+- granularity is evil
  
-\subsection{Further discussion}
+Final results on the test corpus:
  
-In the full paper, we will also discuss the following topics:
+0.7288 0.5994 0.9306 1.0007   2012-06-16 02:23   plagdt recall precis granul
+                            01-no-plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02-no-obfuscation    0.9476 0.9627 0.9330 1.0000
+                            03-artificial-low    0.8726 0.8099 0.9477 1.0013
+                            04-artificial-high   0.3649 0.2255 0.9562 1.0000
+                            05-translation       0.7610 0.6662 0.8884 1.0008
+                            06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
+
+Results on the competition data:
+plagdet         precision       recall          granularity
+0.6826726      0.8931670       0.5524708       1.0000000
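
The plagdet value in the row above can be recomputed from the other three columns. The short Python sketch below uses the definition from the PAN evaluation framework \cite{potthastframework}: the harmonic mean of precision and recall divided by $\log_2(1 + \hbox{granularity})$. The function name {\tt plagdet} is only illustrative.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """F1 of precision and recall, penalised by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Values taken from the row above: precision, recall, granularity.
print(round(plagdet(0.8931670, 0.5524708, 1.0), 4))
```

With a granularity of exactly 1 the denominator is 1, so plagdet reduces to the plain F1 measure.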
+
+Run-time:
+12500 seconds: tokenization including sc and language detection
+2500 seconds: without sc and language detection
+14 seconds: evaluation of valid intervals and postprocessing
+
+
+TODO:
+- boundary according to the matching density
+- sort the XML by this_offset
+
+Here is the content of the JOURNAL file - how I measured some of the improvements:
+=================================================================
+baseline.py
+0.1250 0.1259 0.9783 2.4460   2012-05-03 06:02   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.8608 0.8609 0.8618 1.0009
+                            03_artificial_low    0.1006 0.1118 0.9979 2.9974
+                            04_artificial_high   0.0054 0.0029 0.9991 1.0778
+                            05_translation       0.0003 0.0002 1.0000 1.2143
+                            06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
+
+valid_intervals without postprocessing (this is how I first submitted it)
+0.3183 0.2034 0.9883 1.0850   2012-05-25 15:25   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9861 0.9973 0.9752 1.0000
+                            03_artificial_low    0.4127 0.3006 0.9975 1.1724
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
+
+postprocessed (merging of nearby intervals)
+0.3350 0.2051 0.9863 1.0188   2012-05-25 15:27   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9863 0.9973 0.9755 1.0000
+                            03_artificial_low    0.4541 0.3057 0.9942 1.0417
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
+
+whitespace (whitespace adjustment)
+0.3353 0.2053 0.9858 1.0188   2012-05-31 17:57   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9865 0.9987 0.9745 1.0000
+                            03_artificial_low    0.4546 0.3061 0.9940 1.0417
+                            04_artificial_high   0.0008 0.0004 1.0000 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
+
+gap_100: whitespace, + allow a gap of 100 5-grams in a valid interval instead of 50
+0.3696 0.2305 0.9838 1.0148   2012-05-31 18:07   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9850 0.9987 0.9717 1.0000
+                            03_artificial_low    0.5423 0.3846 0.9922 1.0310
+                            04_artificial_high   0.0058 0.0029 0.9151 1.0000
+                            05_translation       0.0001 0.0000 1.0000 1.0000
+                            06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
+
+gap_200: whitespace, + allow a gap of 200 5-grams in a valid interval instead of 50
+0.3906 0.2456 0.9769 1.0070   2012-05-31 18:09   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9820 0.9987 0.9659 1.0000
+                            03_artificial_low    0.5976 0.4346 0.9875 1.0139
+                            04_artificial_high   0.0087 0.0044 0.9374 1.0000
+                            05_translation       0.0001 0.0001 1.0000 1.0000
+                            06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
+
+gap_200_int_10: gap_200, + a valid interval has at least 10 5-grams instead of 20
+0.4436 0.2962 0.9660 1.0308   2012-05-31 18:11   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9612 0.9987 0.9264 1.0000
+                            03_artificial_low    0.7048 0.5808 0.9873 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
+0.4432 0.2959 0.9658 1.0310   2012-06-01 16:41   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9608 0.9980 0.9263 1.0000
+                            03_artificial_low    0.7045 0.5806 0.9872 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0000 0.0000 0.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+
+swng_unsorted with the same postprocessing as "whitespace" above
+0.2673 0.1584 0.9281 1.0174   2012-05-31 14:20   plagdt recall precis granul
+                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02_no_obfuscation    0.9439 0.9059 0.9851 1.0000
+                            03_artificial_low    0.3178 0.1952 0.9954 1.0377
+                            04_artificial_high   0.0169 0.0095 0.9581 1.1707
+                            05_translation       0.0042 0.0028 0.0080 1.0000
+                            06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
+
+swng_sorted
+0.2550 0.1906 0.4067 1.0253   2012-05-30 16:07   plagdt recall precis granul
+                            01_no_plagiarism     0.0000 0.0000 0.0000 1.0000
+                            02_no_obfuscation    0.6648 0.9146 0.5222 1.0000
+                            03_artificial_low    0.4093 0.2867 0.8093 1.0483
+                            04_artificial_high   0.0454 0.0253 0.4371 1.0755
+                            05_translation       0.0030 0.0019 0.0064 1.0000
+                            06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
+
+sort_susp: gap_200_int_10, + in postprocessing, sort intervals by their offset in susp rather than in src
+0.4437 0.2962 0.9676 1.0308   2012-06-01 18:06   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9641 0.9987 0.9317 1.0000
+                            03_artificial_low    0.7048 0.5809 0.9871 1.0530
+                            04_artificial_high   0.0457 0.0242 0.9762 1.0465
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+post_gap2_16000: sort_susp, + merge two intervals if the gap is < 16000 characters and is only half the size of those intervals (was 4000)
+0.4539 0.2983 0.9642 1.0054   2012-06-01 18:09   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9631 0.9987 0.9300 1.0000
+                            03_artificial_low    0.7307 0.5883 0.9814 1.0094
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
+
+post_gap2_32000: sort_susp, + merge intervals < 32000 characters with a gap at least half the size
+0.4543 0.2986 0.9638 1.0050   2012-06-01 18:12   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+                            03_artificial_low    0.7315 0.5893 0.9798 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap2_64000: sort_susp, + merge intervals < 32000 characters with a gap at least half the size
+0.4543 0.2988 0.9616 1.0050   2012-06-01 18:21   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9603 0.9987 0.9248 1.0000
+                            03_artificial_low    0.7316 0.5901 0.9782 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap1_2000: post_gap2_32000, + unconditionally join things that have a gap under 2000 (was 600)
+0.4543 0.2986 0.9635 1.0050   2012-06-01 18:29   plagdt recall precis granul
+                            01_no_plagiarism     1.0000 1.0000 1.0000 1.0000
+                            02_no_obfuscation    0.9628 0.9987 0.9294 1.0000
+                            03_artificial_low    0.7315 0.5895 0.9794 1.0085
+                            04_artificial_high   0.0480 0.0247 0.9816 1.0078
+                            05_translation       0.0008 0.0004 1.0000 1.0000
+                            06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
  
-\begin{itemize}
-\item language detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item other possible features to use, especially for cross-lingual detection
-\item discussion of parameter settings
-\end{itemize}