\section{Selecting}\r
Document snippets were used to decide whether to download a document for text alignment.\r
We used a 2-tuple measure, which counts how many pairs of neighbouring words occur in both the snippet and the suspicious document.\r
-Performance of this measure is depicted at picture~\ref{fig:snippet_graph}.\r
+The performance of this measure is shown in Figure~\ref{fig:snippet_graph}.\r
Given this measure, a threshold for the download decision must be set so as to maximize the number of discovered similarities\r
and minimize the total number of downloads.\r
A profitable threshold is one at which the distance between those two curves is largest.\r
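
As a rough illustration of the 2-tuple measure, the following Python sketch counts the neighbouring word pairs shared by a snippet and a suspicious document and compares the count with a threshold. The tokenization, the function names and the default threshold are assumptions made for this example; they are not taken from the system described here.

```python
import re


def word_pairs(text):
    """Return the set of neighbouring word pairs (2-tuples) in text."""
    # Assumed tokenization: lowercase alphanumeric runs.
    words = re.findall(r"\w+", text.lower())
    return set(zip(words, words[1:]))


def shared_pairs(snippet, suspicious):
    """Count 2-tuples occurring in both the snippet and the suspicious document."""
    return len(word_pairs(snippet) & word_pairs(suspicious))


def should_download(snippet, suspicious, threshold=2):
    """Decide whether the document behind the snippet is worth downloading."""
    # The threshold is a hypothetical value; in practice it would be chosen
    # from the measured performance curves.
    return shared_pairs(snippet, suspicious) >= threshold
```

A higher threshold saves downloads at the cost of missed similarities, which is the trade-off the threshold selection has to balance.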
\r
\section{Text Alignment}\r
\r
-The system uses the same basic principles as in \cite{suchomel_kas_12}.\r
+The system uses the same basic principles as in \cite{suchomel_kas_12}:\r
+\r
+\begin{itemize}\r
+\item{\cemph{common features} between source and suspicious documents}\r
+\begin{itemize}\r
+\item{word 5-grams}\r
+\item{stop-word 8-grams \cite{stamatatos2011plagiarism}}\r
+\end{itemize}\r
+\item{\cemph{valid intervals} of characters covered by common features\r
+ ``densely enough''}\r
+\item{\cemph{postprocessing}---remove overlapping detections,\r
+ join neighbouring detections}\r
+\end{itemize}\r
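
One possible reading of ``densely enough'' is sketched below in Python: common features are grouped into one interval as long as the gap between neighbouring features stays small, and an interval is kept only if it contains enough features. The function name and both parameters (\texttt{max\_gap}, \texttt{min\_features}) are hypothetical and their values are not those used by the system.

```python
def valid_intervals(features, max_gap=50, min_features=3):
    """Group common features into character intervals.

    features: iterable of (start, end) character offsets of common
    features within one document.
    max_gap, min_features: hypothetical density parameters.
    """
    intervals = []
    group = []       # features collected for the current interval
    group_end = 0    # rightmost character covered by the current group
    for start, end in sorted(features):
        # Too large a gap closes the current interval.
        if group and start - group_end > max_gap:
            if len(group) >= min_features:
                intervals.append((group[0][0], group_end))
            group = []
        if not group:
            group_end = end
        group.append((start, end))
        group_end = max(group_end, end)
    # Flush the last group.
    if len(group) >= min_features:
        intervals.append((group[0][0], group_end))
    return intervals
```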
+\r
+\subsection{Alternative Features}\r
+\r
+\begin{itemize}\r
+\item{\cemph{contextual n-grams} \cite{torrejondetailed}}\r
+\begin{itemize}\r
+\item{\cemph{The quick} brown \cemph{fox jumped} over the lazy dogs.}\r
+\item{The \cemph{quick brown} fox \cemph{jumped over} the lazy dogs.}\r
+\end{itemize}\r
+\item{plain word 4-grams}\r
+\begin{itemize}\r
+\item{\cemph{The quick brown fox} jumped over the lazy dogs.}\r
+\item{The \cemph{quick brown fox jumped} over the lazy dogs.}\r
+\end{itemize}\r
+\end{itemize}\r
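
The two feature types can be contrasted with a short Python sketch. Reading only the highlighted examples above, a contextual 4-gram is taken here to be a five-word window with its middle word skipped; this is an assumption, and the cited method may apply further normalization that the sketch does not attempt. The function names are illustrative.

```python
def plain_ngrams(words, n=4):
    """All runs of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def contextual_ngrams(words, window=5):
    """Windows of `window` words with the middle word dropped (assumed reading)."""
    mid = window // 2
    return [
        tuple(words[i:i + mid] + words[i + mid + 1:i + window])
        for i in range(len(words) - window + 1)
    ]


words = "The quick brown fox jumped over the lazy dogs.".split()
first_plain = plain_ngrams(words)[0]            # The quick brown fox
first_contextual = contextual_ngrams(words)[0]  # The quick fox jumped
```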
+\r
+\begin{table}\r
+\r
+\begin{center}\r
+\begin{tabular}{|l|r|r|r|r|}\r
+\hline\r
+\bf feature & \bf recall & \bf precision & \bf granularity & \bf plagdet \\\r
+\hline\r
+plain 5-grams & 0.6306 & 0.8484 & 1.0000 & \cemph{0.7235} \\\r
+contextual 4-grams & 0.6721 & \cemph{0.8282} & 1.0000 & \cemph{0.7421} \\\r
+plain 4-grams & \cemph{0.7556} & 0.7340 & 1.0000 & \cemph{0.7447} \\\r
+\hline\r
+\end{tabular}\r
+\end{center}\r
+\r
+\caption{Comparison of plain word 5-grams, contextual 4-grams, and plain word 4-grams}\r
+\end{table}\r
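
For reference, the plagdet column combines the other three. Assuming the standard PAN definition, with $F_1$ the harmonic mean of precision $P$ and recall $R$:
\[
F_1 = \frac{2PR}{P + R}, \qquad
\mathrm{plagdet} = \frac{F_1}{\log_2(1 + \mathrm{granularity})}.
\]
For the plain 5-grams row this gives
\[
F_1 = \frac{2 \cdot 0.8484 \cdot 0.6306}{0.8484 + 0.6306} \approx 0.7235,
\qquad
\mathrm{plagdet} \approx \frac{0.7235}{\log_2(1 + 1.0000)} = 0.7235,
\]
in agreement with the value in the table. With granularity equal to 1 in every row, plagdet reduces to $F_1$.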
+\r
+\subsection{Global Postprocessing}\r
+\r
+\begin{itemize}\r
+\item{Similar to PAN 2010 \cite{Kasprzak2010}}\r
+\item{Overlapping detections removal}\r
+\item{\cemph{Result:} improvement, but not as big as in 2010}\r
+\end{itemize}\r
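
A minimal Python sketch of overlapping-detection removal follows. The text above does not state which of two overlapping detections is kept; preferring the longer one is an assumption made for this example, as are the function name and the half-open \texttt{(start, end)} representation.

```python
def remove_overlaps(detections):
    """Return non-overlapping detections, preferring longer ones.

    detections: iterable of half-open (start, end) character intervals
    in the suspicious document.
    """
    kept = []
    # Visit detections from longest to shortest (assumed preference).
    for start, end in sorted(detections, key=lambda d: d[1] - d[0], reverse=True):
        # Keep the detection only if it is disjoint from all kept ones.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)
```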
\r
%\r
% Common part\r