From: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
Date: Thu, 19 Sep 2013 16:07:50 +0000 (+0200)
Subject: Y: text alignment and other minor edits
X-Git-Tag: 20130920-vytisteno~6
X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?p=pan13-paper.git;a=commitdiff_plain;h=43ffb5d253cf82ac83620f7bfd2ad6ad4affa5bc

Y: text alignment and other minor edits
---

diff --git a/pan13-poster/poster.tex b/pan13-poster/poster.tex
index a01b220..56f1111 100755
--- a/pan13-poster/poster.tex
+++ b/pan13-poster/poster.tex
@@ -205,7 +205,7 @@ From the suspicious document, there were three diverse types of queries extracte
 \section{Selecting}
 Document snippets were used for deciding whether to download the document for the text alignment.
 We used 2-tuples measurement, which indicates how many neighbouring word pairs coexist in the snippet and in the suspicious document.
-Performance of this measure is depicted at picture~\ref{fig:snippet_graph}.
+Performance of this measure is depicted in Figure~\ref{fig:snippet_graph}.
 Having this measure, a threshold for download decision needs to be set in order to maximize all discovered similarities
 and minimize total downloads.
 A profitable threshold is such that matches with the largest distance between those two curves.
@@ -223,7 +223,59 @@ A profitable threshold is such that matches with the largest distance between th
 
 \section{Text Alignment}
 
-The system uses the same basic principles as in \cite{suchomel_kas_12}.
+The system uses the same basic principles as in \cite{suchomel_kas_12}:
+
+\begin{itemize}
+\item{\cemph{common features} between source and suspicious documents}
+\begin{itemize}
+\item{word 5-grams}
+\item{stop-word 8-grams \cite{stamatatos2011plagiarism}}
+\end{itemize}
+\item{\cemph{valid intervals} of characters covered by common features
+	``densely enough''}
+\item{\cemph{postprocessing}---remove overlapping detections,
+	join neighbouring detections}
+\end{itemize}
+
+\subsection{Alternative Features}
+
+\begin{itemize}
+\item{\cemph{contextual n-grams} \cite{torrejondetailed}}
+\begin{itemize}
+\item{\cemph{The quick} brown \cemph{fox jumped} over the lazy dogs.}
+\item{The \cemph{quick brown} fox \cemph{jumped over} the lazy dogs.}
+\end{itemize}
+\item{plain word 4-grams}
+\begin{itemize}
+\item{\cemph{The quick brown fox} jumped over the lazy dogs.}
+\item{The \cemph{quick brown fox jumped} over the lazy dogs.}
+\end{itemize}
+\end{itemize}
+
+\begin{table}
+
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+\bf feature & \bf recall & \bf precision & \bf granularity & \bf plagdet \\
+\hline
+plain      5-grams & 0.6306 & 0.8484 & 1.0000 & \cemph{0.7235} \\
+contextual 4-grams & 0.6721 & \cemph{0.8282} & 1.0000 & \cemph{0.7421} \\
+plain      4-grams & \cemph{0.7556} & 0.7340 & 1.0000 & \cemph{0.7447} \\
+\hline
+\end{tabular}
+\end{center}
+
+\caption{Comparison of plain word 5-grams, contextual 4-grams, and plain word 4-grams}
+\end{table}
+
+\subsection{Global Postprocessing}
+
+\begin{itemize}
+\item{Similar to PAN 2010 \cite{Kasprzak2010}}
+\item{Overlapping detections removal}
+\item{\cemph{Result:} improvement, but not as big as in 2010}
+\end{itemize}
 
 %
 % Common part