From: Jan "Yenya" Kasprzak
Date: Thu, 19 Sep 2013 16:07:50 +0000 (+0200)
Subject: Y: text alignment and other minor changes
X-Git-Tag: 20130920-vytisteno~6
X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?p=pan13-paper.git;a=commitdiff_plain;h=43ffb5d253cf82ac83620f7bfd2ad6ad4affa5bc

Y: text alignment and other minor changes
---

diff --git a/pan13-poster/poster.tex b/pan13-poster/poster.tex
index a01b220..56f1111 100755
--- a/pan13-poster/poster.tex
+++ b/pan13-poster/poster.tex
@@ -205,7 +205,7 @@ From the suspicious document, there were three diverse types of queries extracte
 \section{Selecting}
 Document snippets were used for deciding whether to download the document for the text alignment.
 We used 2-tuples measurement, which indicates how many neighbouring word pairs coexist in the snippet and in the suspicious document.
-Performance of this measure is depicted at picture~\ref{fig:snippet_graph}.
+Performance of this measure is depicted at Figure~\ref{fig:snippet_graph}.
 Having this measure, a threshold for download decision needs to be set in order to maximize all discovered similarities and minimize total downloads.
 A profitable threshold is such that matches with the largest distance between those two curves.
 
@@ -223,7 +223,59 @@ A profitable threshold is such that matches with the largest distance between th
 
 \section{Text Alignment}
 
-The system uses the same basic principles as in \cite{suchomel_kas_12}.
+The system uses the same basic principles as in \cite{suchomel_kas_12}:
+
+\begin{itemize}
+\item{\cemph{common features} between source and suspicious documents}
+\begin{itemize}
+\item{word 5-grams}
+\item{stop-word 8-grams \cite{stamatatos2011plagiarism}}
+\end{itemize}
+\item{\cemph{valid intervals} of characters covered by common features
+ ``densely enough''}
+\item{\cemph{postprocessing}---remove overlapping detections,
+ join neighbouring detections}
+\end{itemize}
+
+\subsection{Alternative Features}
+
+\begin{itemize}
+\item{\cemph{contextual n-grams} \cite{torrejondetailed}}
+\begin{itemize}
+\item{\cemph{The quick} brown \cemph{fox jumped} over the lazy dogs.}
+\item{The \cemph{quick brown} fox \cemph{jumped over} the lazy dogs.}
+\end{itemize}
+\item{plain word 4-grams}
+\begin{itemize}
+\item{\cemph{The quick brown fox} jumped over the lazy dogs.}
+\item{The \cemph{quick brown fox jumped} over the lazy dogs.}
+\end{itemize}
+\end{itemize}
+
+\begin{table}
+
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+\bf feature & \bf recall & \bf precision & \bf granularity & plagdet \\
+\hline
+plain 5-grams & 0.6306 & 0.8484 & 1.0000 & \cemph{0.7235} \\
+contextual 4-grams & 0.6721 & \cemph{0.8282} & 1.0000 & \cemph{0.7421} \\
+plain 4-grams & \cemph{0.7556} & 0.7340 & 1.0000 & \cemph{0.7447} \\
+\hline
+\end{tabular}
+\end{center}
+
+\caption{Comparison of contextual 4-grams and plain word 4-grams}
+\end{table}
+
+\subsection{Global Postprocessing}
+
+\begin{itemize}
+\item{Similar to PAN 2010 \cite{Kasprzak2010}}
+\item{Overlapping detections removal}
+\item{\cemph{Result:} improvement, but not as big as in 2010}
+\end{itemize}
 
 %
 % Spolecna cast
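
Note on the features and scores in this commit: the Python sketch below is not part of the patch. It illustrates, under stated assumptions, the plain word n-grams and contextual n-grams from the poster's example sentence, and recomputes the plagdet column from recall, precision, and granularity. The function names and the tokenizer (lowercased `\w+` matches) are hypothetical. The contextual n-gram is read off the poster's highlighted example as a 5-word window with the middle word skipped, which may differ in detail from the method in \cite{torrejondetailed}. The plagdet function uses the usual PAN definition, F1 divided by log2(1 + granularity).

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_ngrams(tokens, n):
    """Plain word n-grams: every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contextual_ngrams(tokens, window=5):
    """Contextual n-grams as read from the poster's example (an assumption):
    slide a window of `window` tokens and drop its middle token."""
    middle = window // 2
    return [
        tuple(tokens[i:i + middle] + tokens[i + middle + 1:i + window])
        for i in range(len(tokens) - window + 1)
    ]


def plagdet(recall, precision, granularity):
    """PAN plagdet score: F1 / log2(1 + granularity)."""
    if recall + precision == 0:
        return 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return f1 / math.log2(1 + granularity)


tokens = tokenize("The quick brown fox jumped over the lazy dogs.")
print(word_ngrams(tokens, 4)[:2])      # first two plain word 4-grams
print(contextual_ngrams(tokens)[:2])   # first two contextual 4-grams
print(round(plagdet(0.6306, 0.8484, 1.0), 4))  # plain 5-grams row
```

With granularity 1.0 the denominator is 1, so plagdet reduces to F1. Recomputing from the table's rounded recall and precision agrees with the plagdet column to about three decimal places; the last digit can differ because the inputs are themselves rounded.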