From: Simon Suchomel
Date: Thu, 19 Sep 2013 13:19:02 +0000 (+0200)
Subject: First version of Simon's finished part
X-Git-Tag: 20130920-vytisteno~9
X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?p=pan13-paper.git;a=commitdiff_plain;h=b92a122ec0f3aca815db768cbd5ff1cde427cd38

First version of Simon's finished part
---

diff --git a/pan13-poster/img/document_awfc.pdf b/pan13-poster/img/document_awfc.pdf
index 71e5eff..0a48308 100755
Binary files a/pan13-poster/img/document_awfc.pdf and b/pan13-poster/img/document_awfc.pdf differ
diff --git a/pan13-poster/img/document_keywords.pdf b/pan13-poster/img/document_keywords.pdf
new file mode 100755
index 0000000..f60baf6
Binary files /dev/null and b/pan13-poster/img/document_keywords.pdf differ
diff --git a/pan13-poster/img/document_paragraphs.pdf b/pan13-poster/img/document_paragraphs.pdf
new file mode 100755
index 0000000..38c4372
Binary files /dev/null and b/pan13-poster/img/document_paragraphs.pdf differ
diff --git a/pan13-poster/img/queryprocess.pdf b/pan13-poster/img/queryprocess.pdf
new file mode 100755
index 0000000..e6d8a1a
Binary files /dev/null and b/pan13-poster/img/queryprocess.pdf differ
diff --git a/pan13-poster/poster.tex b/pan13-poster/poster.tex
index 42987ac..5e3c9a0 100755
--- a/pan13-poster/poster.tex
+++ b/pan13-poster/poster.tex
@@ -116,41 +116,32 @@
 \begin{multicols}{2}\setlength{\columnseprule}{0pt}
-
-
 \section{Introduction}
-
+%
 PAN 2013 LOrem ipsum Lorem ipsum Lorem ipsumLorem ipsumLorem ipsumLorem ipsumLorem ipsum
-
-
+%
 \vfill
 \columnbreak
-
+%
 \begin{figure}
 \centering
- \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}
+ \includegraphics[width=0.6\textwidth]{img/source_retrieval_process.pdf}
 \caption{Plagiarism discovery process.}
 \label{fig:process}
 \end{figure}
-
-
 \end{multicols}
-
-
-
 \begin{multicols}{2}
-
 %\rm
-
 %%% Introduction
 \section{Querying}
 Querying means effectively utilizing the search engine to retrieve as many relevant documents as possible
 with the minimum number of queries.
 %We consider the resulting document relevant if it shares some text characteristics with the suspicious document.
-In the real world, queries as such represent an appreciable cost, so their minimization should be one of the top priorities. \\
-\subsection{Types of Queries}
-From the suspicious document, there were three diverse types of queries extracted.
-\subsubsection{Keywords Based Queries}
+In the real world, queries as such represent an appreciable cost, so their minimization should be one of the top priorities.
+%\subsection{Types of Queries}
+From the suspicious document, there were three diverse types of queries extracted.\\
+\begin{minipage}{0.55\linewidth}
+\subsection{Keywords Based Queries}
 \begin{ytemize}
 \item TF--IDF based automated keyword extraction;
 \item 5-token long;
@@ -158,9 +149,15 @@ From the suspicious document, there were three diverse types of queries extracte
 \item Non-positional;
 \item Non-phrasal.
 \end{ytemize}
-
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+ \includegraphics[width=1\linewidth]{img/document_keywords.pdf}
+\end{figure}
+\end{minipage}
 \begin{minipage}{0.55\linewidth}
-\subsubsection{Intrinsic Plagiarism Based Queries}
+\subsection{Intrinsic Plagiarism Based Queries}
 \begin{ytemize}
 \item Averaged Word Frequency Class based chunking~\cite{AWFC};
 \item Random sentence selection from the chunk;
@@ -175,16 +172,35 @@ From the suspicious document, there were three diverse types of queries extracte
 \includegraphics[width=1\linewidth]{img/document_awfc.pdf}
 \end{figure}
 \end{minipage}
-
-\subsubsection{Paragraph Based Queries}
+\begin{minipage}{0.55\linewidth}
+\subsection{Paragraph Based Queries}
 \begin{ytemize}
 \item Longest sentences from miscellaneous paragraphs;
 \item Deterministic;
 \item Positional;
 \item Phrasal.
 \end{ytemize}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+ \includegraphics[width=1\linewidth]{img/document_paragraphs.pdf}
+\end{figure}
+\end{minipage}
+
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=0.8\linewidth]{img/queryprocess.pdf}
+ \caption{Stepwise query execution process.}
+\end{figure}

 \section{Selecting}
+Document snippets were used to decide whether to download a document for text alignment.
+We used a 2-tuple measure, which counts how many neighbouring word pairs occur in both the snippet and the suspicious document.
+The performance of this measure is shown in Figure~\ref{fig:snippet_graph}.
+Given this measure, a threshold for the download decision must be set so as to maximize the discovered similarities
+and minimize the total number of downloads.
+A profitable threshold is the one at which the distance between those two curves is largest.
 \begin{figure}
 \centering
 \includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}
@@ -192,6 +208,7 @@ From the suspicious document, there were three diverse types of queries extracte
 \label{fig:snippet_graph}
 \end{figure}

+
 %
 % Yenya's part
 %
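
Note on the Selecting section added by this commit: the 2-tuple snippet measure and the threshold choice can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation. The function names (word_pairs, shared_pairs, best_threshold), the regex tokenization, and the reading of "those two curves" as the fraction of similar documents kept versus the fraction of all candidates downloaded are hypothetical choices made for this example; the sample data are toy values.

```python
import re


def word_pairs(text):
    """Return the set of neighbouring word pairs (2-tuples) in a text.

    Tokenization by a simple regex on lowercased text is an assumption.
    """
    words = re.findall(r"\w+", text.lower())
    return set(zip(words, words[1:]))


def shared_pairs(snippet, suspicious):
    """Count word pairs that occur in both the snippet and the suspicious document."""
    return len(word_pairs(snippet) & word_pairs(suspicious))


def best_threshold(scored):
    """Pick the download threshold with the largest gap between two curves.

    `scored` is a list of (score, has_similarity) pairs, one per candidate
    document. A document is downloaded when score >= threshold. The two
    curves are assumed to be the fraction of similar documents still found
    and the fraction of all candidates downloaded.
    """
    total = len(scored)
    relevant = sum(1 for _, rel in scored if rel)
    if total == 0 or relevant == 0:
        raise ValueError("need at least one candidate with a known similarity")

    best_t, best_gap = None, float("-inf")
    for t in sorted({score for score, _ in scored}):
        kept = [rel for score, rel in scored if score >= t]
        found = sum(1 for rel in kept if rel) / relevant
        downloaded = len(kept) / total
        gap = found - downloaded
        if gap > best_gap:
            best_t, best_gap = t, gap
    return best_t


if __name__ == "__main__":
    # Toy data: (shared word pairs, whether alignment later found a similarity).
    candidates = [(0, False), (1, False), (3, True), (5, True)]
    print(shared_pairs("the quick brown fox", "A quick brown fox jumps"))
    print(best_threshold(candidates))
```

With labelled training data, the threshold returned by best_threshold would then be applied to the shared_pairs score of each new snippet before deciding to download.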