X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-poster%2Fposter.tex;h=0148eb1110b33871befb78855d0e64ee2698fd4a;hb=cd0116df7be443169acb4e844372d27e1aed0a0e;hp=c3ab1c57b5ad05ceb3e72724b69833b96c2be5cb;hpb=95bbf9a1fc66f175da1daff45dc1601b2a9caa69;p=pan13-paper.git

diff --git a/pan13-poster/poster.tex b/pan13-poster/poster.tex
index c3ab1c5..0148eb1 100755
--- a/pan13-poster/poster.tex
+++ b/pan13-poster/poster.tex
@@ -4,9 +4,15 @@
 \usepackage{amsmath}
 \usepackage{amssymb}
 \usepackage{multicol}
-\usepackage{bera}
 \usepackage[utf8]{inputenc}
 %\usepackage{fancybullets}
+%\usepackage{floatflt}
+%\usepackage{graphics}
+\usepackage{fontspec}
+\usepackage{xunicode}
+\setmainfont[Mapping=tex-text]{DejaVu Sans}
+\setsansfont[Mapping=tex-text]{DejaVu Sans}
+\setmonofont[Mapping=tex-text]{DejaVu Sans Mono}
 
 \definecolor{BoxCol}{rgb}{0.9,0.9,1}
 % uncomment for light blue background to \section boxes 
@@ -18,7 +24,7 @@
 \definecolor{ReallyEmph}{rgb}{0.7,0,0}
 
 \renewcommand{\titlesize}{\Huge}
-\title{Diverse Queries and Feature Type Selection \\ for Plagiarism Discovery}
+\title{Diverse Queries and Feature Type Selection for Plagiarism Discovery}
 
 % Note: only give author names, not institute
 \author{Å imon Suchomel, Jan Kasprzak, and Michal Brandejs}
@@ -49,6 +55,13 @@
 
 \setlength{\figbotskip}{\smallskipamount}
 
+\renewcommand{\SubSection}[2][?]{
+  \vspace{0.5\secskip}
+  \refstepcounter{subsection}
+  {\bf \subsectionsize \textcolor{SectionCol}{\arabic{section}.\arabic{subsection}~#2}}
+  \par\vspace{0.375\secskip}
+}
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%% Begin of Document
 
@@ -114,38 +127,39 @@
 
 
 \begin{multicols}{2}\setlength{\columnseprule}{0pt}
-
-
 \section{Introduction}
-PAN 2013 LOrem ipsum Lorem ipsum Lorem ipsumLorem ipsumLorem ipsumLorem ipsumLorem ipsum 
-
-
-
+%
+A program for helping detering real-world plagiarism needs to accomplish many tasks.
+Original documents which served for creation of plagiarism must be retrieved and also suspicious passages according to
+input document must be highlighted. This poster presents methodology used during PAN2013 competition on uncovering plagiarism.
+
+The whole process is depicted at picture~\ref{fig:process}. The source retrieval task is divided into
+2 subtasks: Quering and Selecting, during which the software utilizes a given search engine. The retrieved
+sources must be examined in detail in order to highlight as many plagiarism cases as possible. This process is depicted
+as Text Alignment. Results of this process are called {\em detections}, i.e.~passages of {\em source document} and {\em suspicious document}, which are similar enough to each other, and can serve as a basis for further manual examination for possible plagiarism.
+%
+\vfill
+\columnbreak
+%
 \begin{figure}
  \centering
   \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}
   \caption{Plagiarism discovery process.}
   \label{fig:process}
 \end{figure} 
-
-
 \end{multicols}
-
-
-
 \begin{multicols}{2}
-
 %\rm
-
 %%% Introduction
 \section{Querying}
-Querying means to effectively utilize the search engine in order to retrieve as many relevant
+Querying means to effectively utilize a search engine in order to retrieve as many relevant
 documents as possible with the minimum amount of queries.
 %We consider the resulting document relevantif it shares some of text characteristics with the suspicious document.
-In real-world queries as such represent appreciable cost, therefore their minimization should be one of the top priorities. \\
-\subsection{Types of Queries}
-From the suspicious document, there were three diverse types of queries extracted.
-\subsubsection{Keywords Based Queries}
+In real-world, queries as such represent appreciable cost, therefore their quantity minimization should be one of the top priorities. 
+%\subsection{Types of Queries}
+During initial phase, there were three diverse types of queries extracted from each suspicious document.\\
+\begin{minipage}{0.55\linewidth}
+\subsection{Keywords Based Queries}
 \begin{ytemize}
 \item TF--IDF base automated keywords extraction;
 \item 5-token long; 
@@ -153,56 +167,167 @@ From the suspicious document, there were three diverse types of queries extracte
 \item Non-positional;
 \item Non-phrasal.
 \end{ytemize}
-\subsubsection{Intrinsic Plagiarism Based Queries}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+  \includegraphics[width=1\linewidth]{img/document_keywords.pdf}
+\end{figure}
+\end{minipage}
+\begin{minipage}{0.55\linewidth}
+\subsection{Intrinsic Plagiarism Based Queries}
 \begin{ytemize}
-\item Averaged Word Frequency Class based chunking~\cite{AWFC};
+\item Averaged Word Frequency Class based chunking~\cite{awfc};
 \item Random sentence selection from the chunk;
 \item Non-deterministic;
 \item Positional;
 \item Phrasal.
 \end{ytemize}
-\subsubsection{Paragraph Based Queries}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+  \includegraphics[width=1\linewidth]{img/document_awfc.pdf}
+\end{figure}
+\end{minipage}
+\begin{minipage}{0.55\linewidth}
+\subsection{Paragraph Based Queries}
 \begin{ytemize}
 \item Longest sentences from miscellaneous paragraphs;
 \item Deterministic;
 \item Positional;
 \item Phrasal.
 \end{ytemize}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+  \includegraphics[width=1\linewidth]{img/document_paragraphs.pdf}
+\end{figure}
+\end{minipage}
+
+\begin{figure}[h]
+ \centering
+  \includegraphics[width=0.8\linewidth]{img/queryprocess.pdf}
+   \caption{Stepwise queries execution process.}
+\end{figure}
 
 \section{Selecting}
+Document snippets were used for deciding whether to download the document for the text alignment.
+We used 2-tuples measurement, which indicates how many neighbouring word pairs coexist in the snippet and in the suspicious document.
+Performance of this measure is depicted at Figure~\ref{fig:snippet_graph}.
+Having this measure, a threshold for download decision needs to be set in order to maximize all discovered similarities
+and minimize total downloads.
+A profitable threshold is such that matches with the largest distance between those two curves.
 \begin{figure}
   \centering
   \includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}
   \caption{Downloads and similarities performance.}
   \label{fig:snippet_graph}
 \end{figure}
-
+%
+% Yenyova cast
+%
 \section{Text Alignment}
 
-\section{Conclusion}
+The system uses the same basic principles as in \cite{suchomel_kas_12}:
 
-NÄjakÃ½ zÃ¡vÄr
+\begin{ytemize}
+\item{\cemph{common features} between source and suspicious documents}
+\begin{ytemize}
+\item{word 5-grams}
+\item{stop-word 8-grams \cite{stamatatos2011plagiarism}}
+\end{ytemize}
+\item{\cemph{valid intervals} of characters covered by common features
+	``densely enough''}
+\item{\cemph{postprocessing}---remove overlapping detections,
+	join neighbouring detections}
+\end{ytemize}
 
-%%% References
+\subsection{Alternative Features}
 
-%% Note: use of BibTeX als works!!
+\begin{ytemize}
+\item{\cemph{contextual n-grams} \cite{torrejondetailed}}
+\begin{ytemize}
+\item{\cemph{The quick} brown \cemph{fox jumped} over the lazy dogs.}
+\item{The \cemph{quick brown} fox \cemph{jumped over} the lazy dogs.}
+\end{ytemize}
+\item{plain word 4-grams}
+\begin{ytemize}
+\item{\cemph{The quick brown fox} jumped over the lazy dogs.}
+\item{The \cemph{quick brown fox jumped} over the lazy dogs.}
+\end{ytemize}
+\end{ytemize}
 
-\bibliographystyle{plain}
-\begin{thebibliography}{1}
+\begin{table}
 
-\bibitem{ISMU}
-\cemph{Masaryk University Information System}\\
-{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+\bf feature & \bf recall & \bf precision & \bf granularity & plagdet \\
+\hline
+plain      5-grams & 0.6306 & 0.8484 & 1.0000 & \cemph{0.7235} \\
+contextual 4-grams & 0.6721 & \cemph{0.8282} & 1.0000 & \cemph{0.7421} \\
+plain      4-grams & \cemph{0.7556} & 0.7340 & 1.0000 & \cemph{0.7447} \\
+\hline
+\end{tabular}
+\end{center}
+
+\caption{Comparison of contextual 4-grams and plain word 4-grams}
+\end{table}
+
+\subsection{Global Postprocessing}
+
+\begin{ytemize}
+\item{Similar to PAN 2010 \cite{Kasprzak2010}}
+\item{Overlapping detections removal}
+\item{\cemph{Result:} improvement, but not as significant as in 2010}
+\end{ytemize}
 
-\bibitem{Theses}
-\cemph{Czech National Archive of Graduate Theses}\\
-{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.
+%
+% Spolecna cast
+%
 
-\bibitem{AWFC}
-\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\
-{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}
+\section{Conclusion}
 
-\end{thebibliography}
+\subsection{Candidate retrieval}
+
+\begin{ytemize}
+\item{Second best ratio of recall to the number of queries}
+\item{Missing support for phrasal search in ChatNoir is a big stumbling block}
+\end{ytemize}
+
+\subsection{Text alignment}
+
+\begin{ytemize}
+\item{Significant improvement against PAN 2013}
+\item{Word 4-grams are better than contextual 4-grams}
+\item{We need a better ranking system than plagdet!}
+\end{ytemize}
+
+%%% References
+
+%% Note: use of BibTeX als works!!
+
+\bibliographystyle{plain}
+\bibliography{pan13-notebook}
+\nocite{awfc}
+
+%\begin{thebibliography}{1}
+%
+%\bibitem{ISMU}
+%\cemph{Masaryk University Information System}\\
+%{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.
+%
+%\bibitem{Theses}
+%\cemph{Czech National Archive of Graduate Theses}\\
+%{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.
+%
+%\bibitem{AWFC}
+%\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\
+%{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}
+%
+%\end{thebibliography}
 
 \smallskip
 \hrule height .1em
@@ -210,14 +335,20 @@ NÄjakÃ½ zÃ¡vÄr
 
 % \sffamily
 
-QR kÃ³d?
 
+\hbox to \hsize{
+	{\hsize=0.5\hsize\vbox{
 \cemph{Contact information:}\\
-	Å imon Suchomel {\tt suchomel@fi.muni.cz},\\
-	Jan Kasprzak, {\tt kas@fi.muni.cz}.
-
+	Å imon Suchomel {\tt suchomel@fi.muni.cz}\\
+	Jan Kasprzak {\tt kas@fi.muni.cz}\\
+	{\cemph{\tt http://www.fi.muni.cz/\~{}kas/pan13/}}
+}
+	\hfill
+	{\hsize=0.4\hsize\vbox{
+	\includegraphics[width=\hsize]{qrcode.png}
+}}}}
+	
 
 \end{multicols}
 
 \end{document}
-