\documentclass[a0,portrait]{sciposter}

\usepackage{multicol}
\usepackage[utf8]{inputenc}
%\usepackage{fancybullets}
%\usepackage{floatflt}
%\usepackage{graphics}

\definecolor{BoxCol}{rgb}{0.9,0.9,1}
% uncomment for light blue background to \section boxes
% for use with default option boxedsections

\definecolor{SectionCol}{rgb}{0,0,0.5}
% uncomment for dark blue \section text

\definecolor{ReallyEmph}{rgb}{0.7,0,0}
\renewcommand{\titlesize}{\Huge}
\title{Diverse Queries and Feature Type Selection \\ for Plagiarism Discovery}

% Note: only give author names, not institute
\author{Šimon Suchomel, Jan Kasprzak, and Michal Brandejs}

% insert correct institute name
\institute{Faculty of Informatics, Masaryk University, Brno, Czech Republic}

% \email{kas@fi.muni.cz} % shows author email address below institute

%\date is unused by the current \maketitle

\font\logofont=fi-logo600 at .16\textwidth

\renewcommand{\sectionsize}{\Large}

\newcommand{\cemph}[1]{{\sffamily\bfseries\itshape \textcolor{SectionCol}{#1}}}
\newcommand{\lemph}[1]{{\rmfamily\itshape \textcolor{SectionCol}{#1}}}
\newcommand{\eitem}[1]{\item \cemph{#1}}

\newenvironment{ytemize}
\setlength{\itemsep}{0pt}
\setlength{\parskip}{0pt}

\conference{{\bf CLEF 2013}, 23--27 September 2013, Valencia, Spain}

\setlength{\figbotskip}{\smallskipamount}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Beginning of Document

% Uncomment to put footer logo on left side, and
% conference name on right side of footer

% Some examples of caption control (remove % to check result)

%\renewcommand{\algorithmname}{Algoritme} % for Dutch

%\renewcommand{\mastercapstartstyle}[1]{\textit{\textbf{#1}}}
%\renewcommand{\algcapstartstyle}[1]{\textsc{\textbf{#1}}}
%\renewcommand{\algcapbodystyle}{\bfseries}
%\renewcommand{\thealgorithm}{\Roman{algorithm}}
\vspace*{-.06\textwidth}
\begin{minipage}[c]{.11\textwidth}
\vspace{-.75\textwidth}
\hbox{\hskip -.83\textwidth\includegraphics[width=3\textwidth]{znak_MU_modry}\hskip -\textwidth}
\vspace{-\textwidth}
\begin{minipage}[c]{.7\textwidth}
\renewcommand{\baselinestretch}{2.0}\normalsize
{\titlesize \bf \@title}\par
\renewcommand{\baselinestretch}{1.0}\normalsize
\vspace{0.4\titleskip}
{\authorsize {\bf\@author} \par}
\vspace{0.2\titleskip}
\ifthenelse{\equal{\printemail}{}}{%nothing
\vspace{0.2\titleskip}
\texttt{\printemail}
\begin{minipage}[c]{.15\textwidth}
\hbox to \hsize{\logofont SL\hss}
\vspace{-.02\textwidth}
%%% Beginning of Multicols Environment
%{\sffamily\itshape
\begin{multicols}{2}\setlength{\columnseprule}{0pt}
\section{Introduction}

PAN 2013 LOrem ipsum Lorem ipsum Lorem ipsumLorem ipsumLorem ipsumLorem ipsumLorem ipsum
\includegraphics[width=0.6\textwidth]{img/source_retrieval_process.pdf}
\caption{Plagiarism discovery process.}
\label{fig:process}
\begin{multicols}{2}
Querying means using the search engine effectively in order to retrieve as many relevant
documents as possible with the minimum number of queries.
%We consider a resulting document relevant if it shares some text characteristics with the suspicious document.
In real-world use, queries carry an appreciable cost, so minimizing them should be one of the top priorities.
%\subsection{Types of Queries}
Three diverse types of queries were extracted from the suspicious document.\\
\begin{minipage}{0.55\linewidth}
\subsection{Keyword-Based Queries}
\item TF--IDF based automated keyword extraction;
\item 5 tokens long;
\item Deterministic;
\item Non-positional;
\begin{minipage}{0.45\linewidth}
\includegraphics[width=1\linewidth]{img/document_keywords.pdf}
\begin{minipage}{0.55\linewidth}
\subsection{Intrinsic Plagiarism-Based Queries}
\item Averaged Word Frequency Class based chunking~\cite{AWFC};
\item Random sentence selection from the chunk;
\item Non-deterministic;
\begin{minipage}{0.45\linewidth}
\includegraphics[width=1\linewidth]{img/document_awfc.pdf}
\begin{minipage}{0.55\linewidth}
\subsection{Paragraph-Based Queries}
\item Longest sentences from miscellaneous paragraphs;
\item Deterministic;
\begin{minipage}{0.45\linewidth}
\includegraphics[width=1\linewidth]{img/document_paragraphs.pdf}
\includegraphics[width=0.8\linewidth]{img/queryprocess.pdf}
\caption{Stepwise query execution process.}
\section{Selecting}
Document snippets were used to decide whether to download a document for text alignment.
We used a 2-tuple measure, which indicates how many neighbouring word pairs occur both in the snippet and in the suspicious document.
The performance of this measure is shown in Figure~\ref{fig:snippet_graph}.
Given this measure, a threshold for the download decision needs to be set so as to maximize the number of discovered similarities
and minimize the total number of downloads.
A profitable threshold is one that corresponds to the largest distance between the two curves.
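As a sketch of how the measure and the threshold choice could be written down (the symbols $B(\cdot)$, $m$, $S$, $D$ and $t^{*}$ are illustrative notation assumed here, not taken from the evaluation), let $B(x)$ denote the set of neighbouring word pairs in a text $x$. For a snippet $s$ and the suspicious document $d$,
\[
  m(s,d) = \left| B(s) \cap B(d) \right| .
\]
Assuming $S(t)$ and $D(t)$ are the fractions of discovered similarities and of downloads that remain when only snippets with $m(s,d) \ge t$ are downloaded, a profitable threshold would then be
\[
  t^{*} = \arg\max_{t} \bigl( S(t) - D(t) \bigr) .
\]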
\includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}
\caption{Downloads and similarities performance.}
\label{fig:snippet_graph}

\section{Text Alignment}

\section{Conclusion}
%% Note: use of BibTeX also works!!

\bibliographystyle{plain}
\begin{thebibliography}{1}
\cemph{Masaryk University Information System}\\
{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.
\cemph{Czech National Archive of Graduate Theses}\\
{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.
\cemph{Sven Meyer zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\
{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}.
\end{thebibliography}
\cemph{Contact information:}\\
Šimon Suchomel, {\tt suchomel@fi.muni.cz},\\
Jan Kasprzak, {\tt kas@fi.muni.cz}.