Continuing to write

diff --git a/pan13-paper/simon-source_retrieval.tex b/pan13-paper/simon-source_retrieval.tex

index 2cb1a8f9c4f32945ef00a87e94c8fcc2fd9730c3..d5b338b948a6cc4a13fac8319d8bb51328861b7e 100755 (executable)
--- a/pan13-paper/simon-source_retrieval.tex
+++ b/pan13-paper/simon-source_retrieval.tex
@@ -50,7 +50,8 @@ of those parts is done.
  \subsection{Querying}\r
  Querying means to effectively utilize the search engine in order to retrieve as many relevant\r
  documents as possible with the minimum amount of queries. We consider the resulting document relevant \r
-if it shares some of text characteristics with the suspicious document.  \r
+if it shares some text characteristics with the suspicious document. In real-world settings, queries\r
+carry an appreciable cost, so minimizing their number should be one of the top priorities.\r
  \r
  We used 3 different types of queries\footnote{We used similar three-way based methodology in PAN 2012 \r
  Candidate Document Retrieval subtask. However, this time we completely replaced the headers based queries\r
@@ -143,8 +144,31 @@ discovered search engine results were evaluated, but there were executed no more
  \r
  \r
  \subsection{Result Selection}\r
+The second main decision area in the source retrieval task is choosing which of the search engine results to download.\r
+This process is represented in Figure~\ref{fig:source_retr_process} as `Selecting'. \r
+Nowadays, downloading is a very cheap and quick operation in real-world settings. There can be some disk space considerations\r
+if the original downloaded documents need to be stored. The main cost lies in post-processing the documents. \r
+The Internet hosts a wide range of file formats, which must be\r
+converted into plain text for text alignment. This can be time-consuming and computationally expensive. For example, plain text\r
+is hard to extract from many PDF documents, so one needs to use optical character recognition methods.\r
+\r
+ChatNoir offers snippets for discovered documents. Snippet generation is considered a costless\r
+operation. The purpose of a snippet is to give a quick glance at a small extract of the resulting page.\r
+The extract is at most 500 characters long and is a portion of the document around the given keywords.\r
+On the basis of the snippet, we needed to decide whether to actually download the result or not.\r
+\r
+Since the snippet is relatively small and can be a discontinuous part of the text, the \r
+text alignment methods described in Section~\ref{text_alignment} were insufficient for \r
+\r
+\r
  \r
  \subsection{Snippet Control}\r
+\begin{figure}\r
+  \centering\r
+  \includegraphics[width=1.00\textwidth]{img/snippets_graph.pdf}\r
+  \caption{Downloads and similarities performance.}\r
+  \label{fig:snippet_graph}\r
+\end{figure}\r
  \subsection{Source Retrieval Results}\r
  \r
  \r