\section{Source Retrieval}
Source retrieval is a subtask of the plagiarism detection process in which
only a relatively small subset of documents is retrieved from a large corpus.
These candidate documents are usually further compared in detail with the
suspicious document. In the PAN 2013 source retrieval subtask the main goal was to
identify web pages which were used as sources of plagiarism during the creation of the
test corpus.
The test corpus contained XX documents, each discussing one and only one theme.
The documents were created intentionally by semiprofessional writers, and thus
they feature nearly realistic plagiarism cases. Such conditions are similar to a
realistic plagiarism detection scenario, as faced by state-of-the-art commercial
plagiarism detection systems or by the anti-plagiarism service developed and
utilized at Masaryk University. The main difference between a real-world corpus
of suspicious documents, such as a corpus created from theses stored in the
Information System of Masaryk University, and the corpus of suspicious documents
used in the PAN 2013 competition is that in the PAN corpus every document contains
plagiarized passages. We can therefore deepen the search in those parts of a
document where no similar passage has yet been found. This is the main idea behind
improving the recall of detected plagiarism in a suspicious document.
\begin{figure}
  \centering
  \includegraphics[width=1.00\textwidth]{img/source_retrieval_process.pdf}
  \caption{Source retrieval process.}
  \label{fig:source_retr_process}
\end{figure}

Online plagiarism detection can be viewed as a reverse engineering task in which
we need to find the original documents from which a plagiarized document was created.
During this process the plagiarist locates original documents with the use of a search engine.
The plagiarist decides which queries to pose to the search engine and which of the results
from the result page to use.
In a real-world scenario the corpus is the whole Web, and the search engine can be a
contemporary commercial search engine which scales to the size of the Web. This methodology
is based on the fact that we do not possess enough resources to download and effectively
process the whole corpus.
In the PAN 2013 competition the corpus of source documents is the
ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.
As the document retrieval tool for the competition we utilized the ChatNoir~\cite{chatnoir}
search engine, which indexes the English subset of the ClueWeb.
The reverse engineering decision process resides in creating suitable queries on the basis
of the suspicious document and in deciding what to actually download and what to report
as a plagiarism case from the search results.

The first two stages can be seen in figure~\ref{fig:source_retr_process} as Querying and
Selecting. Selected results from the search engine are forthwith textually aligned with the
suspicious document (see section~\ref{text_alignment} for more details).
This is the last decision phase -- what to report.
If any continuous passage of reused text is detected, the result document is reported,
the continuous passages in the suspicious document are marked as ``discovered'', and no
further processing of those parts is done.
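The querying--selecting--reporting loop described above can be sketched as follows. This is a minimal illustration only: the function names, the query representation, and the data shapes are our assumptions, not the actual implementation.

```python
def source_retrieval(suspicious_doc, queries, search, align):
    """Sketch of the source retrieval loop (illustrative, not the real system).

    queries: list of (query_text, interval or None), where the interval is the
             part of the suspicious document the query represents;
    search:  query_text -> list of (url, candidate_document_text);
    align:   (suspicious, candidate) -> list of matching character intervals.
    """
    discovered = []   # intervals of the suspicious document already covered
    reported = []

    def is_covered(interval):
        return interval is not None and any(
            a <= interval[0] and interval[1] <= b for a, b in discovered)

    for text, interval in queries:
        if is_covered(interval):      # deepen the search only where nothing was found yet
            continue
        for url, candidate in search(text):
            passages = align(suspicious_doc, candidate)   # textual alignment
            if passages:              # a continuous reused passage -> report it
                reported.append(url)
                discovered.extend(passages)   # no further processing of these parts
    return reported
```

The key design point mirrored here is the recall-oriented skip: a positional query is dropped once its interval has already been marked as discovered.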
\subsection{Querying}
Querying means utilizing the search engine effectively in order to retrieve as many
relevant documents as possible with a minimum number of queries. We consider a retrieved
document relevant if it shares some textual characteristics with the suspicious document.

We used 3 different types of queries\footnote{We used a similar three-way methodology in
the PAN 2012 Candidate Document Retrieval subtask. However, this time we completely
replaced the header based queries with paragraph based queries, since the header based
queries did not pay off in the overall process.}:
i) keywords based queries, ii) intrinsic plagiarism based queries, and iii) paragraph
based queries. Three main properties distinguish the query types: i) positional,
ii) phrasal, and iii) deterministic.
Positional queries carry extra information about the textual interval in the suspicious
document which the query represents.
A phrasal query aims to retrieve documents containing the same small piece of text; such
queries are usually created from closely coupled words.
Deterministic queries for a specific suspicious document are always the same no matter
how many times we run the software; nondeterministic queries, on the contrary, can
differ between two runs.

\subsubsection{Keywords Based Queries.}
The keywords based queries are composed of keywords automatically extracted from the
whole suspicious document. Their purpose is to retrieve documents concerning the same
theme, since two documents discussing the same theme usually share a set of overlapping
keywords. The combination of keywords in a query also matters.
As the method for automated keyword extraction we used the frequency based approach
described in~\cite{suchomel_kas_12}, which combines term frequency analysis with the
TF-IDF score~\cite{Introduction_to_information_retrieval}.
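The ranking step can be sketched as a plain TF-IDF computation. This is a simplification under our own assumptions (names and the exact IDF smoothing are illustrative); the actual extractor of~\cite{suchomel_kas_12} combines more signals.

```python
import math
from collections import Counter

def top_keywords(doc_tokens, ref_doc_freq, ref_doc_count, n=5):
    """Rank the words of a document by TF-IDF against a reference corpus.

    doc_tokens:    tokens of the suspicious document
    ref_doc_freq:  word -> number of reference documents containing it
    ref_doc_count: total number of reference documents
    """
    tf = Counter(doc_tokens)
    def tfidf(word):
        df = ref_doc_freq.get(word, 0)
        idf = math.log(ref_doc_count / (1 + df))   # smoothed IDF (our choice)
        return tf[word] * idf
    return sorted(tf, key=tfidf, reverse=True)[:n]
```

Common words frequent in the reference corpus (articles, prepositions) receive a near-zero IDF and therefore never reach the top of the ranking.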
As a reference
corpus we used the English web corpus~\cite{ententen} crawled by SpiderLink~\cite{SpiderLink}
in 2012, which contains 4.65 billion tokens.

The keywords based queries were constructed by consecutively taking five keywords from
the top of the ranked list; each keyword was used in only one query. Longer keywords
based queries would be over-specific and would result in low recall. On the other hand,
too short queries (one or two tokens) would result in low precision and possibly also in
low recall, since they would be too general.

In order to direct the search more towards the highest ranked keywords, we also extracted
their most frequent two and three term collocations. These were likewise combined into
queries of 5 words. As a result, each of the 4 top ranked keywords can appear in two
different queries: one built from the keywords alone and one built from the collocations.
A collocation describes its keyword better than the keyword alone.

The keywords based queries are non-positional, since they represent the whole document.
They are also non-phrasal, since they are constructed from tokens gathered from different
parts of the text. And they are deterministic: for a given input document the extractor
always returns the same keywords.

\subsubsection{Intrinsic Plagiarism Based Queries.}
The purpose of the second type of queries is to retrieve pages containing text which is
detected as different, in terms of writing style, from other parts of the suspicious
document. Such a change may point to a plagiarized passage which is intrinsically bound
up with the text. We implemented the vocabulary richness method which computes the
average word frequency class value for a given part of the text; the method is described
in~\cite{awfc}. The problem is that methods based on vocabulary statistics generally
work better for longer texts. According to its authors, however, this method scales to
shorter texts better than other style detection methods.
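A minimal sketch of the average word frequency class computation, following the usual word frequency class definition $\lfloor \log_2(f(w^*)/f(w)) \rfloor$, where $w^*$ is the most frequent word of a reference corpus. The handling of unseen words below is our own assumption, not taken from~\cite{awfc}.

```python
import math

def avg_word_frequency_class(chunk_words, corpus_freq):
    """Average word frequency class of a text chunk.

    corpus_freq: word -> frequency in a reference corpus.
    class(w) = floor(log2(f_max / f(w))); rare words get a high class,
    so a high average suggests richer (more unusual) vocabulary.
    """
    f_max = max(corpus_freq.values())
    # Words absent from the reference corpus: one above the highest
    # possible observed class (our assumption, not part of the method).
    unseen = math.floor(math.log2(f_max)) + 1
    classes = [
        math.floor(math.log2(f_max / corpus_freq[w])) if w in corpus_freq else unseen
        for w in chunk_words
    ]
    return sum(classes) / len(classes)
```

A chunk whose average class deviates markedly from the document-wide average is then a candidate for a style change.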
Still, its usage is in our case limited by the relatively short texts. It is also
difficult to determine which parts of the text to compare. Therefore we used the sliding
window concept for text chunking, with the same settings as described
in~\cite{suchomel_kas_12}.

A representative sentence longer than 6 words was randomly selected from the qualifying
sentences of the suspicious part of the document. An intrinsic plagiarism based query is
created from the representative sentence by leaving out stop words.

The intrinsic plagiarism based queries are positional: they carry the position of the
representative sentence in the document. They are phrasal, since they represent a search
for a specific sentence. And they are nondeterministic, because the representative
sentence is selected randomly.

\subsubsection{Paragraph Based Queries.}
These queries were executed last (see the Search Control section). It would be extremely
difficult to detect a single sentence in any other way than by exhaustive search methods.

\subsection{Search Control}
We do not optimize towards well-formed keywords based queries; this costs more queries.

\subsection{Result Selection}
\subsection{Snippet Control}