X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-paper%2Fsimon-source_retrieval.tex;h=b3289c9fe6beb2c3e0f1e7e818d68212a2849462;hb=060415b2f4c4f0482b6a128f8103651e1c9af823;hp=e32c1913b6233549de774586b17a7bbffdd08d4e;hpb=b16fe92fb7dd5fd6667718a0fe3d91e7ad95a581;p=pan13-paper.git

diff --git a/pan13-paper/simon-source_retrieval.tex b/pan13-paper/simon-source_retrieval.tex
index e32c191..b3289c9 100755
--- a/pan13-paper/simon-source_retrieval.tex
+++ b/pan13-paper/simon-source_retrieval.tex
@@ -1 +1,78 @@
 \section{Source Retrieval}
+Source retrieval is a subtask of the plagiarism detection process during
+which only a relatively small subset of documents is retrieved from a
+large corpus. These candidate documents are then usually compared in detail with the
+suspicious document. In the PAN 2013 source retrieval subtask the main goal was to
+identify web pages which had been used as sources of plagiarism during the creation of the
+test corpus.
+The test corpus contained XX documents, each discussing exactly one theme.
+These documents were created intentionally by
+semiprofessional writers; thus they feature nearly realistic plagiarism cases.
+Such conditions are similar to a realistic plagiarism detection scenario, such as that faced by
+state-of-the-art commercial plagiarism detection systems or by the anti-plagiarism service developed and
+used at Masaryk University. The main difference between a real-world corpus
+of suspicious documents, such as a corpus created from theses stored in the Information System of Masaryk University,
+and the corpus of suspicious documents used during the PAN 2013 competition is that in the PAN
+corpus each document contains plagiarized passages. Therefore we can deepen the search
+in those parts of the document where no similar passage has yet been found. This is the main
+idea behind improving the recall of detected plagiarism in a suspicious document.
+
+
+\begin{figure}
+  \centering
+  \includegraphics[width=1.00\textwidth]{img/source_retrieval_process.pdf}
+  \caption{Source retrieval process.}
+  \label{fig:source_retr_process}
+\end{figure}
+
+Online plagiarism detection can be viewed as a reverse engineering task in which
+we need to find the original documents from which the plagiarized document was created.
+During the creation process, the plagiarist locates the original documents using a search engine.
+The user decides which queries to submit to the search engine and which of the results on the result page to use.
+In a real-world scenario, the corpus is the whole Web and the search engine can be a contemporary commercial search engine
+which scales to the size of the Web. This methodology is based on the fact that we do not
+possess enough resources to download and effectively process the whole corpus.
+In the case of the PAN 2013 competition, the corpus
+of source documents is the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.
+As a document retrieval tool for the competition we used the ChatNoir~\cite{chatnoir} search engine, which indexes the English
+subset of the ClueWeb.
+The reverse engineering decision process consists in creating suitable queries on the basis of the suspicious document
+and in deciding what to actually download from the search results and what to report as a plagiarism case.
+
+These first two stages are shown in Figure~\ref{fig:source_retr_process} as Querying and Selecting. Selected results
+from the search engine are immediately textually aligned with the suspicious document (see Section~\ref{text_alignment} for more details).
+This is the last decision phase -- what to report.
+If any continuous passage of reused text is detected, the result document is reported,
+the corresponding continuous passages in the suspicious document are marked as `discovered',
+and those parts are not processed further.
+
+\subsection{Querying}
+Querying means using the search engine effectively in order to retrieve as many relevant
+documents as possible with the minimum number of queries. We consider a resulting document relevant
+if it shares some textual characteristics with the suspicious document.
+
+We used three different types of queries\footnote{We used a similar three-way methodology in the PAN 2012
+Candidate Document Retrieval subtask. However, this time we completely replaced the header-based queries
+with paragraph-based queries, since the header-based queries did not pay off in the overall process.}:
+i) keyword-based queries, ii) intrinsic-plagiarism-based
+queries, and iii) paragraph-based queries. Three main properties distinguish the types of queries: i) positional; ii) phrasal; iii) deterministic.
+A positional query carries extra information about the textual interval of the suspicious document which the query represents.
+A phrasal query aims to retrieve documents containing the same small piece of text; such queries are usually created from closely coupled words.
+Deterministic queries for a specific suspicious document are always the same, no matter how many times we run the software.
+In contrast, nondeterministic queries created by the software may differ between two runs.
+
+\subsubsection{Keyword-Based Queries}
+
+\subsubsection{Intrinsic-Plagiarism-Based Queries}
+\subsubsection{Paragraph-Based Queries}
+\subsection{Search Control}
+
+
+\subsection{Result Selection}
+\subsection{Snippet Control}
+
+
+
+
+
+