Drobne upravy chyb

[pan13-paper.git] / pan13-paper / simon-source_retrieval.tex
diff --git a/pan13-paper/simon-source_retrieval.tex b/pan13-paper/simon-source_retrieval.tex

index e32c1913b6233549de774586b17a7bbffdd08d4e..b5b4a5a5da6b9739d2a3acd8e8c1aba647b2119d 100755 (executable)
--- a/pan13-paper/simon-source_retrieval.tex
+++ b/pan13-paper/simon-source_retrieval.tex
@@ -1 +1,208 @@
  \section{Source Retrieval}\r
+The source retrieval is a subtask in a plagiarism detection process during\r
+which only a relatively small subset of documents are retrieved from the\r
+large corpus. Those candidate documents are usually further compared in detail with the\r
+suspicious document. In the PAN 2013 source retrieval subtask the main goal was to\r
+identify web pages which have been used as a source of plagiarism for test corpus creation.\r
+\r
+The test corpus contained 58 documents each discussing only one theme.\r
+Those documents were created intentionally by\r
+ semiprofessional writers, thus they feature nearly realistic plagiarism cases~\cite{plagCorpus}. \r
+ Such conditions are similar to a realistic plagiarism detection scenario, such as for\r
+state of the art commercial plagiarism detection systems or the anti-plagiarism service developed on and\r
+utilized at the Masaryk University. The main difference between real-world corpus \r
+of suspicious documents such as for example corpus created from theses stored in the Information System of Masaryk University\r
+and the corpus of suspicious documents used during the PAN 2013 competition is that in the PAN\r
+corpus each document contains plagiarism passages. Therefore we can deepen the search during the process\r
+in certain parts of the document where no similar passage has yet been found. This is the main\r
+idea of improving recall of detected plagiarism in a suspicious document.\r
+\r
+\r
+\begin{figure}\r
+  \centering\r
+  \includegraphics[width=1.00\textwidth]{img/source_retrieval_process.pdf}\r
+  \caption{Source retrieval process.}\r
+  \label{fig:source_retr_process}\r
+\end{figure}\r
+\r
+An online plagiarism detection can be viewed as a reverse engineering task where \r
+we need to find original documents from which the plagiarized document was created.\r
+During the process the plagiarist locates original documents with the use of a search engine.\r
+The user decides what query the search engine to ask and which of the results from the result page to use.\r
+In the real-world scenario the corpus is the whole Web and the search engine can be a contemporary commercial search engine\r
+which scales to the size of the Web. This methodology is based on the fact that we do not\r
+possess enough resources to download and effectively process the whole corpus.\r
+In the case of PAN 2013 competition the corpus\r
+of source documents is the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.\r
+\r
+As a document retrieval tool for the competition we utilized the ChatNoir~\cite{chatnoir} search engine which indexes the English\r
+subset of the ClueWeb.   \r
+The reverse engineering decision process resides in creation of suitable queries on the basis of the suspicious document\r
+and in decision what to actually download and what to report as a plagiarism case from the search results.\r
+\r
+These first two stages are shown in Figure~\ref{fig:source_retr_process} as Querying and Selecting. Selected results \r
+from the search engine are forthwith textually aligned with the suspicious document (see section~\ref{text_alignment} for more details).\r
+This is the last decision phase -- what to report.\r
+If there is any continuous passage of reused text detected, the result document is reported\r
+ and the continuous passages in the suspicious document are marked as 'discovered' and no further processing\r
+of those parts is done. \r
+ \r
+\subsection{Querying}\r
+Querying means to effectively utilize the search engine in order to retrieve as many relevant\r
+documents as possible with the minimum amount of queries. We consider the resulting document relevant \r
+if it shares some of text characteristics with the suspicious document. In real-world queries as such\r
+represent appreciable cost, therefore their minimization should be one of the top priorities.\r
+\r
+We used 3 different types of queries\footnote{We used similar three-way based methodology in PAN 2012 \r
+Candidate Document Retrieval subtask. However, this time we completely replaced the headers based queries\r
+with paragraph based queries, since the headers based queries did not pay off in the overall process.}:\r
+i) keywords based queries, ii) intrinsic plagiarism\r
+based queries, and iii) paragraph based queries. Three main properties distinguish each type of query: i) Positional; ii) Phrasal; iii) Deterministic.\r
+Positional queries carry extra information about a textual interval in the suspicious document which the query represents.\r
+A phrasal query aims for retrieval of documents containing the same small piece of a text. They are usually created from closely coupled words. \r
+Deterministic queries for specific suspicious document are always the same no matter how many times we run the software. \r
+On the contrary the software can create in two runs potentially different nondeterministic queries.\r
+\r
+\subsubsection{Keywords Based Queries.}\r
+The keywords based queries are composed of automatically extracted keywords from the whole suspicious document.\r
+Their purpose is to retrieve documents concerning the same theme. Two documents discussing the \r
+same theme usually share a set of overlapping keywords. Also the combination of keywords in\r
+query matters. \r
+As a method for automated keywords extraction, we used a frequency based approach described in~\cite{suchomel_kas_12}.\r
+The method combines term frequency analysis with TF-IDF score~\cite{Introduction_to_information_retrieval}. As a reference\r
+corpus we used English web corpus~\cite{ententen} crawled by SpiderLink~\cite{SpiderLink} in 2012 which contains 4.65 billion tokens. \r
+\r
+Each keywords based query was constructed from five top ranked keywords consecutively. Each keyword was\r
+used only in one query. Too long keywords based queries would be overspecific and it would have resulted\r
+in a low recall. On the other hand having constructed too short queries (one or two tokens) would have resulted\r
+in a low precision and also possibly low recall since they would be too general.\r
+\r
+In order to direct the search more at the highest ranked keywords we also extracted their \r
+most frequent two and three term long collocations. These were combined also into queries of 5 words.\r
+Resulting the 4 top ranked keywords alone can appear in two different queries, one from the keywords\r
+alone and one from the collocations. Collocation describes its keyword better than the keyword alone. \r
+\r
+The keywords based queries are non-positional, since they represent the whole document. They are also non-phrasal since\r
+they are constructed of tokens gathered from different parts of the text. And they are deterministic; for certain input\r
+document the extractor always returns the same keywords.\r
+\r
+\subsubsection{Intrinsic Plagiarism Based Queries.}\r
+The second type of queries purpose to retrieve pages which contain similar text detected\r
+as different, in a manner of writing style, from other parts of the suspicious document.\r
+Such a change may point out plagiarized passage which is intrinsically bound up with the text.  \r
+We implemented vocabulary richness method which computes average word frequency class value for \r
+a given text part. The method is described in~\cite{awfc}. The problem is that generally methods\r
+based on the vocabulary statistics work better for longer texts. According to authors this method\r
+scales well for shorter texts than other text style detection methods. \r
+Still the usage is in our case limited by relatively short texts. It is also difficult to determine\r
+what parts of text to compare. Therefore we used sliding window concept for text chunking with the \r
+same settings as described in~\cite{suchomel_kas_12}.\r
+\r
+A representative sentence longer than 6 words was randomly selected among those that apply from the suspicious part of the document.\r
+An intrinsic plagiarism based query is created from the representative sentence leaving out stop words.\r
+\r
+The intrinsic plagiarism based queries are positional. They carry the position of the representative sentence in the document.\r
+They are phrasal, since they represent a search for a specific sentence. And they are\r
+nondeterministic, because the representative sentence is selected randomly. \r
+ \r
+\subsubsection{Paragraph Based Queries.}\r
+The purpose of paragraph based queries is to check some parts of the text in more depth.\r
+Those are parts for which no similarity has been found during previous searches. \r
+\r
+For this case we considered a paragraph as a minimum text chunk for plagiarism to occur. \r
+It is discussible whether a plagiarist would be persecuted for plagiarizing only one sentence in a paragraph.\r
+Also a detection of a specific sentence is very difficult if we want to avoid exhaustive search approach.\r
+If someone is to reuse some peace of continuous text, it would probably be no shorter than a paragraph. \r
+Despite the fact, that paragraphs differ in length, we represent one paragraph by only one query.\r
+\r
+\r
+The paragraph based query was created from each paragraph of suspicious document.\r
+From each paragraph we extracted the longest sentence from which the query was constructed.\r
+Ideally the extracted sentence should carry the highest information gain.\r
+The query was maximally 10 words in length which is the upper bound of ChatNoir\r
+and was constructed from the selected sentence by omitting stop words.\r
+\r
+\subsection{Search Control}\r
+For each suspicious document we prepared all three types of queries during the first phase at once.\r
+Queries were executed stepwise. \r
+After processing each query the results were evaluated (see the following subsection~\ref{resSelection} for more details) and from\r
+all textual similarities between each result and the suspicious document the suspicious document intervals of those similarities\r
+were marked as 'discovered'. \r
+At first there were processed the keywords based queries. All of them were\r
+always executed. \r
+After having all the keywords based queries processed, the intrinsic plagiarism based queries were executed according to \r
+their creation sequence. \r
+Since they carry its position, not all of the intrinsic plagiarism based queries were carried out.\r
+During the execution, if any of the query position intersected with any of the 'discovered' interval, the\r
+query was dropped out. In the same way, the last paragraph based queries were processed. \r
+\r
+This search control results in two major properties. Firstly, the source retrieval effort was increased \r
+in parts of the suspicious document, where there has not yet been found any textual similarity.\r
+It was increased especially by the paragraph based queries. And secondly, after detection a similarity for a certain part of the text,\r
+no more intentionally retrieval attempts for that part were effectuated. Meaning that all\r
+discovered search engine results were evaluated, but there were executed no more queries regarding that passage.\r
+\r
+\r
+\subsection{Result Selection}~\label{resSelection}\r
+The second main decisive area about source retrieval task is to decide which from the search engine results to download.\r
+This process is represented in Figure~\ref{fig:source_retr_process} as 'Selecting'. \r
+Nowadays in the real-world a download is very cheap and quick operation. There can be some disk space considerations\r
+if there is a need to store original downloaded documents. The main cost represents documents post processing. \r
+Mainly on the Internet there is a wide range of file formats, which for text alignment must be\r
+converted into plaintext. This can time and computational-consuming. For example from many\r
+pdf documents the plain text is hardly extractable, thus one need to use optical character recognition methods.\r
+\r
+The ChatNoir offers snippets for discovered documents. The snippet generation is considered costless\r
+operation. The snippet purpose is to have a quick glance at a small extract of resulting page.\r
+The extract is maximally 500 characters long and it is a portion of the document around given keywords.\r
+On the basis of snippet, we needed to decide whether to actually download the result or not.\r
+\r
+\subsection{Snippet Control}\r
+\begin{figure}\r
+  \centering\r
+  \includegraphics[width=1.00\textwidth]{img/snippets_graph.pdf}\r
+  \caption{Downloads and similarities performance.}\r
+  \label{fig:snippet_graph}\r
+\end{figure}\r
+\r
+Since the snippet is relatively small and it can be discontinuous part of the text, the \r
+text alignment methods described in section~\ref{text_alignment} were insufficient \r
+in decision making over document download. Therefore we chose to compare existence of snippet word tuples\r
+in the suspicious document. For 1-tuples the measure means how many words from the snippet\r
+also exist in the suspicious document. If the snippet contains many common words they may\r
+also occur in many documents. In this case the 1-tuple measurement is little decisive. \r
+\r
+We used 2-tuples measurement, which indicates how many neighbouring word pairs coexist in the snippet and in the suspicious document.\r
+We decided according to this value whether to download the source or not. For the deduction \r
+ of the threshold value we used 4413 search results from various queries according to documents \r
+ in the training corpus. Each resulting document was textually aligned to its corresponding suspicious document.\r
+One similarity represents continuous passage of text alignment similarity as is described in the following section~\ref{text_alignment}.\r
+In this way we obtained 248 similarities in total after downloading all of the 4431 documents.\r
+\r
+The 2-tuples similarity performance is depicted in Figure~\ref{fig:snippet_graph}.\r
+Horizontal axis represents threshold of the 2-tuples similarity percentage between the snippet and the suspicious document.\r
+The graph curves represent obtain resource percentage according to the snippet similarity threshold.\r
+A profitable threshold is the one with the largest distance between those two curves.\r
+We chose threshold of the snippet similarity to 20\%, which in the graph corresponds to 20\% of all\r
+downloads and simultaneously with 70\% discovered similarities.\r
+ \r
+\subsection{Source Retrieval Results}\r
+In PAN 2013 Source Retrieval subtask we competed with other 8 teams. \r
+There can not be selected the best approach because there were several independent\r
+performance measures. Possibly each approach has its pros and cons and many approaches\r
+are usable in different situations. \r
+\r
+We believe that in the realistic plagiarism detection the most important is keeping the number of\r
+queries low and simultaneously maximizing recall. \r
+% It is often some tradeoff between cost and efectivness.\r
+It is also advisable to keep the number of downloads down, but on the other hand,\r
+it is relatively cheep and easily scalable operation.\r
+\r
+Our approach had the second best ration of recall to the number of used queries, which\r
+tells about the query efficacy. The approach with the best ratio used few queries (4.9 queries per document which\r
+was 0.4 of the amount we used), but also obtained the lowest recall (0.65 of our recall).\r
+The approach with highest recall (and also lowest precision) achieved 2.8 times higher recall with 3.9 times more queries compared to ours.\r
+\r
+Our approach achieved also low precision, which means we reported many more results and they\r
+were not considered as correct hits. On the other hand each reported result contained some\r
+textual similarity according to text alignment subtask score, which we believe is still worthwhile to report.\r