pan13-paper/simon-source_retrieval.tex

   1 \section{Source Retrieval}~\label{source_retr}\r
   2 The source retrieval is a subtask in a plagiarism detection process during\r
   3 which only a relatively small subset of documents are retrieved from the\r
   4 large corpus. Those candidate documents are usually further compared in detail with the\r
   5 suspicious document. In PAN 2013 source retrieval subtask the main goal was to\r
   6 identify web pages which have been used as a source of plagiarism for test corpus creation.\r
   7 \r
   8 The test corpus contained 58 documents each discussing only one theme.\r
   9 Those documents were created intentionally by\r
  10  semiprofessional writers, thus they featured nearly realistic plagiarism cases~\cite{plagCorpus}.\r
  11 Resources were looked up in the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.\r
  12  Such conditions are similar to a realistic plagiarism detection scenario.%, such as for state of the art commercial plagiarism detection systems or the anti-plagiarism service developed on and utilized at the Masaryk University.\r
  13 The main difference between real-world corpus \r
  14 of suspicious documents such as for example corpus created from theses stored in the Information System of Masaryk University\r
  15 and the corpus of suspicious documents used during the PAN 2013 competition is that in the PAN\r
  16 corpus each document contains plagiarized passages. Therefore we can expected existence of\r
  17 a plagiarized passage and deepen the search during the process\r
  18 in certain parts of the document where no similar passage has yet been found.\r
  19 % This is the main idea of improving recall of detected plagiarism in a suspicious document.\r
  20 \r
  21 An online plagiarism detection can be viewed as a reverse engineering task where \r
  22 we need to find original documents from which the plagiarized document was created.\r
  23 During the process the plagiarist locates original documents with the use of a search engine.\r
  24 The user decides what query the search engine to ask and which of the results from the result page to use.\r
  25 \r
  26 %In the real-world scenario the corpus is the whole Web and the search engine can be a contemporary commercial search engine which scales to the size of the Web.\r
  27  \r
  28 The same methodology -- utilizing a search engine; is used for source retrieval. \r
  29 This approach is based on the fact that we do not\r
  30 possess enough resources to download and effectively process the whole corpus.\r
  31 In the case of PAN 2013 competition we utilized the ChatNoir~\cite{chatnoir} \r
  32 search engine which indexes the English subset of the ClueWeb.\r
  33 \r
  34 \begin{figure}\r
  35   \centering\r
  36   \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}\r
  37   \caption{Source retrieval process.}\r
  38   \label{fig:source_retr_process}\r
  39 \end{figure}\r
  40 \r
  41 The reverse engineering decision process resides in creation of suitable queries on the basis of the suspicious document\r
  42 and in decision what to actually download and what to report as a plagiarism case from the search results.\r
  43 \r
  44 These first two stages are shown in Figure~\ref{fig:source_retr_process} as Querying and Selecting. Selected results \r
  45 from the search engine are then textually aligned with the suspicious document (see section~\ref{text_alignment} for more details).\r
  46 %This is the last decision phase -- what to report.\r
  47 If there is any continuous passage of reused text detected, the result document is reported\r
  48  and the continuous passages in the suspicious document are marked as {\it discovered} and no further processing\r
  49 of those parts is done. \r
  50  \r
  51 \subsection{Querying}\r
  52 Querying means to effectively utilize the search engine in order to retrieve as many relevant\r
  53 documents as possible with the minimum amount of queries. We consider the resulting document relevant \r
  54 if it shares some of text characteristics with the suspicious document. In real-world queries as such\r
  55 represent appreciable cost, therefore their minimization should be one of the top priorities.\r
  56 \r
  57 We used 3 different types of queries\footnote{We used similar three-way based methodology in PAN 2012 \r
  58 Candidate Document Retrieval subtask. However, this time we completely replaced the headers based queries\r
  59 with paragraph based queries, since the headers based queries did not pay off in the overall process.}:\r
  60 i) keywords based queries, ii) intrinsic plagiarism\r
  61 based queries, and iii) paragraph based queries. Three main properties distinguish each type of query: i) Positional; ii) Phrasal; iii) Deterministic.\r
  62 Positional queries carry extra information about a textual interval in the suspicious document which the query represents.\r
  63 A phrasal query aims for retrieval of documents containing the same small piece of text. They are usually created from closely coupled words. \r
  64 Deterministic queries for specific suspicious document are always the same no matter how many times we run the software. \r
  65 %On the contrary the software can create in two runs potentially different nondeterministic queries.\r
  66 \r
  67 \subsubsection{Keywords Based Queries.}\r
  68 The keywords based queries are composed of automatically extracted keywords from the whole suspicious document.\r
  69 Their purpose is to retrieve documents concerning the same theme.\r
  70 %Two documents discussing the same theme usually share a set of overlapping keywords. Also the combination of keywords in query matters. \r
  71 As a method for automated keywords extraction, we used a frequency based approach described in~\cite{suchomel_kas_12}.\r
  72 The method combines term frequency analysis with TF-IDF score.\r
  73 As a reference corpus we used English web corpus~\cite{ententen} crawled by SpiderLink~\cite{SpiderLink} in 2012 which contains 4.65 billion tokens. \r
  74 \r
  75 Each keywords based query was constructed from five top ranked keywords consecutively.\r
  76 Each keyword was used only in one query. \r
  77 %Too long keywords based queries would be overspecific and it would have resulted in a low recall.\r
  78 %On the other hand having constructed too short queries (one or two tokens) would have resulted in a low precision and also possibly low recall since they would be too general.\r
  79 In order to direct the search more at the highest ranked keywords we also extracted their \r
  80 most frequent two and three term long collocations.\r
  81 These were combined also into queries of 5 words.\r
  82 Resulting the 4 top ranked keywords alone can appear in two different queries, one from the keywords\r
  83 alone and one from the collocations.\r
  84 %Collocation describes its keyword better than the keyword alone. \r
  85 \r
  86 The keywords based queries are non-positional, since they represent the whole document. They are also non-phrasal since\r
  87 they are constructed of tokens gathered from different parts of the text. And they are deterministic; for certain input\r
  88 document the extractor always returns the same keywords.\r
  89 \r
  90 \subsubsection{Intrinsic Plagiarism Based Queries.}\r
  91 The second type of queries purpose to retrieve pages which contain text detected\r
  92 as different, in a manner of writing style, from other parts of the suspicious document.\r
  93 %Such a change may point out plagiarized passage which is intrinsically bound up with the text.  \r
  94 %We implemented vocabulary richness method which computes average word frequency class value for \r
  95 %a given text part. The method is described in~\cite{awfc}.\r
  96 For this purpose we implemented vocabulary richness method~\cite{awfc} together with\r
  97 sliding windows concept for text chunking as described in~\cite{suchomel_kas_12}.\r
  98 %The problem is that generally methods based on the vocabulary statistics work better for longer texts.\r
  99 %According to authors this method scales well for shorter texts than other text style detection methods. \r
 100 %The usage of this method is in our case limited by relatively short texts.\r
 101 %It is also difficult to determine\r
 102 %what parts of text to compare. Therefore we used sliding window concept for text chunking with the \r
 103 %same settings as described in~\cite{suchomel_kas_12}.\r
 104 \r
 105 A representative sentence longer than 6 words was randomly selected among those that apply from the suspicious part of the document.\r
 106 The query was created from the representative sentence leaving out stop words.\r
 107 The intrinsic plagiarism based queries are positional. They carry the position of the representative sentence.%    in the document.\r
 108 They are phrasal, since they represent a search for a specific sentence. And they are\r
 109 nondeterministic, because the representative sentence is selected randomly. \r
 110  \r
 111 \subsubsection{Paragraph Based Queries.}\r
 112 The purpose of paragraph based queries is to check some parts of the text in more depth.\r
 113 Those are parts for which no similarity has been found during previous searches. \r
 114 For this case we considered a paragraph as a minimum text chunk for plagiarism to occur. \r
 115 %It is discussible whether a plagiarist would be persecuted for plagiarizing only one sentence in a paragraph.\r
 116 %A detection of a specific sentence is very difficult if we want to avoid exhaustive search approach.\r
 117 %If someone is to reuse some peace of continuous text, it would probably be no shorter than a paragraph. \r
 118 Despite the fact, that paragraphs differ in length, we represent one paragraph by only one query.\r
 119 \r
 120 %The paragraph based query was created from each paragraph of suspicious document.\r
 121 From each paragraph we extracted the longest sentence from which the query was constructed.\r
 122 Ideally the extracted sentence should carry the highest information gain.\r
 123 The query was maximally 10 words in length which is the upper bound of ChatNoir\r
 124 and was constructed from the selected sentence by omitting stop words.\r
 125 \r
 126 \subsection{Search Control}\r
 127 For each suspicious document we prepared all three types of queries during the first phase at once.\r
 128 Queries were executed stepwise. \r
 129 After processing each query the results were evaluated. %(see the following subsection~\ref{resSelection} for more details)\r
 130 From all textual similarities between each result and the suspicious document, the document intervals of those similarities\r
 131 were marked as {\it discovered}. \r
 132 \r
 133 %At first there were processed the keywords based queries.\r
 134 Firstly, there were always all of the keywords based queries executed. \r
 135 %After having all the keywords based queries processed,\r
 136 Secondly the intrinsic plagiarism based queries were executed according to \r
 137 their creation sequence. \r
 138 %Since they carry its position, not all of them were carried out.\r
 139 During the execution, if any of the query position intersected with any of the {\it discovered} interval, the\r
 140 query was dropped out. Analogically, the last there were paragraph based queries processed. \r
 141 \r
 142 This search control results in two major properties. Firstly, the source retrieval effort was increased \r
 143 in parts of the suspicious document, where there has not yet been found any textual similarity.\r
 144 %It was increased especially by the paragraph based queries.\r
 145 And secondly, after detection a similarity for a certain part of the text,\r
 146 no more intentionally retrieval attempts for that part were effectuated. Meaning that all\r
 147 discovered search engine results were evaluated, but there were executed no more queries regarding that passage.\r
 148 \r
 149 \r
 150 \subsection{Result Selection}~\label{resSelection}\r
 151 The second main decisive area about source retrieval task is to decide which from the search engine results to download.\r
 152 This process is represented in Figure~\ref{fig:source_retr_process} as Selecting. \r
 153 %Nowadays in the real-world a download is cheap and quick operation.\r
 154 %There can be some disk space considerations if there is a need to store original downloaded documents.\r
 155 % The main cost probably represents document post processing. \r
 156 %Mainly on the Internet there is a wide range of file formats, which for text alignment must be\r
 157 %converted into plaintext. This can be time and computational-consuming.\r
 158 %For example from many pdf documents the plain text is hardly extractable, thus one need to use optical character recognition methods.\r
 159 \r
 160 The ChatNoir offers snippets for discovered documents. The snippet generation is considered costless\r
 161 operation. The snippet purpose is to have a quick glance at a small extract of resulting page.\r
 162 The extract is maximally 500 characters long and it is a portion of the document around given keywords.\r
 163 On the basis of snippet, we needed to decide whether to actually download the result or not.\r
 164 \r
 165 \subsection{Snippet Control}\r
 166 \begin{figure}\r
 167   \centering\r
 168   \includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}\r
 169   \caption{Downloads and similarities performance.}\r
 170   \label{fig:snippet_graph}\r
 171 \end{figure}\r
 172 \r
 173 Since the snippet is relatively small and it can be discontinuous part of the text, the \r
 174 text alignment methods described in section~\ref{text_alignment} were insufficient \r
 175 in decision making over document download. Therefore we chose to compare existence of snippet word tuples\r
 176 in the suspicious document. \r
 177 %For 1-tuples the measure means how many words from the snippet\r
 178 %also exist in the suspicious document. If the snippet contains many common words they may\r
 179 %also occur in many documents. In this case the 1-tuple measurement is little decisive. \r
 180 \r
 181 We used 2-tuples measurement, which indicates how many neighbouring word pairs coexist in the snippet and in the suspicious document.\r
 182 We decided according to this value whether to download the source or not. For the deduction \r
 183  of the threshold value we used 4413 search results from various queries according to documents \r
 184  in the training corpus. Each resulting document was textually aligned to its corresponding suspicious document.\r
 185 %One similarity represents continuous passage of text alignment similarity as is described in the following section~\ref{text_alignment}.\r
 186 In this way we calculated 248 similarities in total after downloading all of the 4431 documents.\r
 187 \r
 188 The 2-tuples similarity performance is depicted in Figure~\ref{fig:snippet_graph}.\r
 189 Horizontal axis represents threshold of the 2-tuples similarity percentage between the snippet and the suspicious document.\r
 190 The graph curves represent obtained resource percentage according to the snippet similarity threshold.\r
 191 A profitable threshold is the one with the largest distance between those two curves.\r
 192 We chose threshold of the snippet similarity to 20\%, which in the graph corresponds to 20\% of all\r
 193 downloads and simultaneously with 70\% discovered similarities.\r
 194  \r
 195 \subsection{Source Retrieval Results}\r
 196 In PAN 2013 Source Retrieval subtask we competed with other 8 teams. \r
 197 There can not be selected the best approach because there were several independent\r
 198 performance measures. Possibly each approach has its pros and cons and many approaches\r
 199 are usable in different situations. \r
 200 \r
 201 We believe that in the realistic plagiarism detection the most important is to keep the number of\r
 202 queries low and simultaneously maximize recall. \r
 203 % It is often some tradeoff between cost and efectivness.\r
 204 It is also advisable to keep the number of downloads low, but on the other hand,\r
 205 it is relatively cheep and easily scalable operation.\r
 206 \r
 207 Our approach had the second best ratio of recall to the number of used queries, which\r
 208 tells about the query efficacy. The approach with the best ratio used few queries (4.9 queries per document which\r
 209 was 0.4 of the amount we used), but also obtained the lowest recall (0.65 of our recall).\r
 210 The approach with the highest recall (and also lowest precision) achieved 2.8 times higher recall with 3.9 times more queries compared to ours.\r
 211 \r
 212 Our approach achieved also low precision, which means we reported many more results and they\r
 213 were not considered as correct hits. On the other hand each reported result contained some\r
 214 textual similarity according to text alignment subtask score, which we believe is still worthwhile to report.\r