simon-searchengine.tex

   1 \section{Candidate document retrieval}~\label{simon}
   2 The basic concept of candidate document retrieval problem is to use a Web search
   3 engine to find suitable documents. In PAN 2012 competition we used ChatNoir search
   4 engine~\cite{chatnoir} which indexes the entire English part of the ClueWeb09~\footnote{\url{http://boston.lti.cs.cmu.edu/clueweb09/wiki/}} corpus.
   5 We can now reduce the problem to construct appropriate queries for ChatNoir search engine.
   6 The goal is to retrieve similar documents, which contain the same passages as the source document
   7 or similar documents dealing with the same theme.
   8 The strongest emphasis has been placed on minimizing the number of executed queries
   9 since they may be charged, time limited or number limited in the real-world.
  10 For instance if we find a similarity for certain part of the document, we do not search
  11 for more similarities in that part.
  12 %All queries from a given document were firstly preconstructed and they were afterwards sequentially executed
  13 %according to their priority. After each executed query the intermediate results of plagiarism
  14 %were calculated using the multi feature document comparison techniques described in section~\ref{yenya}.
  15 %This process leads to continuing lowering the number of executed queries, for more details see the following passages.
  16 We can divide constructed queries into three main groups according to their origin:
  17  i) document keywords based, ii) intrinsic plagiarism based, and iii) headers based.
  18 %Please note two major attributes of the extracted keywords. Firstly we say
  19
  20 \subsection{Keywords based queries}
  21 The keywords based queries are constructed from automatically extracted keywords from the given document.
  22 As for keyword extraction we used methodology based on word repetition similar to method described by Scott~\cite{text_patterns}.
  23 The basic principle claims that the more frequent a single word in a given document is, the more likely it is to be key in it.
  24 However the most frequent words usually do not carry any meaning.
  25 For example articles (the), conjunctions (as) or prepositions (at).
  26 These types of words are usually refered as the stop words.
  27 Therefore we firstly filtered out the stop words according to a general English stop words list.
  28 Secondly  we used a reference corpus word list as an additional filter.
  29 We used words in their simple verbatim occurrence allied to a
  30 statistical estimate of likelihood. As a result of this, a word to become a key
  31 must not exist in the stop list and must occur in the suspicious document relatively much more frequently
  32 than in the reference corpus. We used the corpus created from common English Web sources, which contains
  33 more then 4 billion tokens~\cite{ententen}.
  34 The statistical measure TF-IDF~\cite{Introduction_to_information_retrieval}
  35 was applied in order to decide whether the processed word shall become a keyword.
  36 The TF-IDF combines term frequency (TF) and inverse document frequency (IDF) to produce composite weight of a given term.
  37 Taking into account the average length of the suspicious document and minimizing the number of used queries we set the TF-IDF weight threshold to 0.02,
  38 which resulted in about 10 to 20 terms identified as keywords per document.
  39
  40 Before executing Web search tasks we must combine extracted keywords into befitting queries.
  41 All keyword based queries were constructed from 5 keywords according to their TF-IDF weight consecutively.
  42 In other words 5 top ranked keywords combine into the first query, the other 5 top ranked keywords combine into the second query.
  43 By using this method we usually obtained from 2 to 3 initial queries per document.
  44
  45 \subsubsection{Collocations}
  46 In addition to this we enhanced subsequent queries by the most frequent collocations of top 4 keywords.
  47 For those 4 keywords we extracted the most frequent two-word and three-word collocation occurring within the single suspicious document.
  48 A scope for finding collocates was a single sentence. We counted collocates as the most frequent neighbouring words
  49 within the scope also omitting stop words.
  50 To take advantage of extracted collocations we combined them among selfs also into 5 word queries.
  51 After omitting queries containing duplicated words we added, on average, two new queries to each suspicious document.
  52 Despite of being composed from collocations we count them also as the keywords based queries. Together with
  53 the queries created only from bare keywords, those queries were unconditionally executed as the initial queries. \\
  54
  55 Queries containing around 5 words should be optimal in order to retrieve highest mean average precision of results.
  56 It means that we would like to extract as many results as possible which will still be dealing with the same theme as the source document.
  57 Expecting the order of words within a single query has small
  58 impact on results, no other heuristics of combining keywords into queries were applied also
  59 in order to keep the query amount minimal. In 32 from 33 input test documents there were more than a hundred resulting documents found using
  60 the initial queries, which gave a decent document base covering the source document theme for document comparison.
  61
  62
  63
  64 %using less kw in q would result into more general or underspecifi
  65
  66 All those previously mentioned queries characterize the document as a whole.
  67 They should cover the theme of the document expecting the single document is actually dealing with one theme.
  68
  69 \subsection{Intrinsic plagiarism based queries}
  70 Intrinsic plagiarism queries were constructed from a representative sentence residing in
  71 the selected part of the document. The representative sentence is any sentence
  72 which is greater then eight words in length.
  73 Such a sentence should produce the least searching false-positives~\cite{knight}.
  74
  75 The main task concerning representative sentences is to locate the suitable text part.
  76 This part should be located by a method leading to discover potential plagiarism inside the document of its own,
  77 which is called intrinsic plagiarism detection. Such a method is based on e.g. text statistics or syntactic features.
  78 We chose statistical vocabulary richness method called average word frequency class described in~\cite{awfc}.
  79 By using this method we can compute average word frequency class values (AWFC) for arbitrary parts of the text. The change
  80 of this value between two adjacent text windows indicates change of the writing style, which can be caused by a plagiarism.
  81 We computed AWFC value for every text window of 45 words size, which was shifted through the text by 40 words span.
  82 The both values were set empirically during training. Windows with
  83 largest change of AWFC compared to AWFC of their neighbouring previous window were selected as a candidate for query extraction.
  84
  85
  86 The query from the representative sentence was composed by 6 words from the beginning of the sentence, also omitting stop words.
  87 We chose 6 words to make more specific queries and not to exceed
  88 the phrasal ChatNoir query limit.
  89 %This type of query is actually looking for documents containing the very same or at least the very similar sentence,
  90 %therefore we chose more than 5 word queries in order to narrow the search to more specific results.
  91 Unfortunately in the time of PAN 2012 competition the ChatNoir search engine did not
  92 support phrasal search. Since querying a single sentence is a phrasal search, one might have
  93 expected to achieve even better results from those queries.
  94  %Due to the fact, that ChatNoir does not currently support phrasal search,
  95
  96 The intrinsic plagiarism based queries are document positionable, meaning that we can store
  97 their position within the document. It is due to the fact that
  98 those queries are created from more or less subsequent words. As a result of that, they are executed conditionally, for more
  99 information see Section~\ref{process}.
 100 They characterize only a specific portion of the suspicious document on the contrary to keyword based queries.
 101 The positions within the document were set to the
 102 start positions of the representative sentences.
 103
 104
 105 \subsection{Headers based queries}
 106 The last type of queries were constructed from local headers of the document.
 107 A header usually characterizes briefly and accurately the following section, thus it
 108 can be used as a search query. In real scenarios the headers should
 109 be located using some document metadata. In that way we can distinguish for example more header levels.
 110 In the case of PAN 2012 competition we constructed headers based queries according to several
 111 basic rules: i) the query is created from any line which has an empty line above and bellow,
 112 ii) it is from 2 to 6 words in length also omitting stop words.
 113 These basic rules applied on most headers from the given format of test corpus text files.
 114
 115 Note that the length of headers based queries are variable, thus they can be both more specific
 116 and more general.% according to their length.
 117 They are also document positionable.
 118 We calculated their position as the start position of the following text.
 119
 120 \subsection{Combining and executing queries}~\label{process}
 121
 122 All queries from a given document were firstly preconstructed and they were afterwards sequentially executed
 123 according to their priority.
 124 The first all keyword based queries, the second
 125 intrinsic plagiarism based and the last headers based.
 126 During the process we could omit some of the positionable queries, purposing to
 127 lower total number of executed queries.
 128 The condition for their omitting is described further in this section.
 129
 130
 131 After each executed query the intermediate results of plagiarism
 132 were calculated using the multi feature document comparison techniques described in Section~\ref{yenya}.
 133 The queries processing is outlined as Algorithm~\ref{alg:queries}, where for simplicity the snippet calculation and a Web source download is omitted.
 134 After executing a query the intermediate results were always
 135 calculated as follows:
 136
 137 For every query result in result page based
 138 on a snippet to input suspicious document similarities, we decided whether to actually
 139 download the resulting URL or not. Since the snippet contains portions of the Web document to each
 140 given keyword, we calculated pre-similarity as apportionment of every word match between the snipped and the suspicious document.
 141 If there were at least 80\% match, we downloaded the Web source for thorough investigation.
 142
 143 For each downloaded Web document we calculated similarities with the input suspicious document
 144 as a document pair described in Section~\ref{yenya}.
 145 All similarities were stored in form of intervals from within
 146 the input suspicious document. In other words for every similar part between the downloaded
 147 document and the suspicious document we stored the beginning position and the ending position of that part in
 148 the suspicious document.
 149
 150 As a result of that, we can during processing further queries, which haven't been executed yet, omit those
 151 of which document position intersects with any of already found similarities intervals.
 152 %This procedure leads us to lowering the total number of executed queries.
 153 All downloaded documents, which have at least one similarity interval, are declared
 154 as possible source of the plagiarism.
 155
 156
 157 \renewcommand{\algorithmicrequire}{\textbf{Input:}}
 158 \renewcommand{\algorithmicensure}{\textbf{Output:}}
 159
 160
 161 \algsetup{indent=2em}
 162 \begin{algorithm}[h!]
 163 \caption{Queries execution}\label{alg:queries}
 164 \begin{algorithmic}[1]
 165 \REQUIRE suspicious document $d$, queries $Q_{KW}, Q_{Intrinsic}, Q_{Headers}$
 166 %keyword, intrinsic and headers based queries $Q_{KW}, Q_{Intrinsic}, Q_{Headers}$,
 167
 168 \ENSURE plagiarism source candidate Web documents $W_{plag}$
 169 \medskip
 170
 171 \STATE $I \leftarrow \emptyset; I \in \mathbb{N}^2$
 172 \STATE $W \leftarrow \emptyset$
 173
 174 \FORALL{$q \in (Q_{KW} \cup Q_{Intrinsic} \cup Q_{Headers}) $}
 175   \IF{$position(q)$ do not intersects with any interval $ \vec{i} \in I$}
 176   \STATE $W \leftarrow execute(q)$
 177     \FORALL{$w \in W$}
 178     \STATE $I_{sub} \leftarrow get\_similarities(w, d)$ %\COMMENT{output is set of intervals similarities $[d_{start},d_{end}]$}
 179     \STATE $I \leftarrow I \cup I_{sub}$
 180       \IF{$I_{sub} \neq \emptyset$}
 181          \STATE $W_{plag} \leftarrow W_{plag} \cup \{w\}$
 182        \ENDIF
 183     \ENDFOR
 184   \ENDIF
 185 \ENDFOR
 186 \RETURN $W_{plag}$
 187
 188 \end{algorithmic}
 189 \end{algorithm}
 190
 191 \subsection{Queries comparison}~\label{comparison}
 192 During the test phase there were extracted 133 keywords based queries, 165 intrinsic plagiarism
 193 based queries and 331 headers based queries from the test corpus in total. Table~\ref{querycount} compares
 194 results according to query types.
 195 \begin{center}
 196 \begin{table}[h]
 197 \begin{center}
 198 \begin{tabular}{|c|c|c|c|}
 199 \hline
 200 {\bf Query type} & {\bf Extracted}& {\bf Omitted}  & {\bf Similarities portion }\\ \hline \hline
 201 KW        & 4.16  & N/A  &  72.5\% \\ \hline
 202 Intrinsic & 5.16  & 2.35  & 24.3\% \\ \hline
 203 Headers   & 10.34 & 4.75  & 3.2\% \\ \hline
 204 \end{tabular}\\
 205 \end{center}
 206 \vspace{0.3 cm}
 207 \caption{\footnotesize Queries type comparison.}
 208 \label{querycount}
 209 \end{table}
 210 \end{center}
 211 The second and the third column
 212 represents the mean of the query count and the omitted query count per document. The fourth
 213 column shows total portion of similarities found, taking into account the number of similarities regardless of interval sizes.
 214  We can see that nearly half of the prepared queries were
 215 omitted due to the fact, that there had been found a similarity covering their document position.
 216 We can also see that there were detected about 5 cases
 217 of potential plagiarism on average, by means of used AWFC intrinsic plagiarism detection method.
 218 Table~\ref{querycount} also shows keyword based queries as the most successful and
 219 headers based queries as the least successful. Despite the fact, that they were greatest
 220 in number they ended with only more than a 3\% of total similarities found. Nevertheless, please
 221 note that the headers based queries were executed as the last, thus they were used only for
 222 finding undiscovered potential similarities. In order to really compare the query type performance, we
 223 would need to execute and evaluate them separately.
 224
 225 To conclude this section we can say, that all types of queries were more or less successful. The headers based
 226 were executed last and in the process they were the least successful. The interesting
 227  finding is the fact, that we can even greatly lower the number of executed queries.
 228 By omitting all of headers based queries we could lover the total number of executed queries by 45\% with only
 229 3.2\% of recall lost.
 230 %\begin{center}
 231
 232
 233
 234 \begin{table}[h]
 235 \begin{center}
 236 { \scriptsize
 237 \begin{tabular}{l c c c c c c c c c c c }
 238 \hline
 239 & \multicolumn{2}{c}{\strut \bf Total workload} & \multicolumn{2}{c}{\bf Time to 1st Result}&\multirow{2}{*}{\parbox{0.6cm}{\bf No \\ result}} &
 240 \multicolumn{2}{c}{\bf Reported Srcs.} & \multicolumn{2}{c}{\bf Downloaded Srcs.} &
 241  \multicolumn{2}{c}{\bf Retrieved Srcs.} \\
 242
 243 {\bf Team \strut}&{\bf Queries}&{\bf  Downloads}&{\bf Queries}&{\bf Downloads}
 244 & & {\bf Prec.}&{\bf Recall}&{\bf Prec.}&{\bf Recall}&{\bf Prec.}&{\bf Recall}\\ \hline \hline
 245
 246 \parbox{2,3cm}{\strut Gillam et al. \\ University of Surrey, UK \strut} & 63.44 & 527.41 & 4.47 &
 247  25.88 & {\bf 1} & 0.6266 & 0.2493 & 0.0182 & {\bf 0.5567} & 0.0182 & {\bf 0.5567} \\ %\hline
 248
 249 \parbox{2,3cm}{\strut Jayapal \\ University of Sheffield, UK  \strut}&  67.06 & 173.47  & 8.78  & 13.50 &
 250  1 & {\bf 0.6582} & {\bf 0.2775} & 0.0709 & 0.4342 & {\bf 0.0698} & 0.4342 \\ %\hline
 251
 252  \parbox{2,3cm}{\strut Kong Leilei \\ Heilongjiang Institute of Technology,\\ China \strut } & 551.06   &
 253  326.66 & 80.59 & 27.47 & 2 & 0.5720 & 0.2351 & 0.0178 & 0.3742 & 0.0141 & 0.3788 \\ %\hline
 254
 255 \parbox{2,3cm}{\strut Palkovskii et al. \\ Zhytomyr State University, Ukraine \strut} & 63.13 &
 256 1026.72 & 27.28 & 318.94 & 6 & 0.4349 & 0.1203 & 0.0025 & 0.2133 & 0.0024 & 0.2133 \\ %\hline
 257
 258 \parbox{2,3cm}{\strut \bf our approach \strut }& {\bf 12.56} & {\bf 95.41} & {\bf 1.53} & {\bf 6.28} & 2 & 0.5177 & 0.2087 & {\bf 0.0813} &
 259 0.3513 & 0.0094 & 0.4519 \\ \hline
 260 \end{tabular}\\}
 261 \end{center}
 262 \vspace{0.3 cm}
 263 \caption{\footnotesize PAN 2012 candidate document retrieval results.}
 264 \label{candidateDocsResults}
 265 \end{table}
 266 %\end{center}
 267
 268 Table~\ref{candidateDocsResults} shows results of PAN 2012 candidate document retrieval
 269 task as averages over the all 32 documents from the test corpus. Our approach led
 270 to obtain decent retrieval performance with the minimal total workload and minimal time to 1st result.
 271 Also the 80\% word match threshold in Web snippet appears to be suitable, since we
 272 also achieved the highest precision among downloaded sources.
 273
 274
 275
 276