simon-searchengine.tex

   1 \section{Candidate document retrieval}~\label{simon}
   2 The basic concept of candidate document retrieval is to use a web search
   3 engine to find suitable documents. In PAN 12 competition we used ChatNoir search
   4 engine~\cite{chatnoir} which indexes the entire English part of the ClueWeb09 corpus.
   5 We can now reduce the problem to constructing appropriate queries for ChatNoir search engine.
   6 The goal is to retrieve similar documents, which contains the same passages as the source document
   7 or similar documents dealing with the same theme.
   8 The strongest emphasis has been places on minimizing the number of executed queries
   9 since they may be charged or time limited or number limited in the real world.
  10
  11 In the PAN 12 test corpus, there were 32 text documents all contained at least one plagiarism case.
  12 The documents were approximately 30 KB of size, the smallest were 18 KB and the largest were 44 KB.
  13 We have used methods which are applicable in general to any type of text input with no apriori information about the input document.
  14
  15 All queries from a given document were firstly preconstructed and they were afterwards sequentially executed
  16 according to their priority. After each executed query the intermediate results of plagiarism
  17 were calculated using the multi feature document comparison techniques described in section~\ref{yenya}.
  18 This process leads to continuing lowering the number of executed queries, for more details see the following passages.
  19
  20
  21 We can divide constructed queries into three main groups according to their origin:
  22  i) document keywords based, ii) intrinsic plagiarism based, and iii) headers based.
  23 %Please note two major attributes of the extracted keywords. Firstly we say
  24
  25 \subsection{Keywords based queries}
  26 The document keywords based queries are constructed from automatically extracted keywords from the given document.
  27 As for keyword extraction we used methodology based on word repetition similar to method described by Scott~\cite{text_patterns}.
  28 The basic principle claims that the more frequent a single word in a given document is, the more likely it is to be key in it.
  29 However the most frequent words usually do not carry any meaning.
  30 For example articles (the), conjunctions (as) or prepositions (at).
  31 These types of words are usually refered as the stop words.
  32 Thus we firstly filter out stop words according to general English stop words list.
  33 Secondly  we used reference corpus word list as an additional filter.
  34 We used words in their simple verbatim occurrence allied to a
  35 statistical estimate of likelihood. As a result of this a word to become a key
  36 must not exists in the stop list and must occur in the suspicious document relatively much more frequently
  37 than in the reference corpus. We used a corpus created from common English web sources, which contains
  38 more then 4 billion tokens~\cite{ententen}.
  39 The statistical measure TF-IDF was applied in order to decide whether the processed word shall become a keyword.
  40 The TF-IDF combines term frequency (TF) and inverse document frequency (IDF) to produce composite weight of a given term.
  41 Taking into account the average length of the suspicious document and minimizing the number of used queries we set TF-IDF weight threshold to 0.02,
  42 which resulted in about 10 to 20 terms identified as keywords per document.
  43
  44 Before executing web search tasks we must combine extracted keywords into befitting queries.
  45 All queries were constructed from 5 keywords according to their TF-IDF weight consecutively.
  46 In other words 5 top ranked keywords combine into the first query the other 5 top ranked keywords combine into the second query.
  47 By using this method we usually obtained from 2 to 3 initial queries per document.
  48
  49 \subsubsection{Collocations}
  50 In addition to this we enhanced subsequent queries by the most frequent collocations of top 4 keywords.
  51 For those 4 keywords we extracted the most frequent two-word and three-word collocation occurring within the single suspicious document.
  52 A scope for collocates was a single sentence. We counted collocates as the most frequent neighbouring words within the scope also omitting stop words.
  53 To take advantage of extracted collocations we combined them among selfs also into 5 word queries.
  54 After omitting queries containing duplicated words we added, on average, two new queries to each suspicious document.
  55 Despite of being composed from collocations we count those queries also as the keywords based. Together with
  56 the queries created only from bare keywords, those queries were unconditionally executed as the initial queries. \\
  57
  58 Queries containing around 5 words should be optimal in order to retrieve highest mean average precision of results.
  59 It means we would like to extract from results of those queries as many documents as possible but still dealing
  60 with the similar theme as the source document. Expecting the order of words within a single query has small
  61 impact on results, no other heuristics of combining keywords into queries were applied also
  62 in order to keep the query amount minimal. In 32 from 33 input test documents were more than a hundred resulting documents found using
  63 those initial queries, which gave a decent document base covering the source document theme for document comparison.
  64
  65
  66
  67 %using less kw in q would result into more general or underspecifi
  68
  69 All those previously mentioned queries characterize the document as a whole.
  70 They should cover the theme of the document expecting the single document is actually dealing with a one theme.
  71
  72 \subsection{Intrinsic plagiarism based queries}
  73 Intrinsic plagiarism queries were constructed from a representative sentence residing in
  74 the selected part of the document. Representative sentence is any sentence from the selected part of the document
  75 which is greater then eight words in length.
  76 Such a sentence should produce the least searching false-positives~\cite{knight}.
  77
  78 The main task concerning representative sentences is to locate the suitable text part.
  79 This part should be located by a method leading to discover potential plagiarism inside the document of its own,
  80 which is called intrinsic plagiarism detection. Such a method is based on for example text statistics or syntactic features.
  81 We chose statistical vocabulary richness method called average word frequency class described in~\cite{awfc}.
  82 Using this method we can compute average word frequency class values (AWFC) for arbitrary parts of the text. The change
  83 of this value between two adjacent text windows indicates change of writing style, which can be caused by a plagiarism.
  84 We computed AWFC value for every text window of size 45 words, which was shifted through the text by 40 words span.
  85 The values of window size and span of shifting was set empirically during training. Windows with
  86 largest change of AWFC compared to AWFC of their neighbouring previous window were selected as candidate for query extraction.
  87
  88
  89 The query from the representative sentence were composed by 6 words from the sentence beginning, also omitting stop words.
  90 We chose 6 words to cover higher portion of selected sentence and not to exceed
  91 the phrasal ChatNoir query limit.
  92 This type of query is actually looking for documents containing the very same or at least the very similar sentence,
  93 therefore we choose more than 5 word queries in order to narrow the search to more specific
  94 results. Unfortunately in the time of PAN 12 competition the ChatNoir search engine did not
  95 support phrasal search. Since querying a single sentence is phrasal search, one might have
  96 expected even better results from those queries.
  97  %Due to the fact, that ChatNoir does not currently support phrasal search,
  98
  99 The intrinsic plagiarism based queries are document positionable, meaning that we can store
 100 their position within the document. It is due to the fact, that
 101 those queries are created from more or less subsequent words. As a result of that they are also executed conditionally, for more
 102 information see section~\ref{process}.
 103 The position within the document were set to the
 104 start position of the representative sentence.
 105 They characterize only a specific portion of the suspicious document on the contrary to keyword based queries.
 106
 107 \subsection{Headers based queries}
 108 The last type of queries are queries constructed from local headers of document.
 109 The header usually characterize briefly and accurately following section, thus it
 110 can be used as a search query. We considered headers only from 2 to 6 words in length.
 111
 112
 113 \subsection{Input document processing}~\label{process}
 114
 115 \subsection{Queries comparison}
 116