yenya: aplikovany pripominky od Simona

[pan12-paper.git] / simon-searchengine.tex
diff --git a/simon-searchengine.tex b/simon-searchengine.tex

old mode 100644 (file)

new mode 100755 (executable)

index 0ce5371..12a7777
--- a/simon-searchengine.tex
+++ b/simon-searchengine.tex
@@ -1,4 +1,276 @@
-\section{Search-Engine Queries}
+\section{Candidate document retrieval}~\label{simon}
+The basic concept of candidate document retrieval problem is to use a Web search
+engine to find suitable documents. In PAN 2012 competition we used ChatNoir search
+engine~\cite{chatnoir} which indexes the entire English part of the ClueWeb09~\footnote{\url{http://boston.lti.cs.cmu.edu/clueweb09/wiki/}} corpus.
+We can now reduce the problem to construct appropriate queries for ChatNoir search engine.
+The goal is to retrieve similar documents, which contain the same passages as the source document
+or similar documents dealing with the same theme.
+The strongest emphasis has been placed on minimizing the number of executed queries
+since they may be charged, time limited or number limited in the real-world. 
+For instance if we find a similarity for certain part of the document, we do not search
+for more similarities in that part.
+%All queries from a given document were firstly preconstructed and they were afterwards sequentially executed
+%according to their priority. After each executed query the intermediate results of plagiarism 
+%were calculated using the multi feature document comparison techniques described in section~\ref{yenya}.
+%This process leads to continuing lowering the number of executed queries, for more details see the following passages.
+We can divide constructed queries into three main groups according to their origin:
+ i) document keywords based, ii) intrinsic plagiarism based, and iii) headers based.
+%Please note two major attributes of the extracted keywords. Firstly we say   
+ 
+\subsection{Keywords based queries}
+The keywords based queries are constructed from automatically extracted keywords from the given document.
+As for keyword extraction we used methodology based on word repetition similar to method described by Scott~\cite{text_patterns}.
+The basic principle claims that the more frequent a single word in a given document is, the more likely it is to be key in it.
+However the most frequent words usually do not carry any meaning.
+For example articles (the), conjunctions (as) or prepositions (at).
+These types of words are usually refered as the stop words. 
+Therefore we firstly filtered out the stop words according to a general English stop words list.
+Secondly  we used a reference corpus word list as an additional filter.
+We used words in their simple verbatim occurrence allied to a
+statistical estimate of likelihood. As a result of this, a word to become a key
+must not exist in the stop list and must occur in the suspicious document relatively much more frequently
+than in the reference corpus. We used the corpus created from common English Web sources, which contains
+more then 4 billion tokens~\cite{ententen}.
+The statistical measure TF-IDF~\cite{Introduction_to_information_retrieval}
+was applied in order to decide whether the processed word shall become a keyword.
+The TF-IDF combines term frequency (TF) and inverse document frequency (IDF) to produce composite weight of a given term.
+Taking into account the average length of the suspicious document and minimizing the number of used queries we set the TF-IDF weight threshold to 0.02,
+which resulted in about 10 to 20 terms identified as keywords per document.
  
-% Tohle napise Simon
+Before executing Web search tasks we must combine extracted keywords into befitting queries.
+All keyword based queries were constructed from 5 keywords according to their TF-IDF weight consecutively.
+In other words 5 top ranked keywords combine into the first query, the other 5 top ranked keywords combine into the second query.
+By using this method we usually obtained from 2 to 3 initial queries per document.
  
+\subsubsection{Collocations}
+In addition to this we enhanced subsequent queries by the most frequent collocations of top 4 keywords.
+For those 4 keywords we extracted the most frequent two-word and three-word collocation occurring within the single suspicious document.
+A scope for finding collocates was a single sentence. We counted collocates as the most frequent neighbouring words
+within the scope also omitting stop words.
+To take advantage of extracted collocations we combined them among selfs also into 5 word queries.
+After omitting queries containing duplicated words we added, on average, two new queries to each suspicious document.
+Despite of being composed from collocations we count them also as the keywords based queries. Together with
+the queries created only from bare keywords, those queries were unconditionally executed as the initial queries. \\
+
+Queries containing around 5 words should be optimal in order to retrieve highest mean average precision of results.
+It means that we would like to extract as many results as possible which will still be dealing with the same theme as the source document. 
+Expecting the order of words within a single query has small
+impact on results, no other heuristics of combining keywords into queries were applied also
+in order to keep the query amount minimal. In 32 from 33 input test documents there were more than a hundred resulting documents found using
+the initial queries, which gave a decent document base covering the source document theme for document comparison. 
+
+
+
+%using less kw in q would result into more general or underspecifi   
+  
+All those previously mentioned queries characterize the document as a whole.
+They should cover the theme of the document expecting the single document is actually dealing with one theme. 
+
+\subsection{Intrinsic plagiarism based queries}
+Intrinsic plagiarism queries were constructed from a representative sentence residing in
+the selected part of the document. The representative sentence is any sentence 
+which is greater then eight words in length.
+Such a sentence should produce the least searching false-positives~\cite{knight}.
+
+The main task concerning representative sentences is to locate the suitable text part.
+This part should be located by a method leading to discover potential plagiarism inside the document of its own,
+which is called intrinsic plagiarism detection. Such a method is based on e.g. text statistics or syntactic features.
+We chose statistical vocabulary richness method called average word frequency class described in~\cite{awfc}.
+By using this method we can compute average word frequency class values (AWFC) for arbitrary parts of the text. The change
+of this value between two adjacent text windows indicates change of the writing style, which can be caused by a plagiarism.
+We computed AWFC value for every text window of 45 words size, which was shifted through the text by 40 words span.
+The both values were set empirically during training. Windows with
+largest change of AWFC compared to AWFC of their neighbouring previous window were selected as a candidate for query extraction.
+
+
+The query from the representative sentence was composed by 6 words from the beginning of the sentence, also omitting stop words.
+We chose 6 words to make more specific queries and not to exceed 
+the phrasal ChatNoir query limit.
+%This type of query is actually looking for documents containing the very same or at least the very similar sentence,
+%therefore we chose more than 5 word queries in order to narrow the search to more specific results. 
+Unfortunately in the time of PAN 2012 competition the ChatNoir search engine did not
+support phrasal search. Since querying a single sentence is a phrasal search, one might have
+expected to achieve even better results from those queries.
+ %Due to the fact, that ChatNoir does not currently support phrasal search,
+ 
+The intrinsic plagiarism based queries are document positionable, meaning that we can store
+their position within the document. It is due to the fact that 
+those queries are created from more or less subsequent words. As a result of that, they are executed conditionally, for more
+information see Section~\ref{process}.
+They characterize only a specific portion of the suspicious document on the contrary to keyword based queries.
+The positions within the document were set to the
+start positions of the representative sentences. 
+
+
+\subsection{Headers based queries}
+The last type of queries were constructed from local headers of the document.
+A header usually characterizes briefly and accurately the following section, thus it
+can be used as a search query. In real scenarios the headers should
+be located using some document metadata. In that way we can distinguish for example more header levels.
+In the case of PAN 2012 competition we constructed headers based queries according to several 
+basic rules: i) the query is created from any line which has an empty line above and bellow,
+ii) it is from 2 to 6 words in length also omitting stop words.
+These basic rules applied on most headers from the given format of test corpus text files.
+
+Note that the length of headers based queries are variable, thus they can be both more specific 
+and more general.% according to their length.
+They are also document positionable. 
+We calculated their position as the start position of the following text.  
+
+\subsection{Combining and executing queries}~\label{process}
+
+All queries from a given document were firstly preconstructed and they were afterwards sequentially executed
+according to their priority.
+The first all keyword based queries, the second
+intrinsic plagiarism based and the last headers based.
+During the process we could omit some of the positionable queries, purposing to 
+lower total number of executed queries.
+The condition for their omitting is described further in this section.
+
+ 
+After each executed query the intermediate results of plagiarism 
+were calculated using the multi feature document comparison techniques described in Section~\ref{yenya}.
+The queries processing is outlined as Algorithm~\ref{alg:queries}, where for simplicity the snippet calculation and a Web source download is omitted.
+After executing a query the intermediate results were always  
+calculated as follows: 
+
+For every query result in result page based
+on a snippet to input suspicious document similarities, we decided whether to actually
+download the resulting URL or not. Since the snippet contains portions of the Web document to each
+given keyword, we calculated pre-similarity as apportionment of every word match between the snipped and the suspicious document.
+If there were at least 80\% match, we downloaded the Web source for thorough investigation.
+
+For each downloaded Web document we calculated similarities with the input suspicious document
+as a document pair described in Section~\ref{yenya}.
+All similarities were stored in form of intervals from within
+the input suspicious document. In other words for every similar part between the downloaded 
+document and the suspicious document we stored the beginning position and the ending position of that part in
+the suspicious document.
+
+As a result of that, we can during processing further queries, which haven't been executed yet, omit those
+of which document position intersects with any of already found similarities intervals. 
+%This procedure leads us to lowering the total number of executed queries.
+All downloaded documents, which have at least one similarity interval, are declared
+as possible source of the plagiarism.  
+
+
+\renewcommand{\algorithmicrequire}{\textbf{Input:}}
+\renewcommand{\algorithmicensure}{\textbf{Output:}}
+
+
+\algsetup{indent=2em}
+\begin{algorithm}[h!]
+\caption{Queries execution}\label{alg:queries}
+\begin{algorithmic}[1]
+\REQUIRE suspicious document $d$, queries $Q_{KW}, Q_{Intrinsic}, Q_{Headers}$
+%keyword, intrinsic and headers based queries $Q_{KW}, Q_{Intrinsic}, Q_{Headers}$,
+
+\ENSURE plagiarism source candidate Web documents $W_{plag}$
+\medskip
+
+\STATE $I \leftarrow \emptyset; I \in \mathbb{N}^2$
+\STATE $W \leftarrow \emptyset$
+
+\FORALL{$q \in (Q_{KW} \cup Q_{Intrinsic} \cup Q_{Headers}) $}
+  \IF{$position(q)$ do not intersects with any interval $ \vec{i} \in I$}
+  \STATE $W \leftarrow execute(q)$
+    \FORALL{$w \in W$}
+    \STATE $I_{sub} \leftarrow get\_similarities(w, d)$ %\COMMENT{output is set of intervals similarities $[d_{start},d_{end}]$}
+    \STATE $I \leftarrow I \cup I_{sub}$
+      \IF{$I_{sub} \neq \emptyset$}
+         \STATE $W_{plag} \leftarrow W_{plag} \cup \{w\}$
+       \ENDIF
+    \ENDFOR
+  \ENDIF
+\ENDFOR
+\RETURN $W_{plag}$
+
+\end{algorithmic}
+\end{algorithm}
+
+\subsection{Queries comparison}~\label{comparison}
+During the test phase there were extracted 133 keywords based queries, 165 intrinsic plagiarism
+based queries and 331 headers based queries from the test corpus in total. Table~\ref{querycount} compares
+results according to query types.  
+\begin{center}
+\begin{table}[h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|}
+\hline
+{\bf Query type} & {\bf Extracted}& {\bf Omitted}  & {\bf Similarities portion }\\ \hline \hline
+KW        & 4.16  & N/A  &  72.5\% \\ \hline 
+Intrinsic & 5.16  & 2.35  & 24.3\% \\ \hline 
+Headers   & 10.34 & 4.75  & 3.2\% \\ \hline 
+\end{tabular}\\
+\end{center}
+\vspace{0.3 cm}
+\caption{\footnotesize Queries type comparison.}
+\label{querycount}
+\end{table}
+\end{center}
+The second and the third column
+represents the mean of the query count and the omitted query count per document. The fourth
+column shows total portion of similarities found, taking into account the number of similarities regardless of interval sizes.
+ We can see that nearly half of the prepared queries were
+omitted due to the fact, that there had been found a similarity covering their document position. 
+We can also see that there were detected about 5 cases 
+of potential plagiarism on average, by means of used AWFC intrinsic plagiarism detection method.
+Table~\ref{querycount} also shows keyword based queries as the most successful and
+headers based queries as the least successful. Despite the fact, that they were greatest 
+in number they ended with only more than a 3\% of total similarities found. Nevertheless, please
+note that the headers based queries were executed as the last, thus they were used only for
+finding undiscovered potential similarities. In order to really compare the query type performance, we
+would need to execute and evaluate them separately.
+ 
+To conclude this section we can say, that all types of queries were more or less successful. The headers based
+were executed last and in the process they were the least successful. The interesting
+ finding is the fact, that we can even greatly lower the number of executed queries.
+By omitting all of headers based queries we could lover the total number of executed queries by 45\% with only
+3.2\% of recall lost.
+%\begin{center}
+
+
+
+\begin{table}[h]
+\begin{center}
+{ \scriptsize
+\begin{tabular}{l c c c c c c c c c c c }
+\hline
+& \multicolumn{2}{c}{\strut \bf Total workload} & \multicolumn{2}{c}{\bf Time to 1st Result}&\multirow{2}{*}{\parbox{0.6cm}{\bf No \\ result}} & 
+\multicolumn{2}{c}{\bf Reported Srcs.} & \multicolumn{2}{c}{\bf Downloaded Srcs.} &
+ \multicolumn{2}{c}{\bf Retrieved Srcs.} \\
+ 
+{\bf Team \strut}&{\bf Queries}&{\bf  Downloads}&{\bf Queries}&{\bf Downloads}
+& & {\bf Prec.}&{\bf Recall}&{\bf Prec.}&{\bf Recall}&{\bf Prec.}&{\bf Recall}\\ \hline \hline
+
+\parbox{2,3cm}{\strut Gillam et al. \\ University of Surrey, UK \strut} & 63.44 & 527.41 & 4.47 &
+ 25.88 & {\bf 1} & 0.6266 & 0.2493 & 0.0182 & {\bf 0.5567} & 0.0182 & {\bf 0.5567} \\ %\hline
+ 
+\parbox{2,3cm}{\strut Jayapal \\ University of Sheffield, UK  \strut}& 67.06 & 173.47  & 8.78  & 13.50 &
+ 1 & {\bf 0.6582} & {\bf 0.2775} & 0.0709 & 0.4342 & {\bf 0.0698} & 0.4342 \\ %\hline
+ 
+ \parbox{2,3cm}{\strut Kong Leilei \\ Heilongjiang Institute of Technology,\\ China \strut } & 551.06  & 
+ 326.66        & 80.59 & 27.47 & 2 & 0.5720 & 0.2351 & 0.0178 & 0.3742 & 0.0141 & 0.3788 \\ %\hline
+ 
+\parbox{2,3cm}{\strut Palkovskii et al. \\ Zhytomyr State University, Ukraine \strut} & 63.13 &
+1026.72        & 27.28 & 318.94 & 6 & 0.4349 & 0.1203 & 0.0025 & 0.2133 & 0.0024 & 0.2133 \\ %\hline
+
+\parbox{2,3cm}{\strut \bf our approach \strut }& {\bf 12.56} & {\bf 95.41} & {\bf 1.53} & {\bf 6.28} & 2 & 0.5177 & 0.2087 & {\bf 0.0813} &
+0.3513 & 0.0094 & 0.4519 \\ \hline 
+\end{tabular}\\}
+\end{center}
+\vspace{0.3 cm}
+\caption{\footnotesize PAN 2012 candidate document retrieval results.}
+\label{candidateDocsResults}
+\end{table}
+%\end{center}
+
+Table~\ref{candidateDocsResults} shows results of PAN 2012 candidate document retrieval
+task as averages over the all 32 documents from the test corpus. Our approach led
+to obtain decent retrieval performance with the minimal total workload and minimal time to 1st result.
+Also the 80\% word match threshold in Web snippet appears to be suitable, since we
+also achieved the highest precision among downloaded sources.
+
+
+  
+