From: Simon Suchomel
Date: Tue, 28 May 2013 13:15:43 +0000 (+0200)
Subject: First pass at filling in the text, a lot still needs to be written :)
X-Git-Tag: odeslano-20130601-2314~14
X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?p=pan13-paper.git;a=commitdiff_plain;h=060415b2f4c4f0482b6a128f8103651e1c9af823

First pass at filling in the text, a lot still needs to be written :)
---

diff --git a/pan13-paper/img/source_retrieval_process.pdf b/pan13-paper/img/source_retrieval_process.pdf
new file mode 100755
index 0000000..bc4c6b9
Binary files /dev/null and b/pan13-paper/img/source_retrieval_process.pdf differ
diff --git a/pan13-paper/pan13-notebook.aux b/pan13-paper/pan13-notebook.aux
index 863dc19..3f11c17 100644
--- a/pan13-paper/pan13-notebook.aux
+++ b/pan13-paper/pan13-notebook.aux
@@ -11,4 +11,5 @@
 \@input{yenya-text_alignment.aux}
 \bibstyle{splncs03}
 \bibdata{pan13-notebook}
-\@writefile{toc}{\contentsline {section}{\numberline {4}Conclusion}{4}}
+\bibcite{chatnoir}{1}
+\@writefile{toc}{\contentsline {section}{\numberline {4}Conclusions}{5}}
diff --git a/pan13-paper/pan13-notebook.bib b/pan13-paper/pan13-notebook.bib
index e69de29..f424bf1 100755
--- a/pan13-paper/pan13-notebook.bib
+++ b/pan13-paper/pan13-notebook.bib
@@ -0,0 +1,13 @@
+@INPROCEEDINGS{chatnoir,
+ AUTHOR = {Martin Potthast and Matthias Hagen and Benno Stein and Jan Gra{\ss}egger and Maximilian Michel and Martin Tippmann and Clement Welsch},
+ BOOKTITLE = {35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12)},
+ DOI = {},
+ EDITOR = {Bill Hersh and Jamie Callan and Yoelle Maarek and Mark Sanderson},
+ ISBN = {},
+ MONTH = aug,
+ PAGES = {},
+ PUBLISHER = {},
+ SITE = {Portland, Oregon},
+ TITLE = {{ChatNoir: A Search Engine for the ClueWeb09 Corpus}},
+ YEAR = {2012}
+}
diff --git a/pan13-paper/pan13-notebook.log b/pan13-paper/pan13-notebook.log
index 3aa6225..08cbc13 100644
--- a/pan13-paper/pan13-notebook.log
+++ b/pan13-paper/pan13-notebook.log
@@ -1,4 +1,4 @@
-This is pdfeTeXk, Version 3.141592-1.11a-2.1 (Web2C 7.5.2) (format=pdflatex 2011.8.15) 10 MAY 2013 15:21 +This is pdfeTeXk, Version 3.141592-1.11a-2.1 (Web2C 7.5.2) (format=pdflatex 2011.8.15) 28 MAY 2013 14:44 entering extended mode %&-line parsing enabled. **pan13-notebook.tex @@ -181,7 +181,7 @@ File: pdftex.def 2002/06/19 v0.03k graphics/color for pdftex \Gin@req@width=\dimen126 ) (./pan13-notebook.aux (./simon-source_retrieval.aux) -(./yenya-dtext_alignment.aux)) +(./yenya-text_alignment.aux)) \openout1 = `pan13-notebook.aux'. LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 8. @@ -281,34 +281,45 @@ red }] \openout2 = `simon-source_retrieval.aux'. - (./simon-source_retrieval.tex) [2 + (./simon-source_retrieval.tex + +File: img/source_retrieval_process.pdf Graphic file (type pdf) -] + +LaTeX Font Info: External font `cmex10' loaded for size +(Font) <9> on input line 36. +LaTeX Font Info: External font `cmex10' loaded for size +(Font) <6> on input line 36. + [2 + + <./img/source_retrieval_process.pdf>] +LaTeX Font Info: Font shape `T1/ptm/bx/n' in size <10> not available +(Font) Font shape `T1/ptm/b/n' tried instead on input line 49. +) [3] \openout2 = `yenya-text_alignment.aux'. - (./yenya-text_alignment.tex) [3 + (./yenya-text_alignment.tex) [4 -] -No file pan13-notebook.bbl. 
-[4 +] (./pan13-notebook.bbl) [5 -] (./pan13-notebook.aux (./simon-source_retrieval.aux) +] +(./pan13-notebook.aux (./simon-source_retrieval.aux) (./yenya-text_alignment.aux)) ) Here is how much of TeX's memory you used: - 1837 strings out of 94668 - 22204 string characters out of 1175711 - 76646 words of memory out of 1527888 - 4965 multiletter control sequences out of 10000+50000 - 32234 words of font info for 34 fonts, out of 1000000 for 2000 + 1868 strings out of 94668 + 22666 string characters out of 1175711 + 77666 words of memory out of 1527908 + 4987 multiletter control sequences out of 10000+50000 + 47511 words of font info for 49 fonts, out of 1000000 for 2000 458 hyphenation exceptions out of 1000 - 29i,4n,21p,221b,226s stack positions out of 5000i,500n,6000p,200000b,40000s - 22 PDF objects out of 300000 + 29i,9n,21p,221b,226s stack positions out of 5000i,500n,6000p,200000b,40000s + 56 PDF objects out of 300000 0 named destinations out of 131072 - 1 words of extra memory for PDF output out of 65536 + 6 words of extra memory for PDF output out of 65536 {/export/packages/share/texlive2003/texmf/dvips/ psnfss/8r.enc} -Output written on pan13-notebook.pdf (4 pages, 42541 bytes). +Output written on pan13-notebook.pdf (5 pages, 146423 bytes). 
diff --git a/pan13-paper/pan13-notebook.pdf b/pan13-paper/pan13-notebook.pdf index e3ba710..cbde3e1 100644 Binary files a/pan13-paper/pan13-notebook.pdf and b/pan13-paper/pan13-notebook.pdf differ diff --git a/pan13-paper/pan13-notebook.tex b/pan13-paper/pan13-notebook.tex index bfe20e2..a6d2ba3 100755 --- a/pan13-paper/pan13-notebook.tex +++ b/pan13-paper/pan13-notebook.tex @@ -33,7 +33,8 @@ The notebooks shall contain a full write-up of your approach, including all deta \include{yenya-text_alignment} -\section{Conclusion} +\section{Conclusions} + \bibliographystyle{splncs03} \begin{raggedright} diff --git a/pan13-paper/simon-source_retrieval.aux b/pan13-paper/simon-source_retrieval.aux index f5e2645..1648a02 100644 --- a/pan13-paper/simon-source_retrieval.aux +++ b/pan13-paper/simon-source_retrieval.aux @@ -1,21 +1,31 @@ \relax +\citation{chatnoir} \@writefile{toc}{\contentsline {section}{\numberline {2}Source Retrieval}{2}} +\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Source retrieval process.}}{2}} +\newlabel{fig:source_retr_process}{{1}{2}} +\@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Querying}{3}} +\@writefile{toc}{\contentsline {subsubsection}{Keywords Based Queries}{3}} +\@writefile{toc}{\contentsline {subsubsection}{Intrinsic Plagiarism Based Queries}{3}} +\@writefile{toc}{\contentsline {subsubsection}{Paragraph Based Queries}{3}} +\@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Search Control}{3}} +\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Result Selection}{3}} +\@writefile{toc}{\contentsline {subsection}{\numberline {2.4}Snippet Control}{3}} \@setckpt{simon-source_retrieval}{ -\setcounter{page}{3} +\setcounter{page}{4} \setcounter{equation}{0} \setcounter{enumi}{0} \setcounter{enumii}{0} \setcounter{enumiii}{0} \setcounter{enumiv}{0} -\setcounter{footnote}{0} +\setcounter{footnote}{2} \setcounter{mpfootnote}{0} \setcounter{part}{0} \setcounter{section}{2} 
-\setcounter{subsection}{0}
+\setcounter{subsection}{4}
 \setcounter{subsubsection}{0}
 \setcounter{paragraph}{0}
 \setcounter{subparagraph}{0}
-\setcounter{figure}{0}
+\setcounter{figure}{1}
 \setcounter{table}{0}
 \setcounter{chapter}{1}
 \setcounter{@inst}{1}
diff --git a/pan13-paper/simon-source_retrieval.tex b/pan13-paper/simon-source_retrieval.tex
index e32c191..b3289c9 100755
--- a/pan13-paper/simon-source_retrieval.tex
+++ b/pan13-paper/simon-source_retrieval.tex
@@ -1 +1,78 @@
 \section{Source Retrieval}
+Source retrieval is a subtask of the plagiarism detection process during
+which only a relatively small subset of documents is retrieved from a
+large corpus. Those candidate documents are usually further compared in detail with the
+suspicious document. In the PAN 2013 source retrieval subtask the main goal was to
+identify web pages which had been used as sources of plagiarism in the creation of the
+test corpus.
+The test corpus contained XX documents, each discussing exactly one theme.
+Those documents were created intentionally by
+ semiprofessional writers, thus they feature nearly realistic plagiarism cases.
+ Such conditions are similar to a realistic plagiarism detection scenario, such as that of
+state-of-the-art commercial plagiarism detection systems or the anti-plagiarism service developed and
+utilized at Masaryk University. The main difference between a real-world corpus
+of suspicious documents, such as a corpus created from theses stored in the Information System of Masaryk University,
+and the corpus of suspicious documents used during the PAN 2013 competition is that in the PAN
+corpus each document contains plagiarized passages. Therefore we can deepen the search during the process
+in those parts of the document where no similar passage has yet been found. This is the main
+idea behind improving the recall of detected plagiarism in a suspicious document.
+
+
+\begin{figure}
+ \centering
+ \includegraphics[width=1.00\textwidth]{img/source_retrieval_process.pdf}
+ \caption{Source retrieval process.}
+ \label{fig:source_retr_process}
+\end{figure}
+
+Online plagiarism detection can be viewed as a reverse engineering task where
+we need to find the original documents from which the plagiarized document was created.
+During that process the plagiarist locates original documents with the use of a search engine.
+The user decides which queries to submit to the search engine and which of the results from the result page to use.
+In a real-world scenario the corpus is the whole Web and the search engine can be a contemporary commercial search engine
+which scales to the size of the Web. This methodology is based on the fact that we do not
+possess enough resources to download and effectively process the whole corpus.
+In the case of the PAN 2013 competition the corpus
+of source documents is the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.
+As a document retrieval tool for the competition we utilized the ChatNoir~\cite{chatnoir} search engine, which indexes the English
+subset of the ClueWeb.
+The reverse engineering decision process resides in creating suitable queries on the basis of the suspicious document
+and in deciding what to actually download and what to report as a plagiarism case from the search results.
+
+These first two stages are shown in figure~\ref{fig:source_retr_process} as Querying and Selecting. Selected results
+from the search engine are immediately textually aligned with the suspicious document (see section~\ref{text_alignment} for more details).
+This is the last decision phase -- what to report.
+If any continuous passage of reused text is detected, the result document is reported
+ and the continuous passages in the suspicious document are marked as `discovered' and no further processing
+of those parts is done.
+
+\subsection{Querying}
+Querying means effectively utilizing the search engine in order to retrieve as many relevant
+documents as possible with the minimum number of queries. We consider a resulting document relevant
+if it shares some text characteristics with the suspicious document.
+
+We used 3 different types of queries\footnote{We used a similar three-way methodology in the PAN 2012
+Candidate Document Retrieval subtask. However, this time we completely replaced the headers based queries
+with paragraph based queries, since the headers based queries did not pay off in the overall process.}:
+i) keywords based queries, ii) intrinsic plagiarism
+based queries, and iii) paragraph based queries. Three main properties distinguish each type of query: i) Positional; ii) Phrasal; iii) Deterministic.
+Positional queries carry extra information about the textual interval in the suspicious document which the query represents.
+A phrasal query aims to retrieve documents containing the same small piece of text. Such queries are usually created from closely coupled words.
+Deterministic queries for a specific suspicious document are always the same no matter how many times we run the software.
+In contrast, the software can create potentially different nondeterministic queries in two runs.
+
+\subsubsection{Keywords Based Queries}
+
+\subsubsection{Intrinsic Plagiarism Based Queries}
+\subsubsection{Paragraph Based Queries}
+\subsection{Search Control}
+
+
+\subsection{Result Selection}
+\subsection{Snippet Control}
+
+
+
+
+
+
diff --git a/pan13-paper/yenya-text_alignment.aux b/pan13-paper/yenya-text_alignment.aux
index 0cc5e52..566eeb0 100644
--- a/pan13-paper/yenya-text_alignment.aux
+++ b/pan13-paper/yenya-text_alignment.aux
@@ -1,13 +1,14 @@
 \relax
-\@writefile{toc}{\contentsline {section}{\numberline {3}Text Alignment}{3}}
+\@writefile{toc}{\contentsline {section}{\numberline {3}Text Alignment}{4}}
+\newlabel{text_alignment}{{3}{4}}
 \@setckpt{yenya-text_alignment}{
-\setcounter{page}{4}
+\setcounter{page}{5}
 \setcounter{equation}{0}
 \setcounter{enumi}{0}
 \setcounter{enumii}{0}
 \setcounter{enumiii}{0}
 \setcounter{enumiv}{0}
-\setcounter{footnote}{0}
+\setcounter{footnote}{2}
 \setcounter{mpfootnote}{0}
 \setcounter{part}{0}
 \setcounter{section}{3}
@@ -15,7 +16,7 @@
 \setcounter{subsubsection}{0}
 \setcounter{paragraph}{0}
 \setcounter{subparagraph}{0}
-\setcounter{figure}{0}
+\setcounter{figure}{1}
 \setcounter{table}{0}
 \setcounter{chapter}{1}
 \setcounter{@inst}{1}
diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index c3c6c49..7a93e50 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -1 +1 @@
-\section{Text Alignment}
+\section{Text Alignment}~\label{text_alignment}