Mathematicians dream of a digital archive containing all peer-reviewed mathematical literature ever published, properly linked and validated/verified. It is estimated that the entire corpus of mathematical knowledge published over the centuries does not exceed 100,000,000 pages, an amount easily manageable by current information technologies.
Following the success of DML 2008 and DML 2009, the workshop's objectives are to formulate the strategy and goals of a global mathematical digital library and to summarize the current successes and failures of ongoing technologies and related projects, asking such questions as:
The proceedings have been published (viii+135 pages with author, name and subject indexes) by Masaryk University Press, ISBN 9788021052420. All DML proceedings have been indexed by Thomson Reuters in the Conference Proceedings Citation Index (CPCI) and by Google Scholar, and are available in digital form from the electronic archive DML-CZ. You may order a printed copy from this eshop. The best papers will be chosen for a post-conference book published by a renowned publisher or for a journal special issue [as in 2008, cf. MCS Vol 3, issue 3].
Masakazu Suzuki (Project Infty, Kyushu University, JP): Mathematical Formulae Recognition and Logical Structure Analysis of Mathematical Papers
Abstract: In most cases, current online journals in mathematics are supplied as PDF files with page images of the papers in front and hidden OCR'ed text behind, to provide keyword search. The embedded hidden text usually does not include good information about the mathematical formulae in the papers. For the future development of DML, it is desirable to include in the digitised journals more structured information about the content of mathematical papers, e.g. tag information indicating the logical structure of papers, such as section headings, definitions, theorems, lemmas, etc., together with the structure of the mathematical formulae.
In the talk, I will present the current stage of our technology for extracting such information from the scanned images of retrodigitised mathematical papers. Mechanically prepared new journals in PDF form are also a target of our research, since it is not easy to obtain a uniform structural description of mathematical formulae from, for example, the original LaTeX source, where styles and macro commands vary from author to author. Although many methods for recognizing mathematical formulae have been presented in the literature, very few applications have appeared that perform this task in a practical sense. One of the major problems in the development of math OCR is avoiding the fatal effects caused by misrecognition and missegmentation of characters and symbols. In the talk, I will first explain the method we took to overcome this difficulty. A demonstration of our software InftyReader, which recognizes mathematical documents, will also be given in the lecture. Secondly, as a better approach to recognizing a large number of pages, as in the case of DML, I will present our adaptive method for improving the recognition rates of characters/symbols, mathematical formula structures and the logical structures of articles.
(include, but are not limited to)
Petr Sojka, Michal Růžička
