The IBM Dictionary and Linguistics Tools system, codenamed “Frost”, will eventually support over 30 languages, including some Western European languages, thus consolidating the results of more than 20 years of development of lexical data and morphological analysers. The product is being developed by IBMers from several countries, in cooperation with academic communities that contribute to data development and provide linguistic expertise.
The Frost architecture provides a modular, cross-linguistic, cross-platform, and high-performance (several gigabytes per hour) base for industrial applications in Information Retrieval and Extraction, offering shallow parsing, part-of-speech tagging, morphological analysis and synonym support.

To increase performance and shorten the development cycle, specific linguistic phenomena are generalized and classified according to the computational models most suitable for processing them. For example, clitic processing in Romance languages, decomposition of solid compounds in Germanic languages, and Chinese word segmentation are all handled in Frost with a single formal computational tool. This tool is based on a special implementation of non-deterministic finite-state processing in which the backtracking logic is extracted from the finite-state machine into a separate module. The separated programming logic gives flexibility, while finite-state processing ensures high-speed string matching. Finite-state processing in this scheme is reduced to finding the hierarchy of prefixes in a deterministic finite-state dictionary that contains word-formation elements annotated with morphological, morphotactic and statistical information.
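
The division of labour described above can be illustrated with a minimal Python sketch. It is not Frost's implementation: the class `PrefixDictionary`, the function `segment`, and the four-word German lexicon are hypothetical, and the morphological, morphotactic and statistical annotations are omitted. The dictionary is deterministic and only answers "which entries are prefixes of the text at this position", while all backtracking lives in a separate function.

```python
from typing import Dict, Iterable, List, Tuple


class PrefixDictionary:
    """Deterministic trie whose only operation is 'find all prefixes'."""

    def __init__(self, entries: Iterable[str]) -> None:
        self._root: Dict = {}
        for word in entries:
            node = self._root
            for ch in word:
                node = node.setdefault(ch, {})
            node[""] = True  # end-of-entry marker (never a real character)

    def prefixes(self, text: str, start: int = 0) -> List[int]:
        """End offsets of every entry beginning at `start`, shortest first."""
        node = self._root
        ends: List[int] = []
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "" in node:
                ends.append(i + 1)
        return ends


def segment(text: str, dictionary: PrefixDictionary) -> List[List[str]]:
    """Backtracking kept outside the dictionary: list every full segmentation."""
    results: List[List[str]] = []
    stack: List[Tuple[int, List[str]]] = [(0, [])]  # explicit backtracking stack
    while stack:
        pos, parts = stack.pop()
        if pos == len(text):
            results.append(parts)
            continue
        for end in dictionary.prefixes(text, pos):
            stack.append((end, parts + [text[pos:end]]))
    return results


# Hypothetical lexicon: "staubecken" has two readings,
# stau + becken and staub + ecken.
lexicon = PrefixDictionary(["stau", "staub", "becken", "ecken"])
readings = segment("staubecken", lexicon)
```

Because the search strategy sits in `segment`, it could be swapped (for example, to rank readings by statistics stored with the entries) without touching the dictionary lookup.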

Morphological analysis in Frost is based on finite-state automata and transducers. Although finite-state devices have been present since the emergence of computer science and are extensively used in natural language processing (including speech processing), the focus has been on mathematical and algorithmic approaches to the “topology”, leaving a gap between industrial and academic research. The IBM Dictionary and Linguistics Tools team developed new approaches to analysing the performance of finite-state devices, which yielded a severalfold improvement in run time.
Frost exploits a variable node format, which allows binary search, hash tables and other programming techniques to be used in addition to the previously widespread linear search and trie structures. A format is assigned to a node according to graph-theoretic analysis and statistics on how that particular node is used in corpus processing. In addition to the performance advantages, variable node formats open the way to efficient application of finite-state processing to non-alphabetic languages.
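
As a rough illustration of a variable node format, the Python sketch below gives each node one of three lookup strategies. The class names, the function `choose_format`, and the thresholds (4 arcs, 64 arcs, 10,000 visits) are assumptions made for this example only; the criteria Frost actually applies are not specified here beyond graph-theoretic analysis and corpus usage statistics.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence, Tuple

Arc = Tuple[str, int]  # (label, target state id)


class LinearNode:
    """Few outgoing arcs: a plain scan is cheapest."""

    def __init__(self, arcs: Sequence[Arc]) -> None:
        self.arcs: List[Arc] = list(arcs)

    def step(self, label: str) -> Optional[int]:
        for arc_label, target in self.arcs:
            if arc_label == label:
                return target
        return None


class BinarySearchNode:
    """Medium fan-out: labels kept sorted and searched by bisection."""

    def __init__(self, arcs: Sequence[Arc]) -> None:
        ordered = sorted(arcs)
        self.labels = [label for label, _ in ordered]
        self.targets = [target for _, target in ordered]

    def step(self, label: str) -> Optional[int]:
        i = bisect_left(self.labels, label)
        if i < len(self.labels) and self.labels[i] == label:
            return self.targets[i]
        return None


class HashNode:
    """Very wide or very frequently visited nodes: hash-table lookup."""

    def __init__(self, arcs: Sequence[Arc]) -> None:
        self.table = dict(arcs)

    def step(self, label: str) -> Optional[int]:
        return self.table.get(label)


def choose_format(arcs: Sequence[Arc], visits: int = 0):
    """Pick a node format from fan-out and corpus visit count.

    The thresholds are illustrative assumptions, not measured values.
    """
    fan_out = len(arcs)
    if fan_out <= 4:
        return LinearNode(arcs)
    if fan_out > 64 or visits > 10_000:
        return HashNode(arcs)
    return BinarySearchNode(arcs)
```

A node whose arcs are labelled with a few hundred ideographs would fall into the hash-table case here, which hints at why per-node formats matter for non-alphabetic scripts, where a single state can have a very large fan-out.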

Another aspect of the Frost finite-state tools is that their implementation takes the architecture of modern computers into account: specifically, we use a cache- and prefetching-friendly memory representation.
Finite-state processing typically has simple access code, so the speed of memory access can be crucial for performance. Modern processors and computers include a hierarchy of data storage devices that caches frequently used data, and operating systems provide prefetching. Finite-state processing is a highly irregular type of computation, so it is hard to expect that progress in standard hardware and software caching tools will eliminate the need to adapt finite-state processing to be cache-friendly.
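
One common way to make an automaton friendlier to caches is to replace pointer-linked nodes with a few contiguous arrays. The Python sketch below shows only that layout idea and is not Frost's actual representation: the functions `build_flat` and `accepts` and the array names are hypothetical, and Python itself hides real cache behaviour, so the same arrays would have to be mapped onto a lower-level language to observe any effect. States are numbered breadth-first, so the states nearest the root sit together at the start of the arrays, and the arcs of each state occupy one contiguous run.

```python
from array import array
from collections import deque
from typing import Iterable, Tuple

Flat = Tuple[array, array, array, array]


def build_flat(words: Iterable[str]) -> Flat:
    """Build a trie, then flatten it into parallel arrays in BFS order."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # final-state marker

    first_arc = array("L", [0])  # arcs of state s: first_arc[s] .. first_arc[s + 1]
    labels = array("L")          # arc labels as Unicode code points
    targets = array("L")         # arc target state ids
    final = array("B")           # 1 if the state ends a word

    queue = deque([root])
    next_id = 1  # the root is state 0
    while queue:
        node = queue.popleft()
        final.append(1 if None in node else 0)
        for ch in sorted(key for key in node if key is not None):
            labels.append(ord(ch))
            targets.append(next_id)
            next_id += 1
            queue.append(node[ch])
        first_arc.append(len(labels))
    return first_arc, labels, targets, final


def accepts(flat: Flat, word: str) -> bool:
    """Walk the flat automaton; each step scans one contiguous run of arcs."""
    first_arc, labels, targets, final = flat
    state = 0
    for ch in word:
        code = ord(ch)
        for i in range(first_arc[state], first_arc[state + 1]):
            if labels[i] == code:
                state = targets[i]
                break
        else:
            return False  # no arc with this label
    return bool(final[state])
```

A breadth-first numbering is only one possible choice; ordering states by how often they are visited in a corpus would be another way to keep the frequently used part of the automaton compact.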
