% -*- mode: c++ -*- \datethis % print date @** Introduction. This is PATLIB, PATtern LIBrary. This library stores a finite language using the packed dynamic trie and manipulates patterns. It is used to write a pattern generator. This library is written in ANSI C++ using the standard template library. Written and tested on Linux with glibc-2.2.2 and gcc-2.96. This library should work with any compiler supporting the STL and ANSI C++. \medskip Written and maintained by David Anto\v s, {\tt xantos (at) fi.muni.cz} \medskip Copyright 2001 David Anto\v s \medskip You may use and redistribute this software under the terms of General Public License. As this software is distributed free of charge, there is no warranty for the program. The entire risk of this program is with you. The author does not want to forbid anyone to use this software, nevertheless the author considers any military usage unmoral and unethical. \medskip The |ptl_vers.h| header file defines the version number (change it whenever the library changes) and the CVS identification string. @(ptl_vers.h@>= #ifndef PTL_VERS_H #define PTL_VERS_H const char* patlib_version="1.0"; const char* patlib_cvs_id="$Id: patlib.w,v 1.127 2001/12/04 08:10:22 antos Exp $"; #endif @ Organization of the code. The code is highly templatized, in fact there is nothing without templates. Therefore the library is divided into header files which contain all the code. The header file names look like {\tt ptl\_.h} where {\tt } is a shorthand of the full section name. The header files correspond usually to the ``starred'' sections in the CWEB source. The header files define a pre-processor variable {\tt PTL\_\_H} which is tested so as not to load the same header file more then once. The library has two more or less independent parts. The first part, the finite language store, may be used separately. It provides methods to store a finite language in the packed version of trie memory. This may be also used to store pattern set in the application using patterns. The second part, the generator, takes input data with marked points of interest and creates a set of patterns which recognize the points of interest there. The generator uses the finite language store to keep the pattern candidates and the patterns. @ Dealing with bugs in the library. The software is far from being perfect, so my English is. Note that a library like this is quite a specialized tool. The number of users is small, so if you find a bug, please report it to {\tt xantos (at) fi.muni.cz}. Keep in mind that if {\it you\/} don't tell me, maybe nobody will. Please tell me exactly your platform, compiler, the problem you've fallen into. Please always attach the revision control id you can see above. So we consider any behaviour different that the documentation says to be a bug. Feel free to send me any notes and remarks, they will be taken seriously. Ask anything you want to. And something more! This code is complicated, nevertheless the commentary should be clear as much as possible. If the commentary does not match the program, it is a bug. It is a big bug. So please report it. If the comments are not clear, if they are messy, it is my fault, not yours. Do not sit, do not be depressed, complain! (But please do not mail me before you read it three times :-).) @f iterator int @f const_iterator int @q C++ definitions:@> @i c++lib.w @** Exception handling in the library. When anything unpleasant happens the library throws the |Patlib_error| exception. The exception class is an offspring of |string| and it has |what(void)| function defined. The |what| function prints the text the exception has been constructed with to the standard error output. The header also includes the |exception| although it doesn't use it. @(ptl_exc.h@>= #ifndef PTL_EXC_H #define PTL_EXC_H #include #include #include class Patlib_error: public std::string { public: Patlib_error(string s): std::string(s) { // the constructor } void what(void) { // output of the exception text cout<<*this; } }; #endif @** Finite language store. The first thing we need to do is to prepare the way to store a finite language. For the purpose of pattern generating the trie data structure seems to be a good choice as it behaves well on not very long words. @ We start with the Growing array object. This serves as a potentially infinite array, simply said it allocates more memory when touching a previously unaccessed field. It makes implementing the pattern manipulator easier as we don't have to take care of memory overflow. @ The interesting things come next. The Trie pattern manipulator is a finite language store where each word may have an output information attached to its end. The simplest case, storing a dictionary, means that the output information is a boolean value, |true| as end of the word. The main public-accessible methods include inserting a pattern (a word, if you want to), ``walk through''---it means getting the language word by word, getting outputs of words, and some optimizing the structure after delete operations. Sometimes the one ``output field'' for a word may suit us, sometimes we need pairs of values. Therefore we provide the Candidate count trie. The name goes from the main usage of that structure in our generator, we need to store pairs of numbers when preparing candidate patterns. Nevertheless the structure may be used with any types it can be compiled with, practically with numerical ones. The methods here makes only access to the pairs of values easier. @ The ``heavy weights'' follow. The word may also have multiple outputs, e.g., on different positions. We provide the output store. We suppose the number of different outputs will be relatively small, therefore the manipulator with multi-output words consists of two parts, the word manipulator and the output manipulator, the outputs of a word become iterators to the output manipulator. The Multi output pattern manipulator service is an envelope hiding the inner parts. It makes the interface to word manipulations known from the word manipulator itself and adds several ways of deleting patterns depending on their outputs. The offspring, competitive variant of the manipulator, supposes moreover the output alphabet to be sorted. It provides methods to compute the competitive variant of the pattern output, it means that the higher values win over the smaller ones. @ And for completeness, we have to drop a word on the simple translate service. When testing applications where the word alphabet is simple we often need the alphabet to be ``translated'' into $\{1,\dots,n\}$ numeric set. The service does it simply and slowly. @* Growing array. Growing array is a service for trie pattern manipulator. @(ptl_ga.h@>= #ifndef PTL_GA_H /* To include the header only once */ #define PTL_GA_H #include #include #include "ptl_exc.h" @@; #endif @ Growing array. We need an automatically growing array of given type. We want to add to the array mostly, deallocating is done at the end of work. The growing array is addressed by linear address and it grows when addressing previously unaccessed member. Addressing is done with overloaded `|[]|' operator. We implement it as a vector of members. When a new member is added, it is initialized to the value given to the constructor. Therefore we need the |Tmember| object to have a `|=|' copy constructor. The |Tindex| type is a type of array index (for avoiding compilation warnings, and for historical reasons), |Tmember| is a type of member. @f Tindex int @f Tmember int @= template @/ class Growing_array:@/ public std::vector { protected:@/ const Tmember default_value;@# @@; @@; @@; }; @ The constructor prepares the default value and constructs the vector. @= public:@/ Growing_array(const Tmember& def_val): std::vector(), default_value(def_val)@/ { } @ Operator |[]| accesses the array like an usual array, if the accessed member is not allocated, it prepares it. @= public:@/ inline Tmember& operator[] (const Tindex& logical_addr) { try { while (logical_addr >= std::vector::size()) std::vector::push_back(default_value); return std::vector::operator[](logical_addr); } catch (...) { cerr<::size()<::capacity()<< endl; cerr<<" Member size (bytes) "<= public:@/ void print_statistics(void) const { cout<<" No. of members used "<::size()<::capacity()<= #ifndef PTL_TPM_H #define PTL_TPM_H #include #include #include #include "ptl_exc.h" #include "ptl_ga.h" @@; @@; #endif @* Dynamic packed trie pattern manipulator. This is the implementation of pattern manipulator. For a pattern $p_1\ldots p_k$, the information associated with that pattern is accessed by setting $t_1 = |trie_root|+p_1$ and then, for $1 trie_char; Growing_array trie_link; Growing_array trie_back; Growing_array trie_base_used; /* one byte per a bit is a feasible solution (we cannot use |vector|) */ Growing_array trie_outp; @# unsigned q_max; unsigned q_max_thresh; Tin_alph* trieq_char; Tpm_pointer* trieq_link; Tpm_pointer* trieq_back; Tout_information* trieq_outp; @ Several values may be obtained and set by users. The |q_max_thresh| variable controls the |first_fit| packing, the ``count'' variables are for statistics only. @= public:@/ virtual unsigned get_q_max_thresh(void) const { return q_max_thresh; } @# virtual void set_q_max_thresh(const unsigned& new_q_m_t) { // |q_max_thresh| must be at least 1 (even though 1 is quite a stupid choice) if (new_q_m_t > 0) q_max_thresh = new_q_m_t; } @# virtual Tpm_pointer get_trie_count(void) const // get the number of occupied nodes { return trie_count; } @# virtual Tpm_pointer get_pat_count(void) const // get the number of patterns { return pat_count; } @# virtual Tpm_pointer get_max_in_alph(void) const // get the maximum input alphabet number { return max_in_alph; } @ Initially, the dynamic packed trie has the only state, the root, with all transitions present but with null links. We have to set up only the |trie_char| array as the other arrays are set to the default values when initializating the arrays. @= public:@/ Trie_pattern_manipulator(const Tin_alph& max_i_a, const Tout_information& out_i_z, const unsigned& q_thr = 5): // set default values to embedded objects max_in_alph(max_i_a), out_inf_zero(out_i_z), trie_char(min_in_alph), trie_link(min_in_alph), trie_back(min_in_alph), trie_base_used(false), trie_outp(out_inf_zero), q_max_thresh(q_thr)@/ { for (Tpm_pointer c = min_in_alph; c <= max_in_alph; c++) { trie_char[trie_root + c] = c; } /* In this cycle there was a very nice bug. The type of |c| was |Tin_alph|. The problem appeared when |Tin_alph| become |unsigned char| and |max_in_alph| was 255. This makes infinite cycle then\dots. This bug lived here more than half a year. BTW, I discovered it on Friday, 13 July 2001. And when writing this text, my Emacs dumped core. Strange :-) */ trie_base_used[trie_root] = true; trie_bmax = trie_root; trie_max = trie_root + max_in_alph; @# trie_count = max_in_alph + 1; pat_count = 0; @# trie_link[0] = trie_max + 1; trie_back[trie_max + 1] = 0; @# trieq_char = new Tin_alph[max_in_alph + 1]; // init trieq arrays trieq_link = new Tpm_pointer[max_in_alph + 1]; trieq_back = new Tpm_pointer[max_in_alph + 1]; trieq_outp = new Tout_information[max_in_alph + 1]; } @ We have to destroy dynamic arrays in the destructor. @= public:@/ virtual ~Trie_pattern_manipulator() { delete []trieq_char; delete []trieq_link; delete []trieq_back; delete []trieq_outp; } @ The |first_fit| procedure find a hole on the packed trie into which state in |trieq_*| arrays will fit. This is normally done by going through the linked list of unoccupied cells and testing if the state will fit at each position. However if the state is too dense, we just pack it to the first unoccupied region (at |trie_max+1|). The state is considered dense if it has more than |q_max_thresh| states. @= protected:@/ virtual Tpm_pointer first_fit(void) {@/ unsigned int q; Tpm_pointer s, t;@# @ // pack it for (q = 1; q <= q_max; q++) { t = s + trieq_char[q]; // link around filled cell trie_link[trie_back[t]] = trie_link[t]; trie_back[trie_link[t]] = trie_back[t]; trie_char[t] = trieq_char[q]; trie_link[t] = trieq_link[q]; trie_back[t] = trieq_back[q]; trie_outp[t] = trieq_outp[q]; if (t > trie_max) { trie_max = t; } } trie_base_used[s] = true; return s;@/ } @ The threshold is used to decide what to do. We may pack the state to the end of used region or pack it among other transitions. @= if (q_max > q_max_thresh) { t = trie_back[trie_max + 1]; }@+ else@+ { t = 0; } while (1) { // I don't like goto, but do it efficiently without it: t = trie_link[t]; // get next unoccupied cell s = t - trieq_char[1];@/ @ if (trie_base_used[s]) { goto not_found; } for (q = q_max; q >= 2; q--) { // check if state fits there if (trie_char[s + trieq_char[q]] != min_in_alph) { goto not_found; } } goto found; not_found: ; // go to the next loop } found: ; @ The trie is only initialized (as a doubly linked list of empty cells) as far as necessary. Here we extend the initialization. @= while (trie_bmax < s) { trie_bmax++; trie_base_used[trie_bmax] = false; trie_char[trie_bmax + max_in_alph] = min_in_alph; trie_outp[trie_bmax + max_in_alph] = out_inf_zero; // is this necessary? it is done by growing array! trie_link[trie_bmax + max_in_alph] = trie_bmax + max_in_alph + 1; trie_back[trie_bmax + max_in_alph + 1] = trie_bmax + max_in_alph; } @ The |unpack| procedure finds all transitions associated with the state based on |s| and puts them into |trieq_*| arrays and sets |q_max| to one more than the number of transitions found. Freed cells are put at the beginning of the free list. @= protected:@/ virtual void unpack(const Tpm_pointer& s) { Tpm_pointer t;@# q_max = 1; for (Tpm_pointer c = min_in_alph; c <= max_in_alph; c++) { t = s + c; if (trie_char[t] == c && c != min_in_alph) { // found one trieq_char[q_max] = c; trieq_link[q_max] = trie_link[t]; trieq_back[q_max] = trie_back[t]; trieq_outp[q_max] = trie_outp[t]; q_max++; // now free trie node trie_back[trie_link[0]] = t; trie_link[t] = trie_link[0]; trie_link[0] = t; trie_back[t] = 0; trie_char[t] = min_in_alph; trie_outp[t] = out_inf_zero; // not needed } } trie_base_used[s] = false; } @ Here is a procedure that inserts a pattern into the pattern trie. The pattern is a |vector| of |Tin_alph| symbols and adjacent |Tout_information| is put to the end of the pattern in the trie. By hard inserting we mean that previous output is rewritten. We sometimes want to add to the output, so we have to get the last output of the word, change the output and ``hard'' insert it back. @= public:@/ virtual void hard_insert_pattern(const vector& w, const Tout_information& o) { if (w.empty()) return; /* if you want to insert a void pattern, we will ignore you */ Tpm_pointer s, t; vector::const_iterator i = w.begin(); s = trie_root + *i; t = trie_link[s]; @# while ((t > 0) && ((i + 1) != w.end())) { // follow existing trie i++; t += *i; if (trie_char[t] != *i) { @@; } s = t; t = trie_link[s]; } trieq_link[1] = 0; trieq_back[1] = 0; trieq_outp[1] = out_inf_zero; q_max = 1;@# while ((i + 1) != w.end()) { // insert rest of pattern i++; trieq_char[1] = *i; t = first_fit(); trie_link[s] = t; s = t + *i; trie_count++; } @# if ((trie_outp[s] == out_inf_zero) && (o != out_inf_zero)) { // we rewrite ``no-pattern'' by ``pattern'' pat_count++; } if ((trie_outp[s] != out_inf_zero) && (o == out_inf_zero)) { // we rewrite ``pattern'' by ``no-pattern'', we delete in fact pat_count--; } @# trie_outp[s] = o; } @ We have accessed transition not in the trie. We insert it, repacking the state if necessary. @= if (trie_char[t] == min_in_alph) { // no repacking needed trie_link[trie_back[t]] = trie_link[t]; trie_back[trie_link[t]] = trie_back[t]; trie_char[t] = *i; trie_link[t] = 0; trie_back[t] = 0; trie_outp[t] = out_inf_zero; if (t > trie_max) { trie_max = t; } } else { // whoops, have to repack unpack(t - *i); trieq_char[q_max] = *i; trieq_link[q_max] = 0; trieq_back[q_max] = 0; trieq_outp[q_max] = out_inf_zero; t = first_fit(); trie_link[s] = t; t += *i; } trie_count++; @ Walking through the pattern manipulator language. This is something like an ``iterator'', we init walk through and then we may get patterns and their outputs iteratively, one by one. We use a stack where we store current path in the manipulator. The |init_walk_through| procedure sets the initial stack values to start searching. @= protected:@/ vector pointer_stack; vector char_stack; public:@/ virtual void init_walk_through() { pointer_stack.clear(); char_stack.clear(); pointer_stack.push_back(trie_root); char_stack.push_back(min_in_alph); } @ Getting the next pattern is here. When we reach state having output we stop depth-first-search and return the values. When |get_next_pattern| is called again, it finds next pattern. If find was successful, |true| is returned by |get_next_pattern|, otherwise |false|. It allows to use construction |while (get_next_pattern(...))|. The pattern and its output is returned in |w| and |o| variables. Moreover when no more pattern remains (|false| is returned), the |w| vector is emptied and |o| is set to |out_inf_zero|. Important: Inserting and deleting patterns may interfere with the walk-through process. Nevertheless you may change the output of an {\it existing\/} pattern independently on the walk-through process using |hard_insert_pattern|. And never forget to use |init_walk_through| first! @= public:@/ virtual bool get_next_pattern(vector& w, Tout_information& o) { Tpm_pointer t, tstart; Tpm_pointer c, cstart; @# w.clear(); o = out_inf_zero; @# while(true) {@/ if (pointer_stack.empty()) return false; tstart = pointer_stack.back(); pointer_stack.pop_back(); /* we have where to go */ cstart = char_stack.back(); char_stack.pop_back(); @# for (c = cstart; c <= max_in_alph; c++) { // find transitions belonging to this family t = tstart + c; @# if (trie_char[t] == c && c != min_in_alph) { // found one pointer_stack.push_back(tstart); char_stack.push_back(c + 1); if (trie_outp[t] != out_inf_zero) { // state with output @@; // state found successfully @@; return true; } @@; break;@/ }@/ }@/ }@/ } @ Prepare the way to go deeper in the trie. @= if (trie_link[t] != 0) { // we have to go deeper, if possible pointer_stack.push_back(trie_link[t]); char_stack.push_back(min_in_alph); } @ We have come to the end of a pattern, so we now output it. We fill the |w| vector with the pattern and the |o| with the output of the pattern. Note that the last stack value is ignored by this routine as it shows where to go next. The value in |char_stack| is the character plus one. @= vector::iterator it = pointer_stack.begin(); vector::iterator ic = char_stack.begin(); Tpm_pointer u; // now let's go to the last-to-end stack character for (Tpm_pointer i = 0; i < pointer_stack.size(); i++) { u = (*it) + Tpm_pointer(*ic) - 1; // in stack, there is one more w.push_back(trie_char[u]); it++; ic++; } o = trie_outp[u]; @ This is a procedure to produce output of pattern on given word. All outputs of patterns matching the beginning of the |w| are pushed back to the |o| vector to the same position. If no output is associated with this transition, |o| vector is filled with |out_inf_zero|. If the |w| is longer than any matching pattern, the |o| vector is filled to the same length with the |out_inf_zero| value. Note that this is not the same as ``hyphenate'' procedure, we need to use |word_output| with all postfixes of the word to get complete information. This function must nevertheless be implemented in application layer of pattern generator as the pattern manipulator does not know anything about the semantics of output information. So we are not able to make the patterns outputs compete here, we do not know the order of the symbols. @= public:@/ void word_output(const vector& w, vector& o) { Tpm_pointer t; vector::const_iterator i = w.begin(); o.clear();@# if (w.empty()) return;@# t = trie_root;@# do { // follow trie and get the outputs t += *i; if (trie_char[t] == *i) o.push_back(trie_outp[t]); else break; t = trie_link[t]; i++; } while (t != 0 && i != w.end());@# while (i != w.end()) { // fill outputs to the same length o.push_back(out_inf_zero); i++; } } @ This procedure gives only the last output of a word. It is equivalent to previous procedure if you use only the last field of |o| vector, but this is optimized. So the last output is given, if no output of the word exists, |out_inf_zero| is returned. @= public:@/ void word_last_output(const vector& w, Tout_information& o) { #if 0==1 // unoptimized version o = out_inf_zero; vector whole_o; word_output(w, whole_o); if (whole_o.size() >= 1) o = *(whole_o.end() - 1); #endif Tpm_pointer s,t; vector::const_iterator i = w.begin(); o = out_inf_zero;@# if (w.empty()) return;@# t = trie_root; // follow the trie do { t += *i; if (trie_char[t] == *i) s = t; else break; t = trie_link[t]; i++; } while (t != 0 && i != w.end()); // if we are at the end of the word, we have the output if (i == w.end()) { o = trie_outp[s]; } } @ We delete patterns by overwriting their outputs by |out_inf_zero|. Hanging parts of patterns stay in their places. They may be removed at once if we want to reduce the number of occupied cells. The |delete_hanging| procedure removes the nodes which have no output or continue by branches with no output. The following recursive procedure is a subroutine run from the root node of the trie by public-accessible procedure |delete_hanging| and returns |true| if and only if entire subtrie is removed. @= private:@/ virtual bool delete_hanging_level(const Tpm_pointer& s) { Tpm_pointer t; bool all_freed;@# all_freed = true; for (Tpm_pointer c = min_in_alph; c <= max_in_alph; c++) { // find transitions belonging to this family t = s + c; if (trie_char[t] == c && c != min_in_alph) { // found one if (trie_link[t] != 0) { if (delete_hanging_level(trie_link[t])) trie_link[t] = 0; } if ((trie_link[t] != 0) || (trie_outp[t] != out_inf_zero) || (s == trie_root)) all_freed = false; else @@; } } if (all_freed) { // entire state is freed trie_base_used[s] = false; } return all_freed; } public:@/ virtual void delete_hanging(void) { delete_hanging_level(trie_root); } @ Cells freed by |delete_hanging_level| are put at the end of the free list. @= { trie_link[trie_back[trie_max + 1]] = t; trie_back[t] = trie_back[trie_max + 1]; trie_link[t] = trie_max + 1; trie_back[trie_max + 1] = t; trie_char[t] = min_in_alph; trie_count--; } @ We sometimes need to know all the outputs used in the pattern manipulator. We may need it for example to clean complicated output object which are no longer referenced. The |set| is first cleared and then filled up with all the values used as outputs. @= public:@/ void set_of_my_outputs(set &s) { s.clear(); for (Tpm_pointer i = 0; i <= trie_max; i++) { s.insert(trie_outp[i]); } } @ Print statistics on trie structure. If |detail| is non-zero, statistics on manipulator internal structures are also printed. @= public:@/ void print_statistics(int detail = 0) const { cout<<" nodes: "<= template @/ class Candidate_count_trie:@/ public Trie_pattern_manipulator > {@/ public:@/ typedef pair Tcount_pair; @@/ @@/ @@/ }; @ We have to initialize the default values. The default packing threshold is less (at least we recommend it) then in |Trie_pattern_manipulator| as speed is more important here. @= public:@/ Candidate_count_trie(const Tin_alph& max_i_a, const Tcount_good& out_i_z_good, const Tcount_bad& out_i_z_bad, const unsigned& q_thr = 3): Trie_pattern_manipulator (max_i_a, make_pair(out_i_z_good, out_i_z_bad), q_thr)@/ { } @ The following procedures increment the good and the bad counts in patterns. If you want to change the only one parameter, change the other with zero :-). If the pattern has not been in the trie, it is inserted and the counts are added to the default values. The procedure may be optimized, we may manipulate the data directly. In the way we do it the trie must be traversed twice. This is conceptually nicer. @= public:@/ virtual void increment_counts(const vector& w, const Tcount_good& good_inc, const Tcount_bad& bad_inc) { Tcount_pair counts; word_last_output(w, counts); counts.first += good_inc; counts.second += bad_inc; hard_insert_pattern(w, counts); /* now we overwrite the output or insert it if it was void */ } @ Get next pattern overloads the procedure from the |Trie_pattern_manipulator|, the |good| and |bad| counts are returned instead of a pair of them. Please see discussion on the original method to learn how to use it. Do not forget to run |init_walk_through| first! @= public:@/ virtual bool get_next_pattern(vector& w, Tcount_good& good, Tcount_bad& bad) { Tcount_pair counts; bool ret_val; ret_val = Trie_pattern_manipulator >::get_next_pattern(w, counts); good = counts.first; bad = counts.second; return ret_val; } @* Multiple output pattern handling. The pattern manipulators we have seen in previous sections are able to keep the only (well, sometimes a pair of) thing as the output. But we need outputs on different positions of a word. Therefore we provide simple methods to keep outputs of word positions and the manipulator with such kind of language. @(ptl_mopm.h@>= #ifndef PTL_MOPM_H #define PTL_MOPM_H #include #include #include #include "ptl_tpm.h" @@; @@; @@; @@; #endif @* Multiple pattern outputs. The pattern manipulator recognizes the end of a word by having the output on that position. We have to store information on what position of the pattern the output is related to, e.g.~the pattern is $qwerty(2,4)$ means that the pattern $qwerty$ has output~$4$ after the second character, the usual form of writing this is~$qw4erty$. Another complication is that the input word may have several outputs, so we have to store {\it sets\/} of outputs instead of the outputs themselves. If we have pattern~$qw4ert5y$, we store it as~$qwerty\{(2,4),(5,5)\}$ into the manipulator. We will do it really easily, we inherit all the operations we need from the standard container |multimap|. The |Tposition| is the position in a pattern (convention: the output before the first character is 0th, after the first character is 1st and so on). The |Toutput| is the real output of the position. @f Tposition int @f Toutput int @= template @/ class Outputs_of_a_pattern:@/ public std::multimap@/ { }; @ We have not said the whole story yet. We may now store the previous objects as outputs of the |Trie_pattern_manipulator|. But the |Outputs_of_a_pattern| objects are quite large and there will not be a lot of different ones. Therefore we put the |Outputs_of_a_pattern| into a |set| and we store iterators to the |set| as the trie outputs. The |set| iterators are guaranteed not to change when |set| operations are performed. (Of course the iterator becomes invalid if you delete a |set| member it points to, but nobody makes us do it.) If the number of different outputs is supposed to be high we should not use this and we would better make the previously defined objects as outputs of the pattern manipulator. Here we inherit the |set|. We also need the empty member to be present all the time. Therefore we provide the |empty_iter| value and in the constructor we create the empty member. The |empty_iter| value may be used as ``no output is here'' special value for the pattern manipulator. Of course, never remove the empty output field from the object. Also printing statistics is here. @= template @/ class Outputs_of_patterns:@/ public std::set >@/ { private:@/ Outputs_of_patterns::iterator empty_iter; @# public:@/ Outputs_of_patterns(void) { Outputs_of_a_pattern empty; empty_iter = (this->insert(empty)).first; } @# Outputs_of_patterns::iterator get_empty_iter(void) const { return empty_iter; } @# void print_statistics() const { cout<<" number of different outputs: "<= template @/ class Multi_output_pattern_manipulator { @@/ @@/ @@# @@/ @@/ @@/ @@/ @@/ @@/ @@/ @@/ @@/ }; @ Data structures of multi-output pattern manipulator. We have a pattern trie manipulator to hold input words and an output set to hold their outputs. The pattern manipulator has iterators to the set as the output information associated with a word. @= protected:@/ typedef Outputs_of_patterns::iterator Tout_iter; Outputs_of_patterns outputs; Trie_pattern_manipulator words; @ In the constructor, we have to set initial values to the embedded objects. The |words| manipulator is initialized with the |max_in_alph| value and the empty set iterator as the ``zero output information.'' Once more: never remove the empty field from the |outputs| set, or strange things will happen. The other constructor copies the manipulator. The ``logical'' content of the original (|old|) object is copied, so the copy may use less memory. Nevertheless the copying process is quite time-consuming as it is linear in number of all (position, value) pairs in the manipulator's language. The |old| object is passed by value and cannot be |const| as the walk-through procedure changes it's hidden data structures. But the language of |old| is not changed. This copying may be also used to decrease the number of unused outputs of patterns as it is no other reasonable way to delete them, so we let them rest in peace even if they are no longer used. @= public:@/ Multi_output_pattern_manipulator(const Tin_alph &max_i_a): outputs(), words(max_i_a, outputs.get_empty_iter())@/ { } @# Multi_output_pattern_manipulator(Multi_output_pattern_manipulator &old): outputs(), words(old.get_max_in_alph(), outputs.get_empty_iter())@/ { vector w; Outputs_of_a_pattern o; old.init_walk_through(); while (old.get_next_pattern(w, o)) for (Outputs_of_a_pattern::iterator i = o.begin(); i != o.end(); i++) this->insert_pattern(w, i->first, i->second); } @# virtual ~Multi_output_pattern_manipulator() { // nothing to do here } @ Several values may be public-available. Self-commenting, isn't it? @q'@> @= public:@/ virtual Tindex get_max_in_alph(void) const { return words.get_max_in_alph(); } @# virtual Tindex get_pat_count(void) { return words.get_pat_count(); } @ Walking through the manipulator language. First, call the |init_walk_through| method and then you may |get_next_pattern|. The |get_next_pattern| procedure returns patterns with their outputs one by one. The return value is |true|. When no pattern remains, |false| is returned and |w| is an empty vector, |o| is empty, too. This process may be broken by deleting and inserting patterns. The operation of changing outputs of an {\it existing\/} word is safe. For full details, consult documentation of |get_next_pattern| method of the parent class. And again (it starts to be boring): never forget to call |init_walk_through| first. @= public:@/ inline virtual void init_walk_through(void) { words.init_walk_through(); } @# virtual bool get_next_pattern(vector &w, Outputs_of_a_pattern &o) { bool ret_val; Outputs_of_patterns::iterator i; ret_val = words.get_next_pattern(w, i); o = *i; return ret_val; } @ If we want to know the output of a word, we fill a |multimap| with (position, value) pairs. There may be more values per position, we cannot choose the highest one as we do not know the order. Note this is not the ``hyphenate'' procedure, this returns outputs of all patterns matching the beginning of the |w|. Well, how shall we do it? We get all outputs of the |w| from the |words| manipulator. It is a |vector| of |set| |iterator|s. Dereferencing them, we get |multimap|s having pairs (position, value). We collect them and put them into the |multimap|. FIXME: It would be nicer to use a bit more optimal structure than vector when asking the |words| for outputs. There is a lot of |out_inf_zero| paddings. @= public:@/ void word_output(const vector &w, Outputs_of_a_pattern &o) { vector out_pointers; // iterators to outputs words.word_output(w, out_pointers); o.clear(); // no outputs for (vector::const_iterator i = out_pointers.begin(); i != out_pointers.end(); i++) { // go through all outputs for (Outputs_of_a_pattern::const_iterator j = (*i)->begin(); j != (*i)->end(); j++) { // go through the map o.insert(*j); } } } @ We may also want only the last output. We get output of the |w| from the |words| manipulator. It is a |set| |iterator|. Dereferencing it, we get a |multimap| having pairs (position, value). We put them into the |multimap|. @= public:@/ void word_last_output(const vector &w, Outputs_of_a_pattern &o) { Outputs_of_patterns::iterator i; words.word_last_output(w, i); o = *i; } @ We insert patterns in this way. We find the previous output set of the pattern, we copy the outputs, update them and insert into the set again. We are not allowed to do it in-place as we cannot change outputs of other patterns. The |p| parameter is word position associated with the |v| value. If the |with_erase| is |true| the previous outputs are deleted. The value defaults to |false|. @= public:@/ void insert_pattern(const vector &w, const Tindex &p, const Tout_alph &v, bool with_erase = false) { Outputs_of_a_pattern o; word_last_output(w, o); // now we have the outputs of the word if (with_erase) o.erase(p); o.insert(make_pair(p, v)); // we add the new output to the copy and put it back words.hard_insert_pattern(w, outputs.insert(o).first); } @ Deleting patterns may be done in several ways. Here all outputs with given value |v| are removed. Nothing else is changed. (It is PATGENs deleting bad patterns.) We walk through all the words, all their outputs. We take a copy of the output, delete unwanted values from there (by copying again) and put the output back using the |hard_insert_pattern| method of |word|. This overwrites the outputs at once. @= public:@/ void delete_values(const Tout_alph &v) { vector w; Outputs_of_a_pattern o; Outputs_of_a_pattern n; init_walk_through(); while (get_next_pattern(w, o)) { n.clear(); for (Outputs_of_a_pattern::iterator i = o.begin(); i != o.end(); i++) if (i->second != v) n.insert(*i); // delete given output words.hard_insert_pattern(w, outputs.insert(n).first); // put it back } } @ We may also want to delete an output of a pattern on given position. This deletes all the output(s) of word |w| on position |p|. @= public:@/ void delete_position(const vector &w, const Tindex &p) { Outputs_of_a_pattern o; word_last_output(w, o); // outputs of a word o.erase(p); // erase all outputs with |p| as the index words.hard_insert_pattern(w, outputs.insert(o).first); // put it back } @ The |delete_pattern| procedure removes all outputs of a word |w|. @= public:@/ void delete_pattern(vector &w) { words.hard_insert_pattern(w, outputs.get_empty_iter()); } @ The |delete_hanging| procedure removes the unused manipulator fields which have been left on their places after pattern deletions. This does not change the manipulator language. Now the procedure does not dispose the outputs. We assume that the number of deleted outputs is much smaller to the number of outputs and the deleting procedure would be quite expensive. @= public:@/ virtual void delete_hanging(void) { words.delete_hanging(); } @ Printing statistics (ehm, nothing to say here). @= public:@/ void print_statistics(void) const { words.print_statistics(); outputs.print_statistics(); } @* Competitive multi output pattern manipulator. The multi output pattern manipulator does not know anything about the order of output symbols. To make the pattern output handling easy and to avoid unpleasant pattern output constructing in higher application levels, we provide following object enlarging the capabilities of the manipulator. The output information is a number (with standard order) in most cases, so this will be usually more suitable solution for multi output pattern handling. Now we suppose the |Tout_alph| to have the ``|<|'' operation defined. The knowledge of order of the symbols lets us to make the |competitive_pattern_output| routine with at most one output information per position. I personally prefer longer names, but overfull hboxes made me to use such a terrible abbreviation for this class. @f Tindex int @f Tin_alph int @f Tout_alph int @= template@/ class Competitive_multi_out_pat_manip:@/ public Multi_output_pattern_manipulator { @@/ @@/ @@/ }; @ Constructor and destructor have nothing interesting. They propagate the values to their parents. Also the copy constructor may be used to rebuild the data structures in more efficient way. @= public:@/ Competitive_multi_out_pat_manip(const Tin_alph &max_i_a): Multi_output_pattern_manipulator(max_i_a) { } Competitive_multi_out_pat_manip(Competitive_multi_out_pat_manip &old): Multi_output_pattern_manipulator(old) { } ~Competitive_multi_out_pat_manip() { // nothing to do here } @ We need to collect the intercharacter values in a bit different way, we take the new one if and only if the new one wins over the old one. Here we compute the word output |o| for the word |w|. Therefore we put the shift |s| value so that the position information may be stored according to the original word. Strictly speaking, all the positions in the word outputs are increased with |s|, the values are not affected by this process. Note that this method does not delete the previously collected outputs, only changes the ones on appropriate positions. Go on reading the next method to understand what is this index mess good for. Only values $<$|ignore_bigger| are collected and taken into account. @= protected:@/ void competitive_word_output(const vector &w, Outputs_of_a_pattern &o, const Tindex &s, const Tout_alph &ignore_bigger) { vector out_pointers; // iterators to outputs Outputs_of_a_pattern::iterator oi; words.word_output(w, out_pointers); for (vector::const_iterator i = out_pointers.begin(); i != out_pointers.end(); i++) { // go through all outputs for (Outputs_of_a_pattern::const_iterator j = (*i)->begin(); j != (*i)->end(); j++) { // go through the map if (j->second >= ignore_bigger) continue; // value to ignore oi = o.find(s + (j->first)); // find that position if (oi == o.end()) // has not been there before o.insert(make_pair(s + (j->first), j->second)); else { // has been there if (oi->second < j->second) {// we'll help o.erase(s + (j->first)); o.insert(make_pair(s + (j->first), j->second)); } } } } } @ This procedure computes all outputs |o| of all patterns matching the word |w|. The output contains at most one value per position, if there were more values per position, the maximum is output. We go through all the postfixes of the word and collect (in the competitive sense) all the outputs using the previous routine. Only values $<$|ignore_bigger| are collected. It means if a smaller value should be overridden by value $\ge$|ignore_bigger|, the original value is left. If all outputs are needed, suply the |ignore_bigger| value with something bigger than anything in the manipulator. @= public:@/ void competitive_pattern_output(const vector &w, Outputs_of_a_pattern &o, const Tout_alph &ignore_bigger) { o.clear(); Tindex s = 0; // shift for the whole word for (vector::const_iterator i = w.begin(); i != w.end(); i++) { // for all postfixes vector v(i, w.end()); // take the postfix competitive_word_output(v, o, s, ignore_bigger); s++; } } @* Simple translation service. This service is meant usually for testing purposes. @(ptl_sts.h@>= #ifndef PTL_STS_H #define PTL_STS_H #include #include @@; #endif @ Simple translation service. We map the application used values into unsigned numbers for the purpose of manipulating patterns. Therefore we provide the translation service. The service is initialized with a set of external symbols and it builds a bijection from the set into the numbers $1,\ldots,n$, where $n$ is the number of symbols used. The bijection is internally stored as array of external type values and as |map| for the inverse. The mapping into external value is computed in constant time, the inversion is computed in at most logarithmic time to the number of used symbols. Note that this time may be optimized if we know the external type. The |Texternal| is type of external values, this type must be able to be |map|ped. @f Texternal int @= template@/ class Simple_translation_service { public:@/ typedef unsigned Tinternal; protected:@/ Texternal *to_external; map to_internal; Tinternal last_used_internal; @@; @@; @@; @@; }; @ The constructor of |Simple_translation_service| builds up the translation and the inverse translation tables. The parameter is a |set| of external values. The destructor cleans up used storage. @= public:@/ Simple_translation_service(const set& ext_set) { to_internal.clear(); last_used_internal = 0; to_external = new Texternal[ext_set.size() + 1]; for (set::const_iterator i = ext_set.begin(); i != ext_set.end(); i++) { last_used_internal++; to_external[last_used_internal] = *i; to_internal.insert(make_pair(*i, last_used_internal)); } } @# virtual ~Simple_translation_service() { delete[] to_external; } @ The user may want to know the last used internal code. @= public:@/ inline virtual Tinternal get_last_used_internal(void) const { return last_used_internal; } @ The |internal| function returns internal code of external value. Note that no range checks are done for the reason of speed. This function is nevertheless logarithmic to the number of internal values used so we do not want to slow it down more. The second function is overloaded for handling vectors. @= public:@/ inline virtual Tinternal internal(const Texternal& e) const { map::const_iterator it = to_internal.find(e); return (*it).second; } @# inline virtual vector internal(const vector& ve) const { vector vi; for(vector::const_iterator i = ve.begin(); i != ve.end(); i++) { vi.push_back(internal(*i)); } return vi; } @ This function returns external values for internal ones. It may be used for single values or for vectors. No range checks are performed for efficiency reasons. @= public:@/ inline virtual Texternal external(const Tinternal& i) const { return to_external[i]; } @# inline virtual vector external(const vector& vi) const { vector ve; for(vector::const_iterator i = vi.begin(); i != vi.end(); i++) { ve.push_back(external(*i)); } return ve; } @** The generator companion. The generator suite fixes the pattern generating strategy, nevertheless most of the code may be reused in many cases. Terminological note: we speak about ``hyphenation'', ``hyphenating points'' and so on. We are convinced this makes the code easier to read, although the better terms would speak about ``points of interest'' we want to find in our data. This would be quite hard to type and hard to say, we think. Moreover it sounds terribly. @ We start with the Hyphenable word, |Hword| in short. We think about it as a fill-in form containing the input word in the beginning together with its hyphenation information (information on correct points of interest, if gentle reader wants to) which is later filled with the results of pattern application on the word. The odd hyphenation values allow hyphenation, the even values disallow. @ The patterns are generated in series of passes through the input file. In each pass, we deal with pattern candidates of certain length and certain hyphen position. We collect statistics for that type of patterns taking into account the effect of patterns collected in previous passes. At the end of a pass the candidates are checked and new patterns are added to the pattern set. The |left_hyphen_min| and |right_hyphen_min| values say the left and right border of a word where hyphenating is ignored. Pattern candidates are selected one level at a time, in order of increasing hyphenating value. Pattern candidates at each level are chosen in order of increasing pattern length. Candidates at each level applying to different intersymbol positions are chosen in separate passes through the input data file. The selection of patterns out of candidates is controlled by the |good_wt|, |bad_wt|, and |thresh| parameters. A pattern is selected if |good * good_wt - bad * bad_wt >= thresh|, where |good| and |bad| are the numbers of times the pattern is/is not good at the particular point. For inhibiting levels, |good| is the number of errors inhibited, and |bad| is the number of previously found (good) hyphens inhibited. @ The generator consists of three parts: the Pass, the Level and the Generator. The Pass is the basic unit of the pattern generating process. It reads the list of input words and chooses the candidate patterns with certain length and dot position. At the end of a pass the candidates are collected, it means the good ones are inserted into the pattern structure, the bad ones are inserted too, but with special ``hyphenation information'', to make them affect the following passes of that level and to be able to delete them later. The Level generates passes for the current level. It reads the pattern length range and for all the pattern lengths and all dot positions there creates a Pass. At the end of a Level the bad patterns are deleted. The Generator creates Levels. It reads the level range and for all the levels creates a Level. Before beginning the generating process the list of patterns may be read in. It makes step-wise generating of the levels possible, as the runs are quite time-consuming. In the end of the generating process the input word list may be hyphenated with the set of the patterns just created. It allows the user to see the work of them. @ Now we must say something about the input and output. The generator reads input data file (Word input file), writes hyphenated word list (Word output file), reads patterns (Pattern input file), and writes patterns (Pattern output file). All the services use the Translate. The Translate reads the translate file which is used to define input transformation on input file and the alphabet. The Translate is constructed with a file name of the translate file. The files get the reference to the Translate and the file name to be constructed. The interface between the Translate and the file object is their internal problem, the parts of the generator never touch the Translate. The only exception is that Translate must provide methods |get_max_in_alph|, |get_right_hyphen_min|, and |get_left_hyphen_min|. The values obtained are needed to construct the pattern storage. The origin of that solution comes from OPATGEN, where the values depend on settings in the translate file. The input files must provide the |bool get(THword &w)| operation returning |true| if the |Hword| has been successfully read and |false| at the end of the file. The output files must provide the |put(THword &w)| operation to write the |Hword| into the file. @* Hword. The |Hword| (as hyphenated/hyphenable word) is a sequence of characters together with the hyphenation information. Let us define the special information. @(ptl_hwrd.h@>= #ifndef PTL_HWRD_H #define PTL_HWRD_H #include #include #include "ptl_ga.h" @@; #endif @ The first thing we want to know is the type of the hyphen point. The values in |Thyf_type| mean (in order of appearance) no hyphen is here, this is an erroneous hyphen, this is a correct one and this is the correct one but found sometimes in the past. Those values are defined as a part of this object to ensure consistency. The |Hword| itself is a collection of the sequence of characters and the hyphenation information. The hyphenation information is stored in the following public-accessible growing arrays: \item{$\bullet$} |dots| is the hyphen type \item{$\bullet$} |dotw| is the dot weight of |Twt_type| \item{$\bullet$} |hval| is the hyphenation value (simply the level number) of |Tval_type| \item{$\bullet$} |no_more| is the flag indicating the ``knocked-out'' position The remaining template types are the word alphabet (|Tin_alph|) and the |Tindex| type for indexing the growing arrays (|unsigned| seems to be a good choice). Please follow the convention that the values applying between the characters |word[i]| and |word[i+1]| is stored in |dots[i]|, |dotw[i]| and so on. The fields are indexed from zero, except the word itself, which is indexed from one. This makes storing information about the position left to the leftmost words character possible. @f Hword int @f Tindex int @f Tin_alph int @f Twt_type int @f Tval_type int @f Thyf_type int @= template@/ class Hword:@/ public std::vector { public:@/ typedef enum { no_hyf, err_hyf, is_hyf, found_hyf } Thyf_type; @# Growing_array dots; Growing_array dotw; Growing_array hval; Growing_array no_more; @# @@; @@; @@; #ifdef DEBUG @ //Test code! #endif }; @ In the constructor we set up the values for embedded objects. The constructor of the word is called and the default values to return when accessing previously unacessed field of the arrays are set as follows. The |dots| defaults to |no_hyf|, the |dotw| should be usually 1 and the |hval| should be zero. The positions are not knocked out, so |false|. Nevertheless we provide the constructor with parameters, where the user may set the default values. @= Hword(): std::vector(),@/ dots(no_hyf), dotw(1), hval(0), no_more(false) { } @# Hword(const Thyf_type &s, const Twt_type &w, const Tval_type &l, const char &e): std::vector(),@/ dots(s), dotw(w), hval(l), no_more(e) { } @ Operator |[]|. We want the |Hword| to be indexed from~1. @= public:@/ inline Tin_alph &operator[](const Tindex &i) { return std::vector::operator[](i - 1); } @ If the |Hword| is cleared, we must also clear the associated information. @= public:@/ void clear(void) { std::vector::clear(); // clear the word itself dots.clear(); dotw.clear(); hval.clear(); no_more.clear(); } @ Testing code---writing the word content. @= void print(void) { cout<<"Hword"; for (std::vector::iterator i = this->begin(); i != this->end(); i++) cout<<" "<<*i; cout<size(); i++) cout<<" "<size(); i++) cout<<" "<size(); i++) cout<<" "<size(); i++) if (no_more[i]) cout<<" t"; else cout<<" f"; cout<= #ifndef PTL_GEN_H #define PTL_GEN_H #include #include #include #include "ptl_exc.h" #include "ptl_mopm.h" #include "ptl_hwrd.h" @@; @@; @@; #endif @* Pass. The pass is the basic unit of pattern generating process. We go through the input data and collect candidates of one length and one hyphenating position. After that we choose good candidates and make them patterns. @= template @/ class Pass { @@; @@; @# @@; @@; @@; @@; @@; @@; @@; }; @ A number (mess, if you want to) of values follows. @= protected:@/ TTranslate &translate; // the translate service const Tval_type hyph_level; // current hyphenating symbol const Tval_type hopeless_hyph_val; // fake number for bad patterns const Tindex left_hyphen_min; // ignore leftmost part of words const Tindex right_hyphen_min; // the same for right const Tindex pat_len; // length of candidates we're dealing with const Tindex pat_dot; // current position const Tcount_type good_wt, bad_wt, thresh; // pattern choosing weights TCompetitive_multi_out_pat_manip &patterns; // patterns @# Tcount_type good_count, bad_count, miss_count; // statistics, hyphen counts TCandidate_count_structure candidates; // candidates we collect @# TWord_input_file word_input; // the file we read @# Tindex hyf_min, hyf_max, hyf_len; // see constructor for explanation Tindex dot_min, dot_max, dot_len; // see constructor for explanation typename THword::Thyf_type good_dot, bad_dot; // see constructor for explanation @ We simply set up the values. Moreover---to avoid unnecessary tests---the variables |hyf_min|, |hyf_max|, and |hyf_len| are set up such that only positions from |hyf_min| up to word length minus |hyf_max| of the word need to be checked. The words with length $<$|hyf_len| need not be checked at all. The |dot_min|, |dot_max|, and |dot_len| values are analogous and they mean limits of legal dots. The |good_dot| and |bad_dot| values depend on hyphenation level and are used in |do_word| routine. @= public:@/ Pass(TTranslate &tra, const char *i_d_f_n,@/ const Tval_type &l, const Tval_type &h,@/ const Tindex &lhm, const Tindex &rhm,@/ const Tindex &p_l, const Tindex &p_d,@/ const Twt_type &g_w, const Twt_type &b_w, const Twt_type &t,@/ TCompetitive_multi_out_pat_manip &pat): translate(tra),@/ hyph_level(l), hopeless_hyph_val(h),@/ left_hyphen_min(lhm), right_hyphen_min(rhm),@/ pat_len(p_l), pat_dot(p_d),@/ good_wt(g_w), bad_wt(b_w), thresh(t),@/ patterns(pat),@# good_count(0), bad_count(0), miss_count(0),@/ candidates(patterns.get_max_in_alph(), 0, 0),@/ word_input(translate, i_d_f_n) { hyf_min = left_hyphen_min + 1; hyf_max = right_hyphen_min + 1; hyf_len = hyf_min + hyf_max;@# dot_min = pat_dot; dot_max = pat_len - pat_dot; if (dot_min < hyf_min) dot_min = hyf_min; if (dot_max < hyf_max) dot_max = hyf_max; dot_len = dot_min + dot_max;@# if (hyph_level % 2 == 1) { good_dot = THword::is_hyf; bad_dot = THword::no_hyf; } else { good_dot = THword::err_hyf; bad_dot = THword::found_hyf; } } @ This procedure uses current patterns to hyphenate a word. The hyphenation value applying between the characters |word[i]| and |word[i+1]| is stored in |hval[i]|. The bad patterns at this level are ignored. Moreover, |no_more[i]| is set to |true| iff the position is ``knocked out'' by either good or bad pattern at this level. It means, the pattern with current length and hyphen position is superstring of a pattern at this level. In that case we don't have to collect count statistics for that pattern as it can't be chosen in this pass. In addition, there is no need to insert that pattern into the count repository. Note that we have to hyphenate again to get the bad patterns of this level into account. @= public:@/ void hyphenate(THword &w) { TOutputs_of_a_pattern o; typename TOutputs_of_a_pattern::iterator i; patterns.competitive_pattern_output(w, o, hopeless_hyph_val); for (i = o.begin(); i != o.end(); i++) { // go through the output w.hval[i->first] = i->second; // copy it into |w| } patterns.competitive_pattern_output(w, o, hopeless_hyph_val + 1); // now compute the output including the bad values for (i = o.begin(); i != o.end(); i++) { // go through the output if (i->second >= hyph_level && i->first >= dot_min && i->first <= w.size() - dot_max) { vector subw; for (Tindex j = i->first + 1 - pat_dot; j <= i->first + pat_len - pat_dot; j++) { subw.push_back(w[j]); } TOutputs_of_a_pattern subwo; patterns.competitive_pattern_output(subw, subwo, hopeless_hyph_val + 1); typename TOutputs_of_a_pattern::iterator val_on_pat_dot = subwo.find(pat_dot); /* The value has been obtained by competitive outputting, it means there is at most one value on each position. */ if (val_on_pat_dot != subwo.end()) if (val_on_pat_dot->second >= hyph_level) w.no_more[i->first] = true; } } } @ The |change_dots| procedure updates the meaning of hyphen values. Initially the hyphens in the word list are represented by |is_hyf| and non-hyphen positions by |no_hyf|. Now we change this values according to the hyphens found by current patterns. So we change |no_hyf| into |err_hyf| and |is_hyf| into |found_hyf|. Moreover we collect statistics about the number of good, bad and missed hyphens. @= public:@/ void change_dots(THword &w) { for (Tindex i = w.size() - hyf_max; i >= hyf_min; i--) { if (w.hval[i] % 2 == 1) { if (w.dots[i] == THword::no_hyf) w.dots[i] = THword::err_hyf; else if (w.dots[i] == THword::is_hyf) w.dots[i] = THword::found_hyf; } @# if (w.dots[i] == THword::found_hyf) good_count += w.dotw[i]; else if (w.dots[i] == THword::err_hyf) bad_count += w.dotw[i]; else if (w.dots[i] == THword::is_hyf) miss_count += w.dotw[i]; } } @ For each dot position of current word, we first check if we need to consider it. It might be knocked out or we don't care about. For example, when considering hyphenating patterns, there's no need to count hyphens already found. If a relevant dot is found, we increment the counts in the candidate store (which inserts first if necessary). (Why does cweave think |fpos| is a type? We must tell explicitly it is not.) @f fpos dpos @= protected:@/ void do_word(THword &w) { for (Tindex dpos = w.size() - dot_max; dpos >= dot_min; dpos--) { @@; Tindex spos = dpos - pat_dot; // compute the subword range Tindex fpos = spos + pat_len; spos++; vector subw; // compute the subword for (Tindex i = spos; i <= fpos; i++) subw.push_back(w[i]); if (w.dots[dpos] == good_dot) { candidates.increment_counts(subw, w.dotw[dpos], 0); // good } else { candidates.increment_counts(subw, 0, w.dotw[dpos]); // bad } } } @ If the dot position is knocked out or a ``do not care'', we skip this position. @= if (w.no_more[dpos]) continue; if ((w.dots[dpos] != good_dot) && (w.dots[dpos] != bad_dot)) continue; @ Printing pass statistics. @= public:@/ void print_pass_statistics(void) { cout< 0) { cout<<100.0 * good_count / float(good_count + miss_count)<<" % "; cout<<100.0 * bad_count / float(good_count + miss_count)<<" % "; cout<<100.0 * miss_count / float(good_count + miss_count)<<" % "<= protected:@/ void do_dictionary(void) { THword w; while(word_input.get(w)) { // process words until end of file if (w.size() >= hyf_len) { // don't hyphenate short words hyphenate(w); change_dots(w); } if (w.size() >= dot_len) do_word(w); // process if reasonable }@# #ifdef DEBUG // test code TOutputs_of_a_pattern o; vector word; patterns.init_walk_through(); cout<<"Patterns in the pattern manipulator:"<::iterator i = word.begin(); i!=word.end(); i++) cout<<*i<<" "; cout<<"... has "; for (typename TOutputs_of_a_pattern::const_iterator i = o.begin(); i!=o.end(); i++) cout<<"("<first<<","<second<<") "; cout< vect; Tcount_type a; Tcount_type b; candidates.init_walk_through(); cout<<"Candidates:"<::iterator i = vect.begin(); i != vect.end(); i++) cout<<*i<<" "; cout<<"with g,b: "<= protected:@/ bool collect_candidates(Tcount_type &level_pattern_count) { cout<<"Collecting candidates"< w; // word Tcount_type g, b; // good and bad from pattern Tcount_type good_pat_count = 0; Tcount_type bad_pat_count = 0; // numbers of patterns added at the end of a pass candidates.init_walk_through(); while (candidates.get_next_pattern(w, g, b)) { if (good_wt * g < thresh) { // hopeless pattern patterns.insert_pattern(w, pat_dot, hopeless_hyph_val); bad_pat_count++; } else { /* now we test |good_wt * g - bad_wt * b >= thresh| but we may not deal with negative numbers (it took me four hours to find) */ if (good_wt * g >= thresh + bad_wt * b) { // good pattern patterns.insert_pattern(w, pat_dot, hyph_level, 1); // we may erase the previous outputs good_pat_count++; good_count += g; bad_count += b; } else more_to_come = true; } }@# cout< 0) { cout<<"efficiency = " <= public:@/ bool do_all(Tcount_type &level_pattern_count) { cout<>pat_start; cin>>pat_finish; if (pat_start < 1 || pat_start > pat_finish) { cout<<"Specify two integers satisfying 1<=pat_start<=pat_finish "; pat_start = 0; } } while (pat_start < 1);@# do { // read the weights cout<<"good weight, bad weight, threshold: "; cin>>good_wt; cin>>bad_wt; cin>>thresh; if (good_wt < 1 || bad_wt < 1 || thresh <1) { cout<<"Specify three integers: good weight, bad weight, threshold>=1 "; good_wt = 0; } } while (good_wt < 1); } @ In a given level, the patterns of given length are generated with dot positions in an ``organ-pipe'' fashion. For example, for |pat_len == 4| we choose patterns for different positions in the order 2, 1, 3, 0, 4. For all positions passes are generated. The array |more_this_level| remembers which positions are permanently ``knocked out'', this is, if there are no possible good patterns at that dot position, we do not have to consider longer patterns at this level containing that position. At the end of a level the bad patterns are deleted. @= public:@/ void do_all(void) { cout< more_this_level(true); for (Tindex pat_len = pat_start; pat_len <= pat_finish; pat_len++) { // for all pattern lengths Tindex pat_dot = pat_len / 2; Tindex dot1 = pat_dot * 2; do { // for all positions pat_dot = dot1 - pat_dot; dot1 = pat_len * 2 - dot1 - 1; if (more_this_level[pat_dot]) { // which are not knocked out TPass pass(translate, word_input_file_name, hyph_level, hopeless_hyph_val, left_hyphen_min, right_hyphen_min, pat_len, pat_dot, good_wt, bad_wt, thresh, patterns); more_this_level[pat_dot] = pass.do_all(level_pattern_count); } } while (pat_dot != pat_len);@# for (Tindex k = more_this_level.size(); k >= 1; k--) if (more_this_level[k - 1] != true) more_this_level[k] = false; } // OK, we have generated @#@; Tindex old_p_c = patterns.get_pat_count(); patterns.delete_values(hopeless_hyph_val); // now delete bad patterns cout<= template @/ class Generator { @@; @@; @@; @@; @@; @@; }; @ The generator needs to know the hyphenation level range, multiple output pattern store, and values |left_hyphen_min| and |right_hyphen_min| telling the bounds where hyphenation should be ignored. @= protected:@/ TTranslate translate; const char *name; const char *word_input_file_name; const char *pattern_input_file_name; const char *pattern_output_file_name; TCompetitive_multi_out_pat_manip patterns; Tval_type hyph_start, hyph_finish; Tindex left_hyphen_min, right_hyphen_min; @ In the constructor we create the translate service, pattern structure, set the values, and read the hyphenation level range. @= public:@/ Generator(const char *dic, const char *pat, const char *out, const char *tra): translate(tra), word_input_file_name(dic), pattern_input_file_name(pat), pattern_output_file_name(out), patterns(translate.get_max_in_alph()), left_hyphen_min(translate.get_left_hyphen_min()), right_hyphen_min(translate.get_right_hyphen_min()) { do { // read the level range cout<<"hyph_start, hyph_finish: "; cin>>hyph_start; cin>>hyph_finish; if ((hyph_start < 1) || (hyph_finish < 1)) { hyph_start = 0; cout<<"Specify two integers satisfying 1<=hyph_start, hyph_finish " <= void read_patterns(void) { vector v; TOutputs_of_a_pattern o; TPattern_input_file file(translate, pattern_input_file_name); /* open the file */ while (file.get(v, o)) { // read the file if (v.size() > 0) { // avoid strange things for (typename TOutputs_of_a_pattern::iterator i = o.begin(); i != o.end(); i++) { // go through the outputs if (i->second >= hyph_start) { throw Patlib_error("! The patterns to be read in contain "@/ "hyphenation value bigger than hyph_start."); } patterns.insert_pattern(v, i->first, i->second); } } } } @ At the very end of work we output the patterns. @= void output_patterns(void) { TPattern_output_file file(translate, pattern_output_file_name); vector v; TOutputs_of_a_pattern o; patterns.init_walk_through(); while (patterns.get_next_pattern(v, o)) { file.put(v, o); } } @ At the very end of a pass the word list may be hyphenated with the patterns just made. So we prepare the input and output file. The pass we use to hyphenate must have |left_hyphen_min| and |right_hyphen_min| set to correct values. The only problem is with the level number. Think about a special kind of usage to hyphenate a word list with a set of patterns. This can be simulated setting the level range like 6,~1, causing the levels itselves to be skipped. Therefore we take normally as the level number the last level we did. In the special case where the |hyph_finish| is less than |hyph_start| we increase the value to the bigger one. (This feature has been added for Petr Sojka.) The rest of values is clean or makes no effect to the |hyphenate| and |change_dots| process. In the end, the statistics of hyphens found, wrong, and missed is printed. The output file is named ``pattmp.n'', where ``n'' is the last level number. If bigger than 999, sorry MS~DOS users :-).@^MS DOS@> @= public:@/ void hyphenate_word_list(void) { string s; cout<<"hyphenate word list ? "; cin>>s; if (!(s == "y" || s == "Y")) return; Tval_type level_value = hyph_finish; if (hyph_start > hyph_finish) level_value = hyph_start; Tval_type fake_level_value = 2 * ((level_value / 2) + 1); char file_name[100]; // Is there a less stupid way to do this? sprintf(file_name, "pattmp.%d", level_value); cout<<"Writing file "< 2) { pass.hyphenate(w); pass.change_dots(w); } o_f.put(w); } pass.print_pass_statistics(); } @ The |hopeless_fake_number| is a value which is even and greater than any hyphenating value used. So we compute it and create levels. In the end the outputs are output and the word list is hyphenated. @= public:@/ void do_all(void) { read_patterns(); // read in patterns cout<