Hyphenation algorithm for Romanian language words 
                         V. Demidova, T. Verlan


The problem of correct hyphenation of Romanian words has become
urgent because many widely used systems of automated text processing
lack such an algorithm for Romanian.
The algorithm for correct hyphenation of Romanian words
is based on the classical rules for dividing words into
syllables, which in turn are based on the phonetic value of letters.
   The classical rules are based on vowel sequences (simple and complex),
semivowels, and consonants (also simple and complex) located
between two vowels. The algorithm also takes into account exceptions
to these rules, such as the following combinations of
consonants: "bl", "br", "cl", "cr", "fl", "fr", "hl", "pl", "pr",
"tl", "tr", "vl", "vr", "lpt", "mpt", "nct", "ndv", "rtf", "stm",
"ngstr", etc. The rules described above were taken into account as fully as possible.
However, the specific character of the Romanian language does not permit them
to be completely formalized. Vowels present the main difficulty for division
into syllables in Romanian. They can be simple or
complex (so-called semivowels), stressed or unstressed, and the
division rules depend on the category to which a given vowel
belongs. In addition, ambiguity arises from the way in
which different vowel combinations are perceived by ear. When a
word is entered, all that we can find out about it is its
sequence of vowels and consonants. Phonetic information is
not available to us. Therefore, we cannot implement the rules
described above in full. However, many situations
can be resolved, although in a rather artificial way.
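
    The treatment of consonants between two vowels can be illustrated
with a minimal Python sketch. It is not the authors' implementation.
It assumes the usual convention that the two-letter combinations listed
above stay together and open the next syllable, and that the longer
combinations are divided after their second consonant. The names
KEEP_TOGETHER, SPLIT_AFTER_SECOND, cv_pattern and hyphenate are
hypothetical. Adjacent vowels are never divided here, because that
decision needs phonetic information.

```python
VOWELS = set("aeiouăâî")

# Two-consonant groups from the text. Assumption: they are not divided
# and move to the following syllable.
KEEP_TOGETHER = {"bl", "br", "cl", "cr", "fl", "fr", "hl", "pl", "pr",
                 "tl", "tr", "vl", "vr"}

# Longer groups from the text. Assumption: the break falls after the
# second consonant of the group.
SPLIT_AFTER_SECOND = {"lpt", "mpt", "nct", "ndv", "rtf", "stm", "ngstr"}


def cv_pattern(word):
    """Reduce a word to its sequence of vowels (V) and consonants (C)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())


def consonants_kept_left(cluster):
    """Number of consonants of an intervocalic cluster left before the break."""
    if len(cluster) == 1:
        return 0                      # V-CV
    if cluster in SPLIT_AFTER_SECOND:
        return 2                      # e.g. "lpt" -> "lp" + "t"
    if len(cluster) == 2 and cluster in KEEP_TOGETHER:
        return 0                      # e.g. "bl" stays together
    return 1                          # default: break after the first consonant


def hyphenate(word):
    """Insert hyphens between vowels separated by at least one consonant."""
    lower = word.lower()
    vowel_positions = [i for i, ch in enumerate(lower) if ch in VOWELS]
    cuts = []
    for left, right in zip(vowel_positions, vowel_positions[1:]):
        cluster = lower[left + 1:right]
        if not cluster:
            continue                  # adjacent vowels: left undivided
        cuts.append(left + 1 + consonants_kept_left(cluster))
    parts, start = [], 0
    for cut in cuts:
        parts.append(word[start:cut])
        start = cut
    parts.append(word[start:])
    return "-".join(parts)


if __name__ == "__main__":
    for w in ["carte", "tabla", "sculptor", "contra"]:
        print(w, cv_pattern(w), hyphenate(w))
```

    Words with diphthongs or hiatus stay partly undivided in this sketch,
which reflects the limitation described above rather than solving it.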
    The problem of diphthongs and triphthongs is the most difficult one in the
process of dividing Romanian words into syllables.
    Prefixes present another difficult situation. When one of the
combinations "an" or "in" occurs at the beginning of a word and is
followed by a vowel, an ambiguity appears. To avoid the
ambiguity, and primarily taking into account the
division of a word between lines, where a single letter must
not be left on a line, we have decided to reject the first hyphen.
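
    The prefix decision can be sketched in Python as a filter over
candidate break points. This is an illustration under assumptions, not
the authors' code: "the first hyphen" is interpreted here as any break
point inside or directly after the two-letter prefix, and the names
has_ambiguous_prefix, reject_first_hyphen and apply_cuts are
hypothetical.

```python
VOWELS = set("aeiouăâî")
AMBIGUOUS_PREFIXES = ("an", "in")


def has_ambiguous_prefix(word):
    """True if the word starts with "an" or "in" followed by a vowel."""
    w = word.lower()
    return len(w) > 2 and w.startswith(AMBIGUOUS_PREFIXES) and w[2] in VOWELS


def reject_first_hyphen(word, cuts):
    """Drop break points at offsets 1 and 2 for ambiguous prefixes.

    cuts is a sorted list of character offsets where a hyphen may go.
    """
    if has_ambiguous_prefix(word):
        return [c for c in cuts if c > 2]
    return list(cuts)


def apply_cuts(word, cuts):
    """Insert hyphens at the given character offsets."""
    parts, start = [], 0
    for cut in cuts:
        parts.append(word[start:cut])
        start = cut
    parts.append(word[start:])
    return "-".join(parts)


if __name__ == "__main__":
    # Both candidate divisions of the prefix lead to the same output.
    print(apply_cuts("inegal", reject_first_hyphen("inegal", [1, 3])))
    print(apply_cuts("inegal", reject_first_hyphen("inegal", [2, 3])))
```

    With this filter the choice between the two competing divisions of
the prefix no longer matters, and a single letter is never left alone
at the end of a line for these words.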
    Thus, an algorithm for dividing a rather extensive class of
Romanian words into syllables has been obtained. Certainly, we do not
claim completeness. However, testing showed that 70%
of the words in texts from scientific literature and fiction are
divided correctly, which is a substantial result. The algorithm can be
developed further, but including additions requires further
analysis of the database. This is routine work, though it
takes a lot of time.
     Nevertheless, our algorithm correctly processes the main and most
frequently encountered letter combinations. Therefore, it is effective enough.
