I18n, localization, and character set projects

Česká stránka

Cstools

I maintain Perl package Cstools that contains conversion program cstocs and sorting utility cssort that implements Czech four-pass collation approximating Czech standard. Functions can of course be called directly from Perl. According to my knowledge, this is the only maintained distribution of cstocs.

Arbitrary collation support for PostgreSQL

PostgreSQL, at least until version 8.0, has rather weak support for various collating sequences. What you get when you do

select * from table order by column

is hardcoded in the database cluster at the initdb time. Yet, it is reasonable request to want one select to order by using English rules, another one to run with German rules and yet another with Czech ones, without having to dump, initdb, restore.

The distribution postgresql-nls-string (version 8.02, for 8.0+) defines functions nls_string and nls_value which allow collation to be set at runtime:

select * from table order by nls_string(name, 'en_US.UTF-8')
select * from table order by nls_value(name, 'cs_CZ.UTF-8')
select * from table order by nls_string(name, 'C')

The nls_value returns bytea while nls_string makes the result octal-encoded string.

Please, read the README and INSTALL files included in the distribution. If you use PostgreSQL server version 7.4, download postgresql-nls-string 0.53.

Czech support in MySQL

Czech collation support for MySQL

Current versions of MySQL (3.23+) contain support for Czech collation approximating Czech standard. The support has to be compiled in the server.

To set the Czech collation as the default one, compile the server with ./configure's option --with-charset=czech. In this case, all sortings on character columns will use Czech rules. When a compile-time option ./configure --with-extra-charsets=all is used, server will support multiple character sets and collations and actual variant can be set upon server startup, with run-time parameter --default-character-set=czech. The default is again given by the parameter --with-charset. When you change the character set used by the server, indexes have to be regenerated, see Chapter The Character Set Used for Data and Sorting in MySQL documentation. Server parameters can also be specified in configuration file, typically in /etc/my.cnf, using

[mysqld]
default-character-set=czech

This Czech collation table implements case sensitive order of letters. MySQL manual talks about case insensitivity but that only holds in the default (Latin1) situation.

Character sets

Sorting uses the ISO-8859-2 character set. It your data on the client side is in character set Windows-1250 (often, people will realize this when words with letters š and ž get sorted incorrectly — ISO-8859-2 and Windows-1250 are similar but not exactly the same), on-line translation of character sets between the server and the client can be set. Server has to have this feature compiled in, the easiest way is to remove comment in file sql/convert.cc

/* #define DEFINE_ALL_CHARACTER_SETS */

before compilation. Then, in the client, issue command

SET CHARACTER SET cp1250_latin2

Client will work with data in Windows-1250 and server will store it in ISO-8859-2.

Server messages

The MySQL distribution contains message catalogue translated to Czech, translated messages will get turned on at server start-up with parameter --language=czech.

Support for half-Czech collation in MySQL in Windows-1250

The following distribution contains file strings/ctype-win1250ch.c, that implements simpler two-pass sorting similar to the Czech one. In this collation, ch is sorted correctly but the primary ordering is case insensitive and is in the Windows-1250 character set. Included in the distribution are also patches of the Configure.in and sql/share/charsets/Index.

Support for nearly complete UCA UTF-8 collation in MySQL

The following distribution contains support for ordering based on UCA algorithm. Included are two algorithms, case sensitive and case insensitive.

Both (collation) character set win1250ch and utf8ad should also be coming included with all current versions of MySQL by now.

Stripping diacritics in MySQL (il2 to ascii)

I wrote functions that can be used in MySQL to convert text to plain ASCII, and conversions between ISO-8859-2 and Windows-1250. The distribution is called udf_charsets, and also contains README with installation instructions.

Czech strxfrm and strcoll implementing four-pass collation

I wrote an implementation of function strxfrm that converts Czech (ISO-8859-2) text to sequence that can be compared using strcmp. The conversion is defined in such a way that it as closely matches Czech standard (ČSN 97 6030) and it's interpretation by Petr Olšák. If you have some problem with the result, for example you are not happy with numbers ordered only after letters z and  ž, read the standard first. The same algorithm is used in the czech character set in MySQL and in module Cz::Sort.

The file also contains function strcoll that compares two strings without a need of previous conversion, in constant memory. The file is translated using cc -c -o csort.o csort.c and using ld -shared -o csort.so csort.o we turn it into a shared library. We then use it for example by setting environment variable export LD_PRELOAD=/cesta/k/csort.so which ensures that instead of the default strxfrm and strcoll the Czech ones will be used.

Included is also the definition table used in generating the code of the functions, Perl script for conversion to C source and also examples of correct Czech ordering.

iso8859-2/XLC_LOCALE for IRIX 6.2

I played with X locales on IRIXu 6.2, the result is file XLC_LOCALE, that belongs to /usr/lib/X11/locale/iso8859-2. That fixes behavior of netscape 3.04 in forms (see cstest.html) where otherwise ISO-8859-1 fonts are used, in fixed cut and paster, xman, gimp — to name only those that otherwise cause problems. Of course, the basic locales and locale.alias and locale.dir have to be correct first.