I maintain Perl package Cstools that contains conversion program cstocs and sorting utility cssort that implements Czech four-pass collation approximating Czech standard. Functions can of course be called directly from Perl. According to my knowledge, this is the only maintained distribution of cstocs.
PostgreSQL, at least until version 8.0, has rather weak support for various collating sequences. What you get when you do
select * from table order by column
is hardcoded in the database cluster at the initdb time.
Yet, it is reasonable request to want one select to order by using
English rules, another one to run with German rules and yet another
with Czech ones, without having to dump, initdb, restore.
The distribution
postgresql-nls-string
(version 8.02, for 8.0+)
defines functions nls_string and nls_value
which allow collation to be set at runtime:
select * from table order by nls_string(name, 'en_US.UTF-8')
select * from table order by nls_value(name, 'cs_CZ.UTF-8')
select * from table order by nls_string(name, 'C')
The nls_value returns bytea while nls_string
makes the result octal-encoded string.
Please, read the README and INSTALL files included in the distribution. If you use PostgreSQL server version 7.4, download postgresql-nls-string 0.53.
Current versions of MySQL (3.23+) contain support for Czech collation approximating Czech standard. The support has to be compiled in the server.
To set the Czech collation as the default one, compile the server
with ./configure's option --with-charset=czech.
In this case, all sortings on character columns will use Czech rules.
When a compile-time option
./configure --with-extra-charsets=all is used, server will
support multiple character sets and collations and actual variant can
be set upon server startup, with run-time parameter
--default-character-set=czech. The default is again given
by the parameter --with-charset.
When you change the character set used by the server, indexes have to
be regenerated, see
Chapter
The Character Set Used for Data and Sorting in MySQL documentation.
Server parameters can also be specified in configuration file, typically
in /etc/my.cnf, using
[mysqld] default-character-set=czech
This Czech collation table implements case sensitive order of letters. MySQL manual talks about case insensitivity but that only holds in the default (Latin1) situation.
Sorting uses the ISO-8859-2 character set. It your data on the client
side is in
character set Windows-1250 (often, people will realize this when
words with letters š and ž get sorted incorrectly
— ISO-8859-2 and Windows-1250 are similar but not exactly the
same), on-line translation of character sets between the server and
the client can be set. Server has to have this feature compiled in,
the easiest way is to remove comment in file sql/convert.cc
/* #define DEFINE_ALL_CHARACTER_SETS */
before compilation. Then, in the client, issue command
SET CHARACTER SET cp1250_latin2
Client will work with data in Windows-1250 and server will store it in ISO-8859-2.
The MySQL distribution contains message catalogue translated to Czech,
translated messages will get turned on at server start-up with parameter
--language=czech.
The following
distribution
contains file
strings/ctype-win1250ch.c,
that implements simpler two-pass sorting similar to the Czech one. In
this collation, ch is sorted correctly but the primary ordering is case
insensitive and is in the Windows-1250 character set. Included in
the distribution are also patches of the
Configure.in and sql/share/charsets/Index.
The following distribution contains support for ordering based on UCA algorithm. Included are two algorithms, case sensitive and case insensitive.
Both (collation) character set win1250ch and utf8ad should also be coming included with all current versions of MySQL by now.
I wrote functions that can be used in MySQL to convert text to plain ASCII, and conversions between ISO-8859-2 and Windows-1250. The distribution is called udf_charsets, and also contains README with installation instructions.
I wrote an implementation of function strxfrm
that converts Czech (ISO-8859-2) text to sequence that can be compared
using strcmp. The conversion is defined in such a way
that it as closely matches Czech standard (ČSN 97 6030) and it's
interpretation by Petr Olšák. If you have some problem with the result,
for example you are not happy with numbers ordered only after
letters z and ž, read the standard first. The same algorithm is
used in the czech character set in MySQL and in module
Cz::Sort.
The file also contains function strcoll that compares
two strings without a need of previous conversion, in constant
memory. The file is translated using cc -c -o csort.o csort.c
and using
ld -shared -o csort.so csort.o we turn it into a shared
library. We then use it for example by setting environment variable
export LD_PRELOAD=/cesta/k/csort.so which ensures that
instead of the default strxfrm and strcoll
the Czech ones will be used.
Included is also the definition table used in generating the code of the functions, Perl script for conversion to C source and also examples of correct Czech ordering.
I played with X locales on IRIXu 6.2, the result is file XLC_LOCALE, that belongs to /usr/lib/X11/locale/iso8859-2. That fixes behavior of netscape 3.04 in forms (see cstest.html) where otherwise ISO-8859-1 fonts are used, in fixed cut and paster, xman, gimp — to name only those that otherwise cause problems. Of course, the basic locales and locale.alias and locale.dir have to be correct first.