Mon, 17 Jul 2006
IS in UTF-8
Our Information System has been running with UTF-8 support, even at the application layer, since Friday. Finally, the work that took up most of my working time is almost finished. Now we are fixing the parts of the system which do not run directly in Apache (cron jobs, etc.), and minor glitches which survived our prior testing.
We do not allow arbitrary characters everywhere, because we must keep some attributes in a form suitable for printing through TeX or for exporting to external systems, most of which are based on ISO 8859-2 or Windows-1250. We do allow almost all Latin-1 and Latin-2 characters in most applications, though.
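A restriction like this can be sketched as a check of whether a value survives encoding into the legacy character set of the system it is exported to. The following is only a minimal illustration in Python, not the code our system uses: the function names and the choice of target encodings are assumptions made for the example, and the actual whitelist is not described in this post.

```python
# Hypothetical sketch: decide whether a string can be exported to a
# legacy 8-bit system. Names and the encoding list are assumptions.

# Python codec names for ISO 8859-2 and Windows-1250.
TARGET_ENCODINGS = ("iso8859_2", "cp1250")


def fits_encoding(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def fits_all(text, encodings=TARGET_ENCODINGS):
    """Return True if text can be encoded in every listed encoding."""
    return all(fits_encoding(text, enc) for enc in encodings)


def rejected_characters(text, encodings=TARGET_ENCODINGS):
    """Return the distinct characters that fail in at least one encoding."""
    return sorted({ch for ch in text if not fits_all(ch, encodings)})


if __name__ == "__main__":
    for value in ("Žluťoučký kůň", "日本語"):
        print(value, fits_all(value), rejected_characters(value))
```

An attribute destined for only one external system would be checked against that system's encoding alone (for example `fits_encoding(value, "cp1250")`), which accepts slightly more than the combined check.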
While it has been hard to convert the whole system to UTF-8, I must say that the UTF-8 support in Perl is well architected (and from what I have read, definitely better than in other scripting languages).
2 replies for this story:
Milan Zamazal wrote: UTF-8 support in Perl
Are you talking only about input/output encoding, or about support for Unicode characters in data in general? If the latter, could you please provide a pointer to where I could see how it is done? I'd like to look at it out of curiosity -- in Python, Unicode support was added in a very stupid way: by introducing a new data type, *different* from string.
Yenya wrote: Re: UTF-8 support in Perl
See section 2 of Adelton's tutorial on Perl: http://www.fi.muni.cz/~adelton/perl/europen2004/tutorial.html. In short, strings in Perl 5.8+ can carry an attribute saying either "this string is composed of characters" or "this string is composed of bytes". So internally strings can be UTF-8 or binary data, with implicit or explicit conversions, re-encoding to different encodings on file input/output, etc. Plus a few minor add-ons like matching Unicode characters in regexps based on (for example) Unicode properties. In most cases, it is "do what I mean".