Tue, 30 May 2006
MIME::Words and UTF-8
We use the MIME::Words package from CPAN to encode and decode RFC 1522-style e-mail headers (texts like =?UTF-8?Q?something=20something?=).
A long time ago I found that this package had a bug: when encoding two
adjacent words, the whitespace between them must be included in the first or the second
encoded word, because whitespace between two adjacent encoded words is discarded
during decoding. While moving our system to UTF-8, I decided to install
a new MIME::Words module, and I wondered whether this bug had been fixed.
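To illustrate the whitespace rule described above, here is a minimal sketch. It is not the MIME::Words code and not the patch: q_encode_word() and encode_adjacent() are hypothetical helpers written only for this example, and decoding is delegated to the MIME-Header decoder from the core Encode module, on the assumption that it drops whitespace between adjacent encoded words as the RFC requires.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Hypothetical helper: Q-encode one chunk of text as a single
# encoded word in UTF-8. Everything except ASCII letters and
# digits is escaped as =XX, so a space becomes =20.
sub q_encode_word {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9])/sprintf('=%02X', ord($1))/ge;
    return "=?UTF-8?Q?$bytes?=";
}

# Hypothetical helper: encode a list of words so that the
# separating space travels inside the preceding encoded word.
sub encode_adjacent {
    my @words = @_;
    my @out;
    for my $i (0 .. $#words) {
        my $chunk = $words[$i];
        $chunk .= ' ' if $i < $#words;   # keep the space inside the word
        push @out, q_encode_word($chunk);
    }
    return join ' ', @out;
}

my @words = ('maličký', 'ježeček');

# Naive variant: each word encoded separately, space left outside.
my $naive = join ' ', map { q_encode_word($_) } @words;

# Variant with the space moved into the first encoded word.
my $fixed = encode_adjacent(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "$naive\n  decodes to: ", decode('MIME-Header', $naive), "\n";
print "$fixed\n  decodes to: ", decode('MIME-Header', $fixed), "\n";
```

With the naive variant the two words should come back fused together, while the second variant should keep the space, because the space is part of the encoded data rather than a separator between encoded words.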
In the manpage, they wrote:
It does not comply with the RFC-1522 rules regarding the use of encoded words in message headers. You may want to roll your own variant, using encode_mimeword(), for your application. Thanks to Jan Kasprzak for reminding me about this problem.
So they did not fix the problem I reported 3-5 years ago; they just acknowledged its existence (even mentioning my name :-). The module also does not handle multi-byte characters (in UTF-8 strings) correctly, and defaults to the ISO-8859-1 encoding instead.
I decided to fix this module, solving both the problem of two adjacent encoded words and the problems of encoding from and decoding to multibyte strings. Here is the patch for MIME::Words and UTF-8. Hopefully they will apply it soon.
9 replies for this story:
adelton wrote: Encode.pm
I'd say that Encode with the encoding 'MIME-Header' is my favorite way to go nowadays.
Yenya wrote: Re: Encode.pm
Hmm, it apparently works, but the Encode.pm manpage does not say a word about it. Moreover, it encodes the whole string as UTF-8 Base-64, which misses the whole point (encoding only the non-ASCII parts). So you cannot do Encode::encode("MIME-Header", $the_whole_message_headers_part);
Yenya wrote: Encode.pm
It apparently has broken line wrapping: it produces lines > 75 chars, and when the encoded line is slightly over 80 chars, the result is "newline-space-encodedword-newline-space-encodedword" instead of "encodedword-newline-space-encodedword".
adelton wrote: ASCII parts
perl -CSAD -MEncode -e 'use utf8; print Encode::encode("MIME-Header", "Krtek Simonsen"), "\n";'
Krtek Simonsen
So for ASCII-only data, it gives you that string back. As for the Base-64, you can use MIME-Q to have the string more readable:
perl -CSAD -MEncode -e 'use utf8; print Encode::encode("MIME-Q", "krteček"), "\n";'
=?UTF-8?Q?krte=C4=8Dek?=
adelton wrote: Formatting
Blueeehee -- how do you post a snippet of code here?
Yenya wrote: Re: Formatting
Try encoding "maličký ježeček a jeste nejaky ascii text" - it gets encoded to Base64 as a whole. With MIME-Q it is even worse.
adelton wrote: As for the folding ...
As for the folding, yes, MIME-B does start with newline+space, while MIME-Q does not. Therefore MIME-Q is nicer. ;-) Anyway, since Encode is in the standard Perl distribution, I think that this module should be used whenever it is good enough for the task at hand. So maybe patching Encode::MIME::Header (instead of MIME::Words) is more useful?
adelton wrote: Patch for Encode::MIME::Header
Does this change to $re_especials look reasonable? How is it with wrapping unencoded words?
my $re_especials = qr{$re_encoded_word|(?:\ +|^)[\x00-\x7f]+(?:\ +|$)|$especials}xo;
Yenya wrote: Re: Patch for Encode::MIME::Header
I will test it (hopefully) soon; sorry for the delay.