Yenya's World

Tue, 30 May 2006

MIME::Words and UTF-8

We use the MIME::Words package from CPAN to encode and decode RFC 1522-style e-mail headers (texts like =?UTF-8?Q?something=20something?=). A long time ago I found a bug in this package: when two adjacent words are both encoded, the whitespace between them has to be included in the first or the second encoded word, because whitespace between two adjacent encoded words is discarded during decoding. While moving our system to UTF-8, I decided to install a newer MIME::Words module, and I wondered whether this bug had been fixed.
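To make the whitespace rule concrete, here is a minimal sketch in Python rather than Perl. It is only an illustration under simplifying assumptions, not code from MIME::Words: it handles Q-encoding only, and the helper names (q_encode, decode_words, encode_naive, encode_fixed) are made up for this example. The naive encoder encodes each non-ASCII word separately, so the space between two encoded neighbours disappears on decoding. The fixed one moves that space into the second encoded word.

```python
import re

# Matches one Q-encoded word such as =?UTF-8?Q?je=C5=BEek?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Qq]\?([^?]*)\?=")


def q_encode(text, charset="UTF-8"):
    """Wrap text in a single Q-encoded word (hypothetical helper)."""
    parts = []
    for byte in text.encode(charset):
        if byte < 128 and chr(byte).isalnum():
            parts.append(chr(byte))          # ASCII letters and digits stay as-is
        elif byte == 0x20:
            parts.append("_")                # a space is written as an underscore
        else:
            parts.append("=%02X" % byte)     # everything else becomes =XX
    return "=?%s?Q?%s?=" % (charset, "".join(parts))


def q_decode(match):
    """Decode one regex match of ENCODED_WORD back to text."""
    charset, payload = match.group(1), match.group(2)
    raw = bytearray()
    i = 0
    while i < len(payload):
        if payload[i] == "=":
            raw.append(int(payload[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(0x20 if payload[i] == "_" else ord(payload[i]))
            i += 1
    return raw.decode(charset)


def decode_words(header):
    """Decode a header; whitespace between adjacent encoded words is dropped."""
    header = re.sub(r"(\?=)\s+(?==\?)", r"\1", header)
    return ENCODED_WORD.sub(q_decode, header)


def needs_encoding(word):
    return any(ord(ch) > 127 for ch in word)


def encode_naive(text):
    """Encode each non-ASCII word on its own (loses inner spaces on decode)."""
    return " ".join(q_encode(w) if needs_encoding(w) else w
                    for w in text.split(" "))


def encode_fixed(text):
    """Like encode_naive, but put the separating space into the second word."""
    out = []
    prev_encoded = False
    for word in text.split(" "):
        if needs_encoding(word):
            out.append(q_encode(" " + word if prev_encoded else word))
            prev_encoded = True
        else:
            out.append(word)
            prev_encoded = False
    return " ".join(out)


sample = "maličký ježeček a jeste nejaky ascii text"
print(decode_words(encode_naive(sample)))  # the first two words run together
print(decode_words(encode_fixed(sample)))  # round-trips to the original
```

The sketch ignores line folding and the length limit on encoded words, which a real implementation also has to handle.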

In the manpage, they wrote:

It does not comply with the RFC-1522 rules regarding the use of encoded words in message headers. You may want to roll your own variant, using encoded_mimeword(), for your application. Thanks to Jan Kasprzak for reminding me about this problem.

So they did not fix the problem I reported 3-5 years ago; they just acknowledged its existence (even mentioning my name :-). The module also does not handle multi-byte characters (in UTF-8 strings) correctly, and it defaults to the ISO-8859-1 encoding instead.
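Why the charset label matters can be shown with a small Python sketch. This is an illustration only, with made-up helper names (q_payload, q_unpayload), and it does not reproduce the exact failure inside MIME::Words. The same Q-encoded bytes decode to the right text when labelled UTF-8, but turn into garbage when they are treated as ISO-8859-1, because each byte of a multi-byte character is then read as a separate character.

```python
def q_payload(raw: bytes) -> str:
    """Q-encode raw bytes: keep ASCII letters and digits, escape the rest."""
    return "".join(
        chr(b) if b < 128 and chr(b).isalnum() else "=%02X" % b
        for b in raw
    )


def q_unpayload(payload: str) -> bytes:
    """Reverse q_payload: turn =XX escapes back into single bytes."""
    raw = bytearray()
    i = 0
    while i < len(payload):
        if payload[i] == "=":
            raw.append(int(payload[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(payload[i]))
            i += 1
    return bytes(raw)


text = "krteček"
payload = q_payload(text.encode("utf-8"))
print(payload)  # krte=C4=8Dek

right = q_unpayload(payload).decode("utf-8")       # correct charset label
wrong = q_unpayload(payload).decode("iso-8859-1")  # wrong default charset

print(right == text)          # True
print(len(text), len(wrong))  # 7 8: one character became two
```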

I decided to fix the module myself, solving both the problem of two adjacent encoded words and the problems of encoding and decoding multi-byte strings. Here is the patch for MIME::Words and UTF-8. Hopefully they will apply it soon.


9 replies for this story:

adelton wrote: Encode.pm

I'd say that Encode with the encoding 'MIME-Header' is my favorite way to go nowadays.

Yenya wrote: Re: Encode.pm

Hmm, it apparently works, but the Encode.pm manpage does not say a word about it. Moreover, it encodes the whole string as UTF-8 Base-64, which misses the whole point (encoding only the non-ASCII parts). So you cannot do Encode::encode("MIME-Header", $the_whole_message_headers_part);

Yenya wrote: Encode.pm

It apparently has broken line wrapping: it produces lines > 75 chars, and when the encoded line is slightly over 80 chars, the result is "newline-space-encodedword-newline-space-encodedword" instead of "encodedword-newline-space-encodedword".

adelton wrote: ASCII parts

perl -CSAD -MEncode -e 'use utf8; print Encode::encode("MIME-Header", "Krtek Simonsen"), "\n";'
Krtek Simonsen

So for ASCII-only data, it gives you that string back. As for the Base-64, you can use MIME-Q to make the string more readable:

perl -CSAD -MEncode -e 'use utf8; print Encode::encode("MIME-Q", "krteček"), "\n";'
=?UTF-8?Q?krte=C4=8Dek?=

adelton wrote: Formatting

Blueeehee -- how do you post a snippet of code here?

Yenya wrote: Re: Formatting

Try encoding "maličký ježeček a jeste nejaky ascii text" - it gets encoded to Base64 as a whole. With MIME-Q it is even worse.

adelton wrote: As for the folding ...

As for the folding, yes, MIME-B does start with newline+space, while MIME-Q does not. Therefore MIME-Q is nicer. ;-) Anyway, since Encode is in the standard Perl distribution, I think that this module should be used whenever it is good enough for the task at hand. So maybe patching Encode::MIME::Header (instead of MIME::Words) is more useful?

adelton wrote: Patch for Encode::MIME::Header

Does this change to $re_especials look reasonable? How is it with wrapping unencoded words?

my $re_especials = qr{$re_encoded_word|(?:\ +|^)[\x00-\x7f]+(?:\ +|$)|$especials}xo;

Yenya wrote: Re: Patch for Encode::MIME::Header

I will test it (hopefully) soon. Sorry for the delay.

About:

Yenya's World: Linux and beyond - Yenya's blog.

Jan "Yenya" Kasprzak
