[Mailman-i18n] Unicode in headers

Martin von Loewis loewis@informatik.hu-berlin.de
Sat, 21 Sep 2002 23:08:36 +0200 (CEST)


>     from email.Header import Header
>     h = Header(u'[P\xf6stal]', 'us-ascii')
>     s = str(h)
[...]
> But I think this may not be the right thing to do.  For one thing,
> we're saying we want the header to be in the us-ascii character set.

I think you are confusing issues here: You are *not* saying that you
want the header to be in us-ascii. Instead, (to quote the docstring)

        Specify both s's character set, and the default character set by
        setting the charset argument to a Charset object 

You need this argument to specify the encoding of the string *you are
passing*, not (primarily) of the resulting Header. Since the argument
is a Unicode string and not a byte string, the encoding argument is
superfluous.

Now, the documentation also says that it uses the argument as the "default
character set". By that, it does *not* mean that the entire header is going
to be encoded in that character set. Instead, it means that this value is
used if later append calls do not declare an encoding.
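
To illustrate the "default character set" point, here is a minimal
sketch. It assumes a current Python, where the module is spelled
email.header rather than email.Header, and the sample strings are made
up for the example. The first chunk sets the default, the second chunk
inherits it, and the third declares its own charset, so the header as a
whole is not in a single character set.

```python
from email.header import Header, decode_header

# The charset given to the constructor describes the first chunk and
# becomes the default for later append() calls that omit a charset.
h = Header(u'P\xf6stal', 'iso-8859-1')
h.append(u'caf\xe9')                  # no charset: the default is used
h.append(u'\u4e16\u754c', 'utf-8')    # explicit charset for this chunk only

# Decode the encoded header again to see which charset each part carries.
decoded = [(raw.decode(cs or 'ascii'), cs)
           for raw, cs in decode_header(h.encode())]
for text, cs in decoded:
    print(cs, repr(text))
```

The printed charsets show that only the chunks appended without an
explicit charset pick up the constructor's value.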

> My proposal is to do a type check in Header.__str__() so that if the
> value of self.encode() returns a unicode string, we will coerce it to
> an 8-bit string like so:

This is evil. You are losing data without any need.

Instead, I propose the following procedure:
- if a Unicode argument is passed to Header.__init__ or Header.append,
  take the encoding only as a hint. As an argument to __init__, also
  record it as the default for later .append calls.
- when encoding the header, encode all Unicode strings with the hint.
  If that fails, encode them as UTF-8.
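
A rough sketch of that procedure, as a standalone model and not a
patch against the email package: HintedHeader and encode_chunk are
hypothetical names invented for this illustration, and the sketch only
picks a charset per chunk; it does not produce RFC 2047 encoded words.

```python
def encode_chunk(text, hint='us-ascii'):
    """Encode a Unicode string with the hinted charset, else UTF-8.

    Returns (encoded_bytes, charset_actually_used).
    """
    try:
        return text.encode(hint), hint
    except UnicodeEncodeError:
        # The hint cannot represent the text; fall back without losing data.
        return text.encode('utf-8'), 'utf-8'


class HintedHeader:
    """Hypothetical model: the charset is a hint plus a default for append."""

    def __init__(self, text, charset='us-ascii'):
        self.default = charset            # recorded for later append() calls
        self.chunks = [(text, charset)]

    def append(self, text, charset=None):
        # Chunks without an explicit charset inherit the recorded default.
        self.chunks.append((text, charset or self.default))

    def encoded_chunks(self):
        # Apply the hint-then-UTF-8 rule to every chunk at encoding time.
        return [encode_chunk(text, cs) for text, cs in self.chunks]


h = HintedHeader(u'[P\xf6stal]', 'us-ascii')
h.append(u'list')
for data, charset in h.encoded_chunks():
    print(charset, data)
```

With the string from the original example, the first chunk cannot be
encoded as us-ascii and comes out as UTF-8, while the plain-ASCII chunk
keeps the hinted charset. Nothing is dropped or replaced.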

Regards,
Martin