[Python-3000] email libraries: use byte or unicode strings?

Thu Nov 6 03:09:49 CET 2008

Glenn Linderman writes:
 > On approximately 11/5/2008 2:59 PM, came the following characters from 
 > the keyboard of Andrew McNamara:
 > >> I would find
 > >>
 > >> 	message[b'Subject'] = b'Hello'
 > >>
 > >> to be totally gross.

Indeed.

 > >> Depending on the level of email interface, there should be no interface 
 > >> that cannot be expressed in terms of Unicode, plus an encoding to use 
 > >> for the associated data.  Even 8-bit binary can be translated into a 
 > >> sequence of Unicode codepoints with the same numeric value, for example. 

Also totally gross.  RFC 2821 (SMTP transport) is bytes, RFC 2822
(message format) is Unicode in spirit, even though headers are limited
to ASCII, and RFC 2045-and-the-cast-of-thousands (MIME) interfaces the
two.  We can't really get around
this, IMO.
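
To illustrate the objection to the "same numeric value" mapping, here is
a minimal sketch.  The sample string and variable names are my own
assumptions, not anything from the email module.  It shows that the
mapping (which is just Latin-1 decoding) preserves the bytes but does
not give you the text:

```python
# Hypothetical payload: the sender meant UTF-8, but nothing on the wire says so.
wire = "café".encode("utf-8")            # b'caf\xc3\xa9'

# "Same numeric value" mapping: each byte becomes one code point (Latin-1).
as_code_points = wire.decode("latin-1")

# Lossless as a container for bytes ...
restored = as_code_points.encode("latin-1")

# ... but it is not the text a human wrote; that needs the real charset.
as_text = wire.decode("utf-8")

print(repr(as_code_points), repr(as_text))
```

The str you get back is a byte string in disguise, so every consumer
still has to know which charset was intended.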

 > > One significant problem is that the email module is intended to be
 > > able to work with malformed e-mail without mangling it too badly. The
 > > malformed e-mail should also make a round-trip through the email module
 > > without being further mangled.
 > 
 > This is an interesting perspective... "stuff em" does come to mind :)

Not acceptable in Japan, or anywhere that Microsoft beta products are
used, for that matter.  (At one point Outhouse Excess betas were
sending HTML *with tags in unibyte ASCII and element content in
little-endian UTF-16*.)

 > But I'm not at all clear on what you mean by a round-trip through the 
 > email module.

Bounce messages, for example.
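
A rough sketch of what such a round trip looks like, using the bytes
entry points that the email package gained later (email.message_from_bytes
and Message.as_bytes, Python 3.2 and up, so not something available when
this was written).  The sample message is hypothetical: raw 8-bit junk in
the Subject and body, and no charset declared anywhere.

```python
import email

# Hypothetical malformed message: undeclared 8-bit bytes in header and body.
RAW = (
    b"From: sender@example.org\n"
    b"To: list@example.org\n"
    b"Subject: caf\xe9 \x82\xb1\x82\xf1\n"
    b"\n"
    b"body with stray bytes: \xff\xfe\n"
)

# Parse with the default (compat32) policy; undecodable bytes are carried
# along internally rather than rejected or replaced.
msg = email.message_from_bytes(RAW)

# Serialize again, e.g. to embed the original in a bounce.
round_tripped = msg.as_bytes()

print(round_tripped == RAW)
```

The point is only that the parser never had to decide what the bytes
meant in order to hand them back unchanged.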

 > I guess I'm not terribly concerned about the readability of improperly 
 > encoded email messages, whether they are spam or ham.

I'm fine with *your* lack of concern if you don't need it, but an
email module that doesn't care really is not acceptable in any of the
Asian cultures; they have more characters to worry about than the Bush
administration has "suspicious foreign elements".  Although the
various standards are far better at keeping track of their charges
than the Department of Homeland Security, you still get junk in
messages, and codecs are of varying quality in error-handling.

If you want to restrict yourself to the Unicode-feasible layer, then
it would be very cool if you would watch for any leakage of bytes or
encoding-related lossage into that layer, and scream bloody murder if
they do.  (E.g., the APIs that handle well-formed messages should never
ever raise UnicodeError or codec errors themselves.)

 > is an area of problems at present), and I was hoping that the interfaces 
 > that would be presented by Python 3.0 mail APIs would be in terms of 
 > Unicode,

For the applications I guess you have in mind, they can and should
be.  But there is no reason why Python can't be used for bit-flicking
at the RFC 2821 transport-protocol level.  I don't see a way at
present to separate that level from the email module because of the
Postel Principle; you can get anything in email and you have to live
with it.  The various API layers are going to need to cooperate
closely, and given how specialized and crufty the bytes-to-Unicode
relationship is, I think the lexing/parsing layer probably should be
allowed to have a pretty fluid API for quite a while.

There need to be two (and I would say three is better) sets of APIs:
byte-oriented for handling the wire protocol, Unicode-oriented for
handling well-formed messages (both presentation and composition), and
(probably) a "codec" layer which handles nastiness in the transition.
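
The three layers might be sketched roughly as follows.  This is an
illustration under assumptions, not a proposed design: to_text,
subject_text and body_text are hypothetical helper names, the sample
message is made up, and email.message_from_bytes only exists in Python
3.2 and later.  email.header.decode_header and Message.get_payload are
existing calls.

```python
import email
from email.header import decode_header

# Layer 1 (bytes): what arrives on the wire.  Hypothetical sample message.
WIRE = (
    b"From: sender@example.org\n"
    b"Subject: =?iso-8859-1?Q?Caf=E9?=\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"na\xefve\n"
)

def to_text(data, charset):
    """Layer 2 (codec): absorb bad bytes and bogus charset labels, never raise."""
    if isinstance(data, str):
        return data
    try:
        return data.decode(charset or "ascii", errors="replace")
    except LookupError:  # unknown charset name
        return data.decode("ascii", errors="replace")

def subject_text(msg):
    """Layer 3 (Unicode): callers only ever see str."""
    raw = str(msg.get("Subject", ""))
    return "".join(to_text(chunk, cs) for chunk, cs in decode_header(raw))

def body_text(msg):
    """Layer 3 (Unicode): decoded body of a non-multipart message."""
    payload = msg.get_payload(decode=True) or b""
    return to_text(payload, msg.get_content_charset())

msg = email.message_from_bytes(WIRE)
print(subject_text(msg), body_text(msg), sep=" | ")
```

All the charset guessing and error handling sits in to_text, so the
Unicode-level functions have no reason to raise UnicodeError.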

 > for the convenience of being abstracted away from the plethora of
 > encodings that are defined at the mail transport layer.

But handling those is definitely in the domain of the email module.
Anyone attaching documents in legacy encodings will need to deal
with them explicitly when composing Content-Type headers, etc.
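
For instance, a minimal sketch of naming a legacy charset explicitly at
composition time, using email.mime.text.MIMEText.  The sample text and
filename are assumptions for illustration only.

```python
from email.mime.text import MIMEText

# Hypothetical document text that must go out in a legacy Japanese encoding.
text = "こんにちは"

# The charset is stated explicitly; it ends up in the Content-Type header
# and determines how the payload is encoded.
part = MIMEText(text, "plain", "iso-2022-jp")
part.add_header("Content-Disposition", "attachment", filename="greeting.txt")

# The receiving side recovers the text by honoring the declared charset.
declared = part.get_content_charset()
recovered = part.get_payload(decode=True).decode(declared)

print(part["Content-Type"])
```

The composer has to make the charset decision here; nothing downstream
can recover it if the header is wrong or missing.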