[Python-3000] email libraries: use byte or unicode strings?

Wed Nov 5 23:59:47 CET 2008

>I would find
>
>	message[b'Subject'] = b'Hello'
>
>to be totally gross.
>
>While RFC Email is all ASCII, except if 8bit transfer is legal, there 
>are internal encodings provided that permit the expression of Unicode in 
>nearly any component of the email, except for header identifiers.  But 
>there are never Unicode characters in the transfer, as they always get 
>encoded (there can be UTF-8 byte sequences, of course, if 8bit transfer 
>is legal; if it is not, then even UTF-8 byte sequences must be further 
>encoded).
>
>Depending on the level of email interface, there should be no interface 
>that cannot be expressed in terms of Unicode, plus an encoding to use 
>for the associated data.  Even 8-bit binary can be translated into a 
>sequence of Unicode codepoints with the same numeric value, for example. 
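
The quoted point about 8-bit binary can be sketched in Python 3. This
assumes Latin-1 as the encoding, which is my choice for illustration; the
quoted text does not name one.

```python
# Every possible byte value, standing in for arbitrary 8-bit binary data.
raw = bytes(range(256))

# Latin-1 maps byte N to code point U+00NN, so decoding never fails.
text = raw.decode("latin-1")

# The code points carry the same numeric values as the original bytes...
same_values = [ord(ch) for ch in text] == list(raw)

# ...and encoding with the same codec recovers the bytes exactly.
round_trip = text.encode("latin-1") == raw
```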

One significant problem is that the email module is intended to be
able to work with malformed e-mail without mangling it too badly. The
malformed e-mail should also make a round-trip through the email module
without being further mangled.

I think this requires the underlying processing to be all based on bytes,
but doesn't preclude layers on top that parse the charset hints. The
rules about encoding are strict, but not always followed. For instance,
the headers *must* be ASCII (the header body can, however, be encoded -
see RFC 2047). Spammers often ignore this, and you might be inclined to
say "stuff 'em", but this would make the SpamBayes authors rather unhappy.
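
A rough sketch of the difference, using decode_header() from the standard
email.header module. The helper name header_to_text and both sample
header bodies are invented for illustration. Returning None for a
non-ASCII header is just one possible policy, not what the email module
does.

```python
from email.header import decode_header

# Well-formed: non-ASCII text carried as an RFC 2047 encoded-word.
well_formed = b"=?utf-8?q?Caf=C3=A9?="
# Malformed: raw 8-bit bytes in the header body, no charset label.
malformed = b"Caf\xe9 special offer"

def header_to_text(raw):
    """Hypothetical helper: decoded text, or None if raw is not pure ASCII."""
    try:
        ascii_text = raw.decode("ascii")
    except UnicodeDecodeError:
        return None  # the rules were broken; only the bytes can be trusted
    parts = []
    for chunk, charset in decode_header(ascii_text):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)

decoded_ok = header_to_text(well_formed)
decoded_bad = header_to_text(malformed)

# A lossy decode cannot reproduce the original bytes, so a str-only
# representation would mangle the malformed header on the way back out.
lossy = malformed.decode("utf-8", errors="replace").encode("utf-8")
survives_round_trip = (lossy == malformed)
```

Here the well-formed header decodes to text, while the malformed one can
only be kept intact by holding on to the original bytes.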

One solution is to provide two sets of classes - the underlying
bytes-based one, and another unicode-based one, built on top of the
bytes classes, that implements the same API, but that may fail due to
encoding errors.
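
A minimal sketch of what I mean, with hypothetical class names
(BytesHeader, UnicodeHeader) and a deliberately tiny API. Nothing here
exists in the email module; the strict ASCII default is an assumption.

```python
class BytesHeader:
    """Hypothetical bytes layer: keeps the header exactly as received."""

    def __init__(self, name, value):
        self.name = name      # bytes
        self.value = value    # bytes

    def serialize(self):
        # Always succeeds, even for malformed input.
        return self.name + b": " + self.value + b"\r\n"


class UnicodeHeader:
    """Hypothetical unicode layer with the same API, built on BytesHeader."""

    def __init__(self, raw, charset="ascii"):
        self._raw = raw
        self._charset = charset

    @property
    def name(self):
        return self._raw.name.decode(self._charset)

    @property
    def value(self):
        # May raise UnicodeDecodeError if the sender ignored the rules.
        return self._raw.value.decode(self._charset)

    def serialize(self):
        # Delegate to the bytes layer so the original bytes are written back.
        return self._raw.serialize()


spam = BytesHeader(b"Subject", b"Caf\xe9 special offer")
ham = BytesHeader(b"Subject", b"Hello")

ham_text = UnicodeHeader(ham).value
try:
    UnicodeHeader(spam).value
    spam_decoded = True
except UnicodeDecodeError:
    spam_decoded = False
```

The unicode view fails on the malformed header, but both layers still
serialize it back to the bytes that came in.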

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/