[Python-3000] email libraries: use byte or unicode strings?

Thu Nov 6 07:04:44 CET 2008

On approximately 11/5/2008 6:09 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>  > On approximately 11/5/2008 2:59 PM, came the following characters from 
>  > the keyboard of Andrew McNamara:
>  > >> I would find
>  > >>
>  > >> 	message[b'Subject'] = b'Hello'
>  > >>
>  > >> to be totally gross.
> 
> Indeed.
> 
>  > >> Depending on the level of email interface, there should be no interface 
>  > >> that cannot be expressed in terms of Unicode, plus an encoding to use 
>  > >> for the associated data.  Even 8-bit binary can be translated into a 
>  > >> sequence of Unicode codepoints with the same numeric value, for example. 
> 
> Also totally gross.  RFC 2821 is bytes, RFC 2822 is Unicode (in
> spirit, even though headers are limited to ASCII), RFC 2045-and-the-
> cast-of-thousands interfaces the two.  We can't really get around
> this, IMO.
> 
>  > > One significant problem is that the email module is intended to be
>  > > able to work with malformed e-mail without mangling it too badly. The
>  > > malformed e-mail should also make a round-trip through the email module
>  > > without being further mangled.
>  > 
>  > This is an interesting perspective... "stuff em" does come to mind :)
> 
> Not acceptable in Japan, or anywhere that Microsoft beta products are
> used, for that matter.  (At one point Outhouse Excess betas were
> sending HTML *with tags in unibyte ASCII and element content in
> little-endian UTF-16*.)

So I would hope that the users of such Betas would quickly discover that 
they were producing garbage, report it to M$, and go back to using a 
release version with only the usual expectation of bugs, 
inconsistencies, standards violations, and security exploits.  They 
should not expect that Beta software is, or should be, fully compatible 
with other applications that handle proper email.

Did Python's 2.x mail library handle the data that you describe?  Did 
anyone seriously expect it to?  Did Mozilla clients handle it?  Can you 
provide a list of email clients that handled it gracefully, other than 
the same Outhouse Excess client that produced it?  And if not, why would 
you expect Python's 3.0 mail library to handle it?

>  > But I'm not at all clear on what you mean by a round-trip through the 
>  > email module.
> 
> Bounce messages, for example.

OK, my other reply just now described a way to handle that.

>  > I guess I'm not terribly concerned about the readability of improperly 
>  > encoded email messages, whether they are spam or ham.
> 
> I'm fine with *your* lack of concern if you don't need it, but an
> email module that doesn't care really is not acceptable in any of the
> Asian cultures; they have more characters to worry about than the Bush
> administration has "suspicious foreign elements".  Although the
> various standards are far better at keeping track of their charges
> than the Department of Homeland Security, you still get junk in
> messages, and codecs are of varying quality in error-handling.
> 
> If you want to restrict yourself to the Unicode-feasible layer, then
> it would be very cool if you would watch for any leakage of bytes or
> encoding-related lossage into that layer, and scream bloody murder if
> they do.  (Eg, the APIs that handle well-formed messages should never
> ever raise UnicodeError or codec errors themselves.)

Sure, and I'm fine with your concern about being able to reasonably 
handle invalid messages.  I'm concerned about that too.  But I'm not 
sure that the mixed single-byte and double-byte bug you describe above 
is in the realm of reasonable.  Even so, it could be handled by 
transliterating it using the Latin-1 transform; there'd be lots of 
gibberish, but it wouldn't create exceptions.  The reader would quickly 
gather that the message is a mess, and be able to report that to the 
sender, who should know that they are using Beta software, and should 
try resending with a production version, and report the bug to M$.
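
A minimal sketch of the Latin-1 transform mentioned above, assuming 
Python 3 bytes and str types.  The sample payload (ASCII tags around 
little-endian UTF-16 content) is made up for illustration; it is not 
taken from a real message.

```python
# Latin-1 maps each byte 0x00-0xFF to the code point with the same value,
# so decoding arbitrary bytes never raises, and encoding the result back
# with Latin-1 recovers the original bytes exactly.

# Hypothetical malformed body: ASCII tags, UTF-16-LE element content.
raw = b"<p>" + "\u65e5\u672c".encode("utf-16-le") + b"</p>"

text = raw.decode("latin-1")        # gibberish to a reader, but no exception
round_tripped = text.encode("latin-1")

print(repr(text))
print(round_tripped == raw)
```

The decoded string is unreadable, but it has one code point per input 
byte, so nothing is lost on the way back out.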

>  > is an area of problems at present), and I was hoping that the interfaces 
>  > that would be presented by Python 3.0 mail APIs would be in terms of 
>  > Unicode,
> 
> For the applications I guess you have in mind, they can and should
> be.  But there is no reason why Python can't be used for RFC
> 2821-level bit-flicking transport protocol.  

The term "bit-flicking" is foreign to me; it does not appear in the mail 
RFCs.  Hence, I have little clue what you are talking about here.  There 
is no reason that RFC 2821 couldn't be implemented with a Unicode 
interface, as far as I can see.

> I don't see a way at
> present to separate that level from the email module because of the
> Postel Principle; you can get anything in email and you have to live
> with it.  The various API layers are going to need to cooperate
> closely, and given how specialized and crufty the bytes-to-Unicode
> relationship is, I think the lexing/parsing layer probably should be
> allowed to have a pretty fluid API for quite a while.
> 
> There need to be two (and I would say three is better) sets of APIs:
> byte-oriented for handling the wire protocol, Unicode-oriented for
> handling well-formed messages (both presentation and composition), and
> (probably) a "codec" layer which handles nastiness in the transition.

I see no reason the wire protocol cannot be implemented with Unicode 
APIs.  Granted, the wire protocol is defined in terms of bytes, but the 
set of legal commands and responses is in the ASCII subset; with 
encoding violations, the illegal commands and responses may be in the 
Latin-1 subset, or some other code page (default system code page?). 
But the API could speak Unicode, and do the appropriate translations. 
Or in some cases, inappropriate translations.
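
A rough sketch of what such a translation layer might look like, 
assuming Latin-1 as the fallback for non-ASCII bytes.  The names 
from_wire and to_wire are hypothetical, chosen for illustration; they 
are not part of any existing module.

```python
def from_wire(line: bytes) -> str:
    """Present a wire-protocol line as str (hypothetical helper).

    Legal commands and responses are ASCII; anything else falls back to
    Latin-1, which accepts every byte value and so cannot raise.
    """
    try:
        return line.decode("ascii")
    except UnicodeDecodeError:
        return line.decode("latin-1")


def to_wire(text: str) -> bytes:
    """Turn a str produced by from_wire back into the original bytes.

    Raises UnicodeEncodeError for code points above U+00FF, which
    from_wire never produces.
    """
    return text.encode("latin-1")


command = b"MAIL FROM:<sender@example.com>\r\n"
broken = b"250 caf\xe9 ok\r\n"      # illegal 8-bit byte in a response

print(from_wire(command))
print(to_wire(from_wire(broken)) == broken)
```

Whether Latin-1 is the right fallback, rather than some other code page, 
is exactly the "inappropriate translation" question; the sketch only 
shows that the round trip is lossless.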

>  > for the convenience of being abstracted away from the plethora of
>  > encodings that are defined at the mail transport layer.
> 
> But handling those is definitely in the domain of the email module.
> Any attachments of documents in legacy encodings will need to deal
> with them explicitly in composition of Content-Type headers, etc.

Definitely in the domain of the email module.  Not clearly necessary to 
expose in the API.  Binary attachments being delivered as bytes, yes; a 
way of obtaining the whole email message in the form of its wire 
protocol, yes; a way of obtaining the whole set of headers in the form 
of its wire protocol, for use in bounce messages, yes; what else could 
be usefully provided as bytes, that cannot be equally well handled by 
returning bytes transliterated to Unicode?

Please be specific; merely mentioning bit-flicking, error cases, or bad 
encodings sounds terrible, but provides little information about what 
can be handled via some theoretical bytes interface that cannot be 
handled equally effectively (although perhaps not equally efficiently) 
via a transliterated Unicode data stream.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking