[Python-Dev] Patch making the current email package (mostly) support bytes

Wed Oct 6 20:31:34 CEST 2010

R. David Murray writes:

 >   5.  Return the content, with non-ASCII bytes replaced with ?
 >   characters.

That hadn't occurred to me (and it makes me sick to contemplate it).

That said, this is probably good enough for Mailman-like apps to limp
along for "most" users.  It's certainly good enough for the "might
kick your wife and elope with your dog" alpha ports of Mailman to
Python 3 (well, as certain as I can be; of course in the end Barry
decides).  Assuming reasonable backward compatibility of the API, of
course!

 > In other words, my proposed patch only makes email5 1/8 to 1/4
 > broken, instead of half broken as it is now.  But not un-broken
 > enough for Mailman, it sounds like.

IMO, not in the long run.  But realistically, in the applications I
know of, most desired traffic is conformant, and since there aren't
any Python 3 email apps yet, this isn't even a regression. :-/

I do think that it's important that the parsed object be able to tell
you what fields are there (except if the field name itself is invalid)
and return field bodies parsed as far as possible.

 > If we go this route (as opposed to only handling headers with 8bit data by
 > sanitizing them), then we need to think about the email5 header parsers
 > as well (decode_header and parseaddr).  They are of course going to have
 > the same problems as the rest of the email package with parsing bytes,
 > and you are suggesting that access to those header 8bit bytes is needed.

Yes, that would be preferable to replacing them with ASCII junk.

But I don't see any problem with parsing them; they're syntactically
insignificant by definition.  The problem is purely on output: do I
get verbatim escaped bytes, a sanitized str, or an exception?
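
To make the three outcomes concrete, here is a minimal sketch using
only the str/bytes codecs, not the email package itself.  The sample
header and the variable names are made up for illustration, and it
assumes the parser decodes with the "surrogateescape" error handler.

```python
# Hypothetical raw header line containing one non-ASCII byte (0xE9).
raw = b"Subject: caf\xe9 menu"

# Parse side: decode as ASCII, carrying each non-ASCII byte through
# as a lone surrogate code point instead of failing.
escaped = raw.decode("ascii", errors="surrogateescape")

# Output choice 1: the verbatim bytes, recovered by reversing the escape.
verbatim = escaped.encode("ascii", errors="surrogateescape")

# Output choice 2: a sanitized str, each escaped byte replaced by "?".
sanitized = escaped.encode("ascii", errors="replace").decode("ascii")

# Output choice 3: an exception, which is what a strict encode gives.
try:
    escaped.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Parsing is the same in all three cases; only the last step differs.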

 > One option would be to add a keyword to the get and get_all methods
 > that instructs it to return the string with the surrogate-escaped
 > bytes, which can then be passed onward to decode_header, parseaddr,
 > or a custom decoder.  Then I need to look at what needs to be added
 > to those methods to handle the escaped bytes, and from what you say
 > they too need a keyword telling them to preserve the escaped bytes
 > on output (a "yes I know what I'm doing" flag...
 > 'preserve_escaped_bytes=True'?).

The need is not absolute, but I would have a strong preference for
being able to get at those bytes.
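
Roughly what I have in mind, as a toy sketch.  HeaderStore is a
made-up stand-in, not the email package's Message class, and the
keyword name is simply the one floated in the quoted proposal; none
of this is an existing API.

```python
class HeaderStore:
    """Toy stand-in for a parsed message (hypothetical, for illustration)."""

    def __init__(self, raw_headers):
        # raw_headers: iterable of (name, value) pairs, both bytes.
        # Non-ASCII bytes in values are kept as surrogate escapes.
        self._headers = [
            (name.decode("ascii"),
             value.decode("ascii", errors="surrogateescape"))
            for name, value in raw_headers
        ]

    def get_all(self, name, preserve_escaped_bytes=False):
        values = [v for n, v in self._headers if n.lower() == name.lower()]
        if preserve_escaped_bytes:
            # "Yes, I know what I'm doing": hand back the escaped str.
            return values
        # Default: sanitize, replacing each escaped byte with "?".
        return [v.encode("ascii", errors="replace").decode("ascii")
                for v in values]

    def get(self, name, default=None, preserve_escaped_bytes=False):
        values = self.get_all(name, preserve_escaped_bytes)
        return values[0] if values else default


store = HeaderStore([(b"From", b"Ren\xe9 <rene@example.com>")])

safe = store.get("From")
escaped = store.get("From", preserve_escaped_bytes=True)

# A custom decoder can recover the bytes and guess a charset itself.
original = escaped.encode("ascii", errors="surrogateescape")
guessed = original.decode("latin-1")
```

The default stays safe for callers that only ever expect clean str,
while the flag leaves the original bytes recoverable for callers that
want to apply their own decoding.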

 > Does my proposal make sense?  But note, it raises exactly the backward
 > compatibility concerns you mention in your next email (that I will reply
 > to next).  It is an open question whether it is worth opening that door
 > in order to be able to do extended handling on non-RFC conforming email
 > (as opposed to just sanitizing it and soldiering on).

Well, maybe not.  However, it is not obvious to me that you won't run
into these issues again in email6.  Applications that think of email
as textual objects are going to want to make their own choices about
handling of non-conforming email, and it's likely to be massively
inconvenient to say "OK, but you have to use bytes interfaces
exclusively, because the str interfaces don't handle that."