[Python-Dev] Patch making the current email package (mostly) support bytes

Wed Oct 6 23:39:14 CEST 2010

On Thu, 07 Oct 2010 03:31:34 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> R. David Murray writes:
> 
>  >   5.  Return the content, with non-ASCII bytes replaced with ?
>  >   characters.
> 
> That hadn't occurred to me (and it makes me sick to contemplate it).
> 
> That said, this is probably good enough for Mailman-like apps to limp
> along for "most" users.  It's certainly good enough for the "might
> kick your wife and elope with your dog" alpha ports of Mailman to
> Python 3 (well, as certain as I can be; of course in the end Barry
> decides).  Assuming reasonable backward compatibility of the API, of
> course!

Yeah, "good enough" is pretty much the goal here.

>  > In other words, my proposed patch only makes email5 1/8 to 1/4
>  > broken, instead of half broken as it is now.  But not un-broken
>  > enough for Mailman, it sounds like.
> 
> IMO, not in the long run.  But realistically, in the applications I
> know of, most desired traffic is conformant, and since there aren't
> any Python 3 email apps yet, this isn't even a regression. :-/
> 
> I do think that it's important that the parsed object be able to tell
> you what fields are there (except if the field name itself is invalid)
> and return field bodies parsed as far as possible.

Well, email doesn't currently parse the bodies any further by itself.
You have to call parsing routines to get further parsing.  So maybe
what I should do is work on finalizing the patch without addressing the
'give me the escaped bytes' issue, and then prepare a follow-on patch
that adds that keyword and adjusts the header parsing helpers accordingly.
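
To illustrate the point about further parsing being a separate step, here
is a minimal sketch using the existing helpers (decode_header and
parseaddr). The sample message and the variable names are made up for
illustration; the sketch only shows that the message object hands back the
raw header text and that the caller invokes the parsing routines itself.

```python
import email
from email.header import decode_header
from email.utils import parseaddr

# Hypothetical sample message, ASCII-only on the wire (RFC 2047 encoded word).
RAW = (
    "From: Jane Doe <jane@example.com>\n"
    "Subject: =?utf-8?q?caf=C3=A9?=\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(RAW)

# The message object returns header values as unparsed text.
raw_from = msg["From"]
raw_subject = msg["Subject"]

# Further parsing is an explicit call made by the application.
name, addr = parseaddr(raw_from)

# decode_header yields (chunk, charset) pairs; chunks may be bytes or str.
subject = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in decode_header(raw_subject)
)
```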

>  > If we go this route (as opposed to only handling headers with 8bit data by
>  > sanitizing them), then we need to think about the email5 header parsers
>  > as well (decode_header and parseaddr).  They are of course going to have
>  > the same problems as the rest of the email package with parsing bytes,
>  > and you are suggesting that access to those header 8bit bytes is needed.
> 
> Yes, that would be preferable to replacing them with ASCII junk.
> 
> But I don't see any problem with parsing them; they're syntactically
> insignificant by definition.  The problem is purely on output: do I
> get verbatim escaped bytes, a sanitized str, or an exception?

Right, the needed change is to sanitize by default and to provide the
keyword to get the escaped bytes.  Mostly it'll be writing tests :)
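
A rough sketch of the two output modes, not the patch itself: header_text
is a hypothetical helper, and using the 'surrogateescape' error handler to
represent "escaped bytes" is an assumption made for this example. The
default path replaces each non-ASCII byte with '?'; the keyword path keeps
the bytes recoverable.

```python
def header_text(raw: bytes, escape_bytes: bool = False) -> str:
    """Hypothetical helper: turn raw header bytes into a str.

    By default non-ASCII bytes are sanitized to '?'.  With
    escape_bytes=True they are smuggled through as lone surrogates
    (an assumption for this sketch) so the caller can recover them.
    """
    if escape_bytes:
        return raw.decode("ascii", "surrogateescape")
    return "".join(chr(b) if b < 128 else "?" for b in raw)


raw = b"Subject: caf\xe9"          # non-conforming: raw 8bit byte in a header

sanitized = header_text(raw)                     # lossy, always safe to print
escaped = header_text(raw, escape_bytes=True)    # lossless, needs care on output

# The escaped form round-trips back to the original bytes.
restored = escaped.encode("ascii", "surrogateescape")
```

The trade-off is the one described above: the sanitized str is safe to
pass anywhere but loses data, while the escaped str preserves the data but
will raise if it is encoded without the matching error handler.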

>  > Does my proposal make sense?  But note, it raises exactly the backward
>  > compatibility concerns you mention in your next email (that I will reply
>  > to next).  It is an open question whether it is worth opening that door
>  > in order to be able to do extended handling on non-RFC conforming email
>  > (as opposed to just sanitizing it and soldiering on).
> 
> Well, maybe not.  However, it is not obvious to me that you won't run
> into these issues again in Email6.  Applications that think of email
> as textual objects are going to want to make their own choices about
> handling of non-conforming email, and it's likely to be massively
> inconvenient to say "OK, but you have to use bytes interfaces
> exclusively, because the str interfaces don't handle that."

The strategy in email6 so far is for the application program to be
able to access *any piece* of the parsed data as either text or bytes,
and for the header parsers to record defects when there are non-ASCII
bytes where there aren't supposed to be.  So the application can check
for defects and retrieve, say, the comment field that has the non-ASCII
*as bytes* and decode it.  Or, if it doesn't care about parsing them,
it just modifies the fields it wants to modify that *are* valid, and the
invalid non-ASCII comment gets carried along and emitted when the message
is serialized as bytes.
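
As a toy model of that strategy (this is not the email6 API; ParsedField
and serialize are hypothetical names invented for this sketch), each field
keeps its raw bytes, records a defect if it sees non-ASCII, and can be read
as text or bytes. Untouched invalid fields are emitted verbatim.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedField:
    """Hypothetical parsed header field; not an actual email6 class."""
    name: str
    raw: bytes
    defects: list = field(default_factory=list)

    def __post_init__(self):
        # Record a defect rather than raising when non-ASCII shows up.
        if any(b > 127 for b in self.raw):
            self.defects.append("non-ASCII bytes in header value")

    def as_bytes(self) -> bytes:
        return self.raw

    def as_text(self) -> str:
        # Text view is sanitized; the bytes view stays exact.
        return self.raw.decode("ascii", "replace")


def serialize(fields) -> bytes:
    """Emit every field from its raw bytes, valid or not."""
    return b"".join(
        f.name.encode("ascii") + b": " + f.as_bytes() + b"\r\n" for f in fields
    )


fields = [
    ParsedField("Subject", b"hello"),
    ParsedField("Comments", b"na\xefve"),   # invalid: raw 8bit byte
]

# The application modifies only the field it cares about...
fields[0].raw = b"updated"

# ...or checks for defects and decodes the bytes with a charset it chooses.
guessed = [f.as_bytes().decode("latin-1") for f in fields if f.defects]

# The invalid field is carried along unchanged on output.
wire = serialize(fields)
```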

This is more or less what we are talking about enabling in email5 with
the 'escape_bytes=True' keyword; it's just a less structured and more
error-prone approach to it than what we have planned for email6.

--
R. David Murray                                      www.bitdance.com