[Python-Dev] Patch making the current email package (mostly) support bytes

Tue Oct 5 17:05:23 CEST 2010

On Tue, 05 Oct 2010 22:05:33 +1000, Nick Coghlan wrote:
> On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > R. David Murray writes:
> > > Only if the email package contains a coding error would the
> > > surrogates escape and cause problems for user code.
> >
> > I don't think it is reasonable to internalize surrogates that way;
> > some applications *will* want to look at them and do something useful
> > with them (delete them or replace them with U+FFFD or ...). However,
> > I argue below that the presence of surrogates already means the user
> > code is under fire, and this puts the problem in a canonical form so
> > the user code can prepare for it (if that is desirable).
> 
> Hang on here, this objection doesn't seem to quite mesh with what RDM
> is proposing (and the similar trick I am considering for
> urllib.parse).

[snip Nick's clear explanation of the issue and using surrogates to
allow string-based algorithms to work]

> My understanding is that email6 in 3.3 will essentially follow that
> same model. What I believe RDM is suggesting is an in-between approach
> for the 3.2 email module:
> 
> - if you pass in bytes data that isn't 7-bit clean and naively use the
> str APIs to access the headers, then it will complain loudly if it is
> about to return escaped data (but will decode the body in accordance
> with the Content Transfer Encoding)

Almost correct.  When it does not have the information needed to decode
the bytes correctly (i.e., the message is not RFC compliant), it will
replace the unknown bytes with '?' characters.  This means that you can
render a "dirty" email to the terminal, for example, and the invalid
bytes will show as '?'s.[*]
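
To illustrate the mechanism, here is a minimal sketch using only plain
str and bytes operations.  It is not the patch's code and uses no email
package API; the sample header is made up.  It shows how the
surrogateescape error handler carries undecodable bytes inside a str, and
how encoding with "replace" turns them into '?'.

```python
# Plain str/bytes illustration of the trick; not email-package API.
raw = b"Subject: caf\xe9"  # 0xE9 is not valid ASCII (made-up sample data)

# Decoding with surrogateescape smuggles each undecodable byte into the
# str as a lone surrogate (0xE9 becomes U+DCE9), so str-based algorithms
# can keep working on the data.
text = raw.decode("ascii", "surrogateescape")

# Re-encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("ascii", "surrogateescape")

# Encoding with "replace" instead turns each smuggled byte into '?',
# which is safe to print or to hand to ASCII-only consumers.
display = text.encode("ascii", "replace").decode("ascii")

print(repr(text))        # 'Subject: caf\udce9'
print(roundtrip == raw)  # True
print(display)           # Subject: caf?
```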

> - if you pass in bytes data and know what you are doing, then you can
> access that raw bytes data and do your own decoding

With the current patch this is a true statement for message bodies, but
not for message headers.  There is no easy way to add access to the bytes
version of headers to the email5 API, but since any such data would be
non-RFC compliant anyway, that will just have to be good enough for now.
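
As a sketch of the body case, the following uses message_from_bytes and
get_payload(decode=True) as they exist in current Python releases, which
may differ in detail from the patch discussed here.  The message content
is made up for illustration.

```python
import email

# Made-up sample: an 8bit body declared as latin-1.
raw = (b"From: sender@example.com\n"
       b"Content-Type: text/plain; charset=latin-1\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"caf\xe9\n")

msg = email.message_from_bytes(raw)

# decode=True returns the body as bytes, so the caller can do its own
# decoding instead of relying on the package's choice.
body_bytes = msg.get_payload(decode=True)
body_text = body_bytes.decode(msg.get_content_charset())

print(body_bytes)
print(body_text)
```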

> I've probably grossly oversimplified what RDM is suggesting, but it
> sounds plausible as a useful interim stepping stone to the more
> comprehensive type separation in email6.

The more I look at the patch the more I think this can be an internal
implementation detail in email5 just like you might do for urllib.
So the email5 API will have a way to put bytes in, a way to get decoded
data out, and a way to get bytes out (except for individual header
values).  The model object will be the same no matter what you put in
or take out.  The additional methods added to the email5 API to make
this possible will be:

    message_from_bytes (and Parser.parsebytes)
    message_from_binary_file
    FeedParser.feedbytes
    BytesGenerator

message_from_bytes and message_from_binary_file are currently part
of the proposed email6 API, and I was thinking about some version of
FeedParser.feedbytes[**].  BytesGenerator wasn't, but now perhaps it
will be (and certainly will be in the backward compatibility interface).
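
A minimal sketch of the bytes-in, decoded-data-out, bytes-out flow,
assuming message_from_binary_file and BytesGenerator behave as they do
in current Python releases (the patch under discussion may have differed
in detail).  The sample message is made up.

```python
import io
from email import message_from_binary_file
from email.generator import BytesGenerator

# Made-up sample message.
raw = b"From: sender@example.com\nSubject: hello\n\nbody text\n"

# Bytes in: parse from a binary file-like object.
msg = message_from_binary_file(io.BytesIO(raw))

# Decoded data out: header values come back as str.
subject = msg["Subject"]

# Bytes out: serialize the same model object back to bytes.
out = io.BytesIO()
BytesGenerator(out).flatten(msg)
flattened = out.getvalue()

print(subject)
print(flattened)
```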

--
R. David Murray                                      www.bitdance.com

[*] Why '?' and not the Unicode replacement character (U+FFFD)?  Well,
the email5 Generator.flatten can be used to generate data for
transmission over the wire *if* the source is RFC compliant and
7bit-only, and this would be a normal email5 usage pattern (that is,
smtplib.SMTP.sendmail expects ASCII-only strings as input!).  So the
data generated by Generator.flatten should not include non-ASCII
characters...which raises a problem for CTE 8bit sections that the patch
doesn't currently address.

[**] Benjamin asked how the patch would affect backward compatibility
support in email6, and I said it wouldn't make it harder.  However,
if feedbytes calls can be mixed with feed calls, which in the simplest
implementation they could be, then if email6 does *not* use surrogates
internally its feedparser algorithm would need to be considerably
more complicated to be backward compatible with this.  So when I add
FeedParser.feedbytes to my patch, I am at least initially going to
disallow mixing calls to feed and feedbytes.  Which is another reason
to add that method so as to keep the use of surrogateescape an
implementation detail.
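
One way the restriction on mixing feed and feedbytes could look, as a
rough sketch only.  GuardedFeedParser and its _claim helper are
hypothetical names invented for this illustration, not part of the patch
or of the email package; only email.feedparser.FeedParser is real API.

```python
from email.feedparser import FeedParser


class GuardedFeedParser:
    """Hypothetical wrapper: accepts str or bytes input, but not both."""

    def __init__(self):
        self._parser = FeedParser()
        self._mode = None  # becomes "str" or "bytes" on the first call

    def _claim(self, mode):
        # The first call fixes the input type; later calls must match it.
        if self._mode is None:
            self._mode = mode
        elif self._mode != mode:
            raise TypeError("cannot mix feed() and feedbytes() calls")

    def feed(self, data):
        self._claim("str")
        self._parser.feed(data)

    def feedbytes(self, data):
        self._claim("bytes")
        # Non-ASCII bytes ride along as surrogates; callers never see this.
        self._parser.feed(data.decode("ascii", "surrogateescape"))

    def close(self):
        return self._parser.close()


parser = GuardedFeedParser()
parser.feedbytes(b"Subject: hi\n\n")
parser.feedbytes(b"body\n")
msg = parser.close()
print(msg["Subject"])
```

Keeping the decode step inside feedbytes means the surrogates stay an
internal detail, which is the point made in the footnote above.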