[Python-Dev] Patch making the current email package (mostly) support bytes

Sun Oct 3 01:00:27 CEST 2010

A while back on some issue or another I remember telling someone that
if there was any sort of clever hack that would allow the current email
package (email5) to work with bytes we would have implemented it.

Well, I've come up with a clever hack.

The idea came out of a conversation with Antoine.  I was saying that it
was ironic that Unicode could only be used as a 7bit-clean data
transmission channel for email, and he remarked that by using
surrogate escape you *could* use unicode as a transmission channel
for 8bit data.  At first I dismissed this observation as irrelevant
to email, since email has to transform the 8bit data at some point.

But I started thinking.  And then I started experimenting.  And it turns
out that it works.

The clever hack (thanks ultimately to Martin) is to accept 8bit data
by decoding it using the ASCII codec and the surrogateescape error
handler.  Then, inside the email module at any point where bytes might
be meaningful or might be about to escape, it can check to see if there
are any surrogates and act accordingly.
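
As a rough sketch of the mechanism (the sample bytes and the
has_surrogates helper are made up for illustration and are not
taken from the patch):

```python
# Sample 8bit input: 0xE9 is not valid ASCII.
raw = b"Subject: caf\xe9\n"

# Decoding with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")


def has_surrogates(s):
    """Hypothetical helper: True if s carries any escaped bytes."""
    return any("\udc80" <= ch <= "\udcff" for ch in s)


# Encoding with the same codec and handler recovers the bytes.
restored = text.encode("ascii", errors="surrogateescape")
```

Here text is an ordinary str, so it can travel through string-based
code, and the surrogate check tells you whether there are bytes in it
that still need to be dealt with.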

The API additions are few, and in fact for most programs (he says bravely,
not really knowing) there are really only two changes you need to make
when converting a program that handles bytes data to py3k.  The first
is the decoding of binary input data as mentioned.  The second is that
when you want to get the bytes back out, you use the new BytesGenerator
instead of Generator.  BytesGenerator is just like Generator except
that it writes bytes to its file argument instead of strings, and it
recovers any bytes that were in the original input.

So given this sequence:

    import email
    import email.generator

    with open('myfile', encoding='ascii',
              errors='surrogateescape') as infile:
        msg = email.message_from_file(infile)
    with open('myfile2', 'wb') as outfile:
        email.generator.BytesGenerator(outfile).flatten(msg)

myfile and myfile2 will theoretically be identical (modulo universal
newline and _mangle_from issues).

I've additionally added a 'message_from_bytes' convenience function.
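
A minimal sketch of the same round trip done entirely in memory with
message_from_bytes (the sample message is invented, and passing
mangle_from_=False is my assumption for sidestepping the _mangle_from
issue mentioned above):

```python
import email
import io
from email.generator import BytesGenerator

# Invented sample message with a raw 8bit byte (0xE9) in the body.
raw = (
    b"From: sender@example.com\n"
    b"To: rcpt@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=latin-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9 au lait\n"
)

# message_from_bytes does the ascii+surrogateescape decoding itself.
msg = email.message_from_bytes(raw)

# BytesGenerator writes bytes, so it needs a binary file object.
out = io.BytesIO()
BytesGenerator(out, mangle_from_=False).flatten(msg)
flattened = out.getvalue()
```

For a simple unmodified message like this one, flattened should come
out the same as raw.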

One nice feature of this patch is that once you've got the model built
from surrogateescaped input, if you do a get_payload() on a message body
whose Content-Transfer-Encoding is '8bit' you will get the body decoded
to unicode using the charset declared in the Content-Type header
(assuming Python supports that charset).

You can always get at the bytes version of the body of a message part by
using get_payload(decode=True) [*].  You can't really get at the bytes
version of message headers, though.  For safety, if you access a header
whose value contains non-ASCII bytes (that aren't RFC 2047 encoded to be
ASCII), the 8bit characters get replaced with '?'s.  (But BytesGenerator
will emit the original 8bit characters if the headers haven't been
modified.)
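
To illustrate the two ways of getting at a body, here is a small
sketch (the sample message is invented, and the behavior in the
comments is what the description above implies for an 8bit latin-1
body):

```python
import email

# Invented sample: an 8bit body declared as latin-1.
raw = (
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=latin-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9 au lait\n"
)
msg = email.message_from_bytes(raw)

# No decode argument: a str, with the escaped bytes decoded using
# the charset from the Content-Type header.
text_body = msg.get_payload()

# decode=True: the Content-Transfer-Encoding is undone and you get
# the body back as bytes.
bytes_body = msg.get_payload(decode=True)
```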

I do not propose that this is a *good* API, since it has the classic
problem that if there are coding bugs in the email module strings may
"escape" that have surrogates in them and we end up with programs that
work most of the time....except when they fail with mysterious errors
because of unusual bytes input data.  On the other hand you always
*know* when you have bytes data in an unknown encoding (because they
are surrogate escaped), so it is ever so much better than the Python2
situation.

The advantage of this patch is that it means Python3.2 can have an
email module that is capable of handling a significant proportion of the
applications where the ability to process binary email data is required.

I've uploaded the patch to issue 4661 (http://bugs.python.org/issue4661).
I uploaded it to rietveld as well just before Martin's announcement.
After the announcement I uploaded the svn patch to the tracker, so
hopefully there will be an automated review button as well.  Here
is your chance to exercise the new review tools :)

This patch does break two of Barry's patch-for-review rules: it is
more than 800 lines of diff (but not a lot more, and less than 800
if you count only code diff and not docs), and it did not have a very
extensive design discussion beforehand.  I did talk with people on IRC,
particularly Barry, before finishing the patch, and I did post a summary
to the email-sig mailing list (but got no response).

Now it is time to see what the wider community thinks.  There is some
question of whether this is a bending of the string/bytes separation
that doesn't belong as part of the standard library, but after working
my way through it I think it is a fairly clean hack[**], and most
likely a case where practicality beats purity.

Regardless of whether or not this patch or a descendant thereof is
accepted I still intend to continue working on email6.  There are many
other bugs in the current email package that require a rewrite of parts
of its infrastructure, and the email-sig is agreed that the email API
needs revision quite apart from the bytes/string issues.  However, there
is something pleasing about the simplicity of this way of handling bytes
that I intend to consider carefully while we work further on email6.

--David

[*] It is counterintuitive that 'decode=True' gives you bytes and
'decode=False' gives you strings, but in this case 'decode' refers to
the Content-Transfer-Encoding...and this confusion is one of the reasons
I think the email API needs a big overhaul.

[**] There are a couple of places where generator pokes into the internals
of Message in a way it hasn't before, but this could be fixed by defining a
'bytes access' API on Message, which would probably be a good idea anyway.
There is also the possibility of wrapping up the 'ascii+surrogateescape'
stuff inside APIs that accept input data, to hide that 'implementation
detail' from the email package user.