[Python-Dev] Patch making the current email package (mostly) support bytes

Fri Oct 8 18:06:29 CEST 2010

Barry Warsaw writes:
 > On Oct 07, 2010, at 04:40 AM, Stephen J. Turnbull wrote:

 > I'm fairly certain that most of the modern causes of [Unicode
 > errors in Mailman] are post-parse modifications of the message.
 > IOW, in Mailman's architecture, we try to parse the raw data into a
 > Message object tree very early in the pipeline, and then a pickled
 > version of that gets passed between the queue runners.
 > 
 > Where we've gotten into trouble before has been things like adding
 > the Subject prefixes and such.

Not to mention those wonderful unremovable addresses containing TAB
etc.

But I'm pretty sure I've seen reports at least in 2.1.9, and probably
more recently than that, where there was 8-bit content in a header of
the incoming message and Mailman blew up on that.  This is stuff that
should have been shunted explicitly, but instead managed to get out of
the parser and then blow up.  I don't think the errors I'm thinking
about were due to Mailman manipulations, but rather insufficient
paranoia in handling incoming hazmat.
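
To make "shunted explicitly" concrete, here is a minimal sketch of the
kind of check I have in mind, run on the raw bytes before anything is
parsed.  The function name has_raw_8bit_header is hypothetical (it is
not part of Mailman or the email package), and treating every raw
8-bit octet in the header block as grounds for shunting is my
assumption about policy, not a description of what Mailman does.

```python
import re


def has_raw_8bit_header(raw: bytes) -> bool:
    """Return True if the header block contains any octet above 0x7F.

    Hypothetical helper: the header block is taken to be everything
    before the first blank line (CRLF or bare LF line endings).
    """
    head = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)[0]
    return any(octet > 0x7F for octet in head)


clean = b"Subject: hello\r\n\r\ncaf\xe9 in the body is not our problem here\r\n"
dirty = b"Subject: caf\xe9\r\n\r\nplain body\r\n"

# A caller could divert `dirty` to a shunt queue instead of parsing it.
results = [has_raw_8bit_header(clean), has_raw_8bit_header(dirty)]
```

Only the header block is inspected, so an 8-bit body does not trigger
the check.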

 > That seems like application logic that the email package can't
 > really get involved with, and indeed Mailman has built up a raft of
 > defense for failures of this kind.

But adding Subject prefixes and the like shouldn't be a problem as
long as the internal representation of each message object (bytes vs
str) is fixed and opaque, so that the module can do appropriate
conversions when necessary.  The problem you face in Python 2 is that
this separation is not properly made: the same values in the message
object can often serve as both text and wire format, and it's hard to
tell which is which.  The Unicode handling is tacked on as an
afterthought.

That mess is entirely unnecessary in Python 3.  Text and wire format
can be easily distinguished with three different representations of
email: Unicode for the conceptual RFC 822 layer (of course this is an
extension, because RFC 822 itself is strictly limited to the ASCII
subset), bytes for wire format, and Message objects for modern
structured mail (including MIME, etc).
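
A rough sketch of those three layers follows.  It assumes a Python 3
whose email package provides email.policy (which postdates this
message), so read it as an illustration of the separation rather than
a description of the package under discussion.  The sample message
and addresses are made up.

```python
import email
from email import policy

# Layer 1, wire format: bytes, with the Subject RFC 2047-encoded.
wire_in = (
    b"From: sender@example.com\r\n"
    b"Subject: =?utf-8?b?Y2Fmw6k=?=\r\n"
    b"\r\n"
    b"body\r\n"
)

# Layer 2, structured mail: a Message object parsed from the bytes.
msg = email.message_from_bytes(wire_in, policy=policy.default)

# Layer 3, text: header values come back as ordinary character strings
# and mix with other strings without any manual decoding.
subject = msg["Subject"]
text = "[list] " + subject

# Going back to the wire is an explicit conversion to bytes.
wire_out = msg.as_bytes()
```

The point is that the caller never has to guess whether a given value
is text or wire format: the type says which it is.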

*If* email6 is reengineered with that kind of structure, then you
should be able to dispense with almost all of the raft of defense,
because the email module will give you well-behaved Message objects,
whose text components (including the header) are well-behaved
character strings that mix seamlessly with other character strings.
Maybe even in email5 ....
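
For instance, the Subject prefix case could then look something like
the sketch below.  Again this assumes a Python 3 with email.policy,
and add_subject_prefix is a hypothetical helper, not Mailman's actual
code.

```python
import email
from email import policy


def add_subject_prefix(msg, prefix):
    """Prepend `prefix` to the Subject, working purely with text."""
    old = msg["Subject"]  # a str (or None if the header is absent)
    new = f"{prefix} {old}" if old else prefix
    # The default policy allows only one Subject header, so remove the
    # old one before assigning the new value.
    del msg["Subject"]
    msg["Subject"] = new
    return msg


raw = b"Subject: =?utf-8?b?Y2Fmw6k=?=\r\n\r\nbody\r\n"
msg = email.message_from_bytes(raw, policy=policy.default)
add_subject_prefix(msg, "[list]")

# Serializing re-encodes the non-ASCII text for the wire; parsing the
# result again should give back the same character string.
roundtrip = email.message_from_bytes(msg.as_bytes(), policy=policy.default)
```

No explicit charset handling appears in the application code; the
conversions happen at the parse and serialize boundaries.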