[Python-Dev] Patch making the current email package (mostly) support bytes

Wed Oct 6 15:55:00 CEST 2010

R. David Murray writes:

 > version of headers to the email5 API, but since any such data would
 > be non-RFC compliant anyway, [access to non-conforming headers by
 > reparsing the bytes] will just have to be good enough for now.

But that's potentially unpleasant for, say, Mailman.  AFAICS, what
you're saying is that Mailman will have to implement a full header
parser and repair module, or shunt (and wait for administrator
intervention on) any mail that happens to contain even one byte of
non-RFC-conforming content in a header it cares about.  (Note that
we're not talking about moderator-level admins here; we're talking
about the Big Cheese with access to the command line on the list
host.)  That's substantially worse than the current system, where (in
theory, and in actual practice where it distributes its own version of
email) it can trap the Unicode exception on a per-header basis.
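
To make "per-header basis" concrete, here is a minimal sketch of the kind of trapping I have in mind. It assumes the headers are already available as (name, raw bytes) pairs; `decode_headers` is a hypothetical helper, not something in the email package or in Mailman.

```python
# Hypothetical helper for illustration; not part of email or Mailman.
def decode_headers(raw_headers, charset="ascii"):
    """Decode each (name, raw_bytes) pair on its own.

    Returns (good, bad): `good` maps header names to decoded strings,
    `bad` maps header names to the raw bytes that failed to decode.
    Repeated header names overwrite earlier ones in this simplified sketch.
    """
    good, bad = {}, {}
    for name, raw in raw_headers:
        try:
            good[name] = raw.decode(charset)
        except UnicodeDecodeError:
            # Only this header is set aside; the rest of the message
            # can still be processed normally.
            bad[name] = raw
    return good, bad


raw = [
    ("Subject", b"Hello"),
    ("X-Junk", b"caf\xe9"),          # one stray Latin-1 byte
    ("To", b"list@example.com"),
]
good, bad = decode_headers(raw)
```

The point is only that one bad header does not force the whole message to be shunted; what to do with the entries in `bad` is a separate policy decision.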

I also worry about the implications for backwards compatibility.
Eventually email-N needs to handle non-conforming mail in a sensible
way, or anybody who gets spam (i.e., everybody) and wants a reliable
email system will need to implement their own.  If you punt completely
on handling non-conforming mail now, when is it going to be done?  And
when it is done, will the backward-compatible interface be able to
access the robust implementation, or will people who want robust APIs
have to use rather different ones?  The way you're going right now, I
have to worry about the answer to the second question, at least.

 > [*] Why '?' and not the unicode invalid character character?  Well, the
 > email5 Generator.flatten can be used to generate data for transmission over
 > the wire *if* the source is RFC compliant and 7bit-only, and this would
 > be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects
 > ASCII-only strings as input!).  So the data generated by Generator.flatten
 > should not include unicode...

I don't understand this at all.  Of course the byte stream generated
by Generator.flatten won't contain Unicode (in the headers, anyway);
it will contain only ASCII (that happens to conform to QP or Base64
encoding of Unicode in some appropriate UTF in many cases).  Why is
U+FFFD REPLACEMENT CHARACTER any different from any other non-ASCII
character in this respect?

(Surely you are not saying that Generator.flatten can't DTRT with
non-ASCII content *at all*?)
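
As a rough illustration of why I don't see U+FFFD as special, the sketch below runs a header value containing it through the stdlib `email.header.Header` class with a UTF-8 charset. This is not the email5 patch's code path, and `encode_header_value` is a name made up for the example.

```python
from email.header import Header, decode_header, make_header


def encode_header_value(text):
    """RFC 2047-encode a header value (illustrative helper only)."""
    return Header(text, charset="utf-8").encode()


original = "caf\ufffd"                 # contains U+FFFD REPLACEMENT CHARACTER
wire = encode_header_value(original)   # an encoded-word, i.e. plain ASCII

# Raises UnicodeEncodeError if anything non-ASCII slipped through.
wire.encode("ascii")

# Decoding the encoded-word recovers the original text, U+FFFD included.
roundtrip = str(make_header(decode_header(wire)))
```

The encoded form is ASCII whether the non-ASCII character is U+FFFD or anything else, which is all the question above is asking about.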

The only thing I can think of is that you might not want to introduce
non-ASCII characters into a string that looks like it might simply be
corrupted in transmission (e.g., it contains only one non-ASCII byte).
That's reasonable; there are a lot of people who don't have to deal
with anything but ASCII and occasionally Latin-1, and they don't like
having Unicode crammed down their throats.

 > which raises a problem for CTE 8bit sections
 > that the patch doesn't currently address.

AFAIK, there's no requirement, implied or otherwise, that a conforming
implementation *produce* CTE 8bit.  So just don't do that; that will
keep smtplib happy, no?
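
A minimal sketch of "just don't do that", using the stdlib `MIMEText` class rather than anything from the patch: with a UTF-8 charset the body is given a 7bit-safe transfer encoding, so the flattened message is ASCII-only. The sample body text is of course made up.

```python
from email.mime.text import MIMEText

body = "na\u00efve caf\u00e9\n"           # non-ASCII body text

# With charset utf-8, MIMEText picks a 7bit-safe transfer encoding
# instead of emitting raw 8bit content.
msg = MIMEText(body, "plain", "utf-8")

flat = msg.as_string()                    # headers plus encoded body
cte = msg["Content-Transfer-Encoding"]

# The original text is still recoverable on the receiving side.
decoded = msg.get_payload(decode=True).decode("utf-8")
```

Here `flat` is the sort of ASCII-only string that smtplib.SMTP.sendmail is said above to expect.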