[Python-Dev] Patch making the current email package (mostly) support bytes

Wed Oct 6 19:09:25 CEST 2010

On Wed, 06 Oct 2010 22:55:00 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> R. David Murray writes:
> 
>  > version of headers to the email5 API, but since any such data would
>  > be non-RFC compliant anyway, [access to non-conforming headers by
>  > reparsing the bytes] will just have to be good enough for now.
> 
> But that's potentially unpleasant for, say, Mailman.  AFAICS, what
> you're saying is that Mailman will have to implement a full header
> parser and repair module, or shunt (and wait for administrator
> intervention on) any mail that happens to contain even one byte of
> non-RFC-conforming content in a header it cares about.  (Note that

No, it just means that such bytes would not be preserved for presentation
in the web UI.  They'd show up as '?'s.  (Or U+FFFDs, perhaps, if I change
DecodedGenerator to use U+FFFD instead of ?s for the unknown bytes.)
As long as BytesGenerator is used on the output side to send the messages,
the bytes will be preserved and presented to the moderator in their email.
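
A minimal sketch of that round trip, assuming the message_from_bytes and
BytesGenerator names as they exist in the Python 3 standard library (the
patch under discussion may differ in detail).  The message content is made
up for illustration:

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# Hypothetical input: raw, unencoded UTF-8 bytes in the From header,
# which is not RFC compliant.
raw = b"From: p\xc3\xb6stal\nSubject: test\n\nbody\n"

# Parsing does not raise on the non-ASCII bytes.
msg = email.message_from_bytes(raw)

# Flatten back to bytes on the output side.
out = BytesIO()
BytesGenerator(out).flatten(msg)
flattened = out.getvalue()

# The original header bytes are still present in the output.
print(b"From: p\xc3\xb6stal" in flattened)
```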

So the only parsing issue is if Mailman cares about *the non-ASCII
bytes* in the headers it cares about.  If it has to modify headers that
contain non-ASCII bytes (for example, addresses and Subject) and cares
about preserving the non-ASCII bytes, then there is indeed an issue;
see previous email for a possible way around that.

> we're not talking about moderator-level admins here; we're talking
> about the Big Cheese with access to the command line on the list
> host.)  That's substantially worse than the current system, where (in
> theory, and in actual practice where it distributes its own version of
> email) it can trap the Unicode exception on a per-header basis.

I thought Mailman no longer distributed its own version of email?
And the email API currently promises not to raise during parsing,
which is a contract my patch does not change.

> I also worry about the implications for backwards compatibility.
> Eventually email-N needs to handle non-conforming mail in a sensible
> way, or anybody who gets spam (ie, everybody) and wants a reliable
> email system will need to implement their own.  If you punt completely
> on handling non-conforming mail now, when is it going to be done?  And

We're (in the current patch) not punting on handling non-conforming
email, we're punting on handling non-conforming bytes *if the headers
that contain them need to be modified*.  The headers can still be
modified, you just (currently) lose the non-ASCII bytes in the process.

> when it is done, will the backward-compatible interface be able to
> access the robust implementation, or will people who want robust APIs
> have to use rather different ones?  The way you're going right now, I
> have to worry about the answer to the second question, at least.

Well, this is still theory given the current state of the email6
code, but I *think* that working email5 code, even after this patch,
will continue to work using email6's backward compatibility interface.
And robustness is not the issue; only extended-beyond-the-RFCs handling
of non-conforming bytes would be an issue.

*But*, as I implied in my previous email, if we allow the surrogates
out so that custom header parsers can use them, then making *that*
code continue to work may require an extra layer in the compatibility
interface to produce the surrogateescaped strings.  Still, at the moment
I can't see any theoretical reason why that would not be possible,
so it may be worth the risk.
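
For readers unfamiliar with the mechanism, here is a small sketch of what a
"surrogateescaped string" is, using only the built-in codec error handler.
The header value is a made-up example, and this shows the encoding trick
itself, not the email package's internal use of it:

```python
# Hypothetical raw header line containing two non-ASCII bytes.
raw = b"Subject: p\xc3\xb6stal"

# Decoding with surrogateescape never fails: each undecodable byte
# 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")

# Collect the escaped positions (code points U+DC80..U+DCFF).
escaped = [c for c in text if "\udc80" <= c <= "\udcff"]

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("ascii", errors="surrogateescape")
```

A custom header parser that receives such a string can work on the ASCII
parts as text and still recover the original bytes when it needs them.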

>  > [*] Why '?' and not the unicode invalid character character?  Well, the
>  > email5 Generator.flatten can be used to generate data for transmission over
>  > the wire *if* the source is RFC compliant and 7bit-only, and this would
>  > be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects
>  > ASCII-only strings as input!).  So the data generated by Generator.flatten
>  > should not include unicode...
> 
> I don't understand this at all.  Of course the byte stream generated
> by Generator.flatten won't contain Unicode (in the headers, anyway);
> it will contain only ASCII (that happens to conform to QP or Base64
> encoding of Unicode in some appropriate UTF in many cases).  Why is
> U+FFFD REPLACEMENT CHARACTER any different from any other non-ASCII
> character in this respect?
>
> (Surely you are not saying that Generator.flatten can't DTRT with
> non-ASCII content *at all*?)

Yes, that is *exactly* what I am saying:

>>> import email
>>> m = email.message_from_string("""\
... From: pöstal
...   
... """)
>>> str(m)
Traceback (most recent call last):
  ....
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128)

Remember, email5 is a direct translation of email4, and email4 only
handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-ride-
fine-we'll-pass-them-along.  So if you want to put non-ASCII
data into a message you have to encode it properly to ASCII in
exactly the same way that you did in email4:

>>> import email.message, email.header
>>> m = email.message.Message()
>>> m['From'] = email.header.Header("pöstal", charset='utf-8')
>>> str(m)
'From: =?utf-8?q?p=C3=B6stal?=\n\n'
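
As a sketch of the receiving side, the RFC 2047 encoded word can be turned
back into text with email.header.decode_header.  The variable names and the
joining logic are illustrative assumptions, not part of the patch:

```python
from email.header import Header, decode_header

# Encode a non-ASCII value into an ASCII-only RFC 2047 encoded word.
h = Header("p\u00f6stal", charset="utf-8")
encoded = h.encode()

# decode_header returns a list of (value, charset) pairs; the value is
# bytes for encoded words and may be str for plain ASCII parts.
parts = decode_header(encoded)

decoded = "".join(
    part.decode(charset or "ascii") if isinstance(part, bytes) else part
    for part, charset in parts
)

print(encoded)   # ASCII only, safe to put on the wire
print(decoded)   # the original text
```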

> The only thing I can think of is that you might not want to introduce
> non-ASCII characters into a string that looks like it might simply be
> corrupted in transmission (eg, it contains only one non-ASCII byte).
> That's reasonable; there are a lot of people who don't have to deal
> with anything but ASCII and occasionally Latin-1, and they don't like
> having Unicode crammed down their throats.
> 
>  > which raises a problem for CTE 8bit sections
>  > that the patch doesn't currently address.
> 
> AFAIK, there's no requirement, implied or otherwise, that a conforming
> implementation *produce* CTE 8bit.  So just don't do that; that will
> keep smtplib happy, no?

Yes, exactly.  I need to fix the patch to recode using, say,
quoted-printable in that case.  DecodedGenerator could still produce the
unicode, though, which is what I believe we want.  (Although that raises
the question of whether DecodedGenerator should also decode the RFC2047
encoded headers....but that raises a backward compatibility issue).
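
To illustrate what recoding an 8bit body to quoted-printable means at the
byte level, here is a sketch using the standard quopri module.  This is only
the transfer encoding itself on a made-up body; it is not how the patch or
the Generator does (or will do) the recoding:

```python
import quopri

# Hypothetical 8bit body: UTF-8 text containing non-ASCII bytes.
body_8bit = b"p\xc3\xb6stal\n"

# Quoted-printable represents each non-ASCII byte as =XX, so the
# result is 7bit-clean.
body_qp = quopri.encodestring(body_8bit)

# Decoding restores the original bytes.
round_trip = quopri.decodestring(body_qp)

print(body_qp)
print(all(b < 128 for b in body_qp))
```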

--David