[Python-Dev] Patch making the current email package (mostly) support bytes
Stephen J. Turnbull
stephen at xemacs.org
Wed Oct 6 21:40:03 CEST 2010
R. David Murray writes:
> So the only parsing issue is if Mailman cares about *the non-ASCII
> bytes* in the headers it cares about. If it has to modify headers that
> contain non-ASCII bytes (for example, addresses and Subject) and cares
> about preserving the non-ASCII bytes, then there is indeed an issue;
> see previous email for a possible way around that.
> I thought mailman no longer distributed its own version of email?
I believe so; the point is that it could do so again.
> And the email API currently promises not to raise during parsing,
> which is a contract my patch does not change.
Which is a contract that has historically been broken frequently.
Unhandled UnicodeErrors have been one of the most common causes of
queue stoppage in Mailman (exceeded only by configuration errors
AFAICS). I haven't seen any reports for a while, but with the email
package being reengineered from the ground up, the possibility of
regression can't be ignored.
Granted, there should be no regression problem in the current model
for Email5, AIUI.
> We're (in the current patch) not punting on handling non-conforming
> email, we're punting on handling non-conforming bytes *if the headers
> that contain them need to be modified*. The headers can still be
> modified, you just (currently) lose the non-ASCII bytes in the process.
Modified *or examined*. I can't think of any important applications
offhand that *need* to examine the non-ASCII bytes (in particular,
Mailman doesn't need to do that). Verbatim copying of the bytes
themselves is almost always the desired usage.
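
A minimal sketch of the verbatim-copy case, assuming the bytes interface as it later shipped in the standard library (`email.message_from_bytes` in Python 3.2+, `Message.as_bytes` in 3.4+) and the default compat32 policy. The sample message is made up for illustration. An ASCII-only header is modified and the header carrying undeclared non-ASCII bytes is never examined:

```python
import email

# Non-conforming input: raw UTF-8 bytes in a header, no RFC 2047 encoding.
raw = b"From: p\xc3\xb6stal\nSubject: test\n\nbody\n"

# Parsing is expected not to raise, even on non-ASCII header bytes.
msg = email.message_from_bytes(raw)

# Modify an ASCII-only header in place; leave the From header untouched.
msg.replace_header("Subject", "changed")

# Serialize back to bytes; the untouched header bytes are copied through.
out = msg.as_bytes()
print(out)
```

Only headers that are left alone are copied verbatim here; what happens to a non-ASCII header that the application rewrites is the open question in this thread.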
> And robustness is not the issue, only extended-beyond-the-RFCs handling
> of non-conforming bytes would be an issue.
And with that, I'm certain that Jon Postel is really dead. :-(
> > (Surely you are not saying that Generator.flatten can't DTRT with
> > non-ASCII content *at all*?)
> Yes, that is *exactly* what I am saying:
> >>> m = email.message_from_string("""\
> ... From: pöstal
> ... """)
> >>> str(m)
> Traceback (most recent call last):
> UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128)
But that's not interesting; you did that with Python 3. We want to
know what people porting from Python 2 will expect. So, in 2.5.5 or
2.6.6 on Mac, with email v4.0.2, it *doesn't* raise, it returns:
wideload:~ 4:14$ python
Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57)
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> m=email.message_from_string('From: pöstal\n\n')
>>> str(m)
'From nobody Thu Oct 7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n'
That's hardly helpful! Surely we can and should do better than that
now, especially since UTF-8 (with a proper CTE) is now almost
universally acceptable to MUAs. When would it be a problem for that
call to return the following?
'From nobody Thu Oct 7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n'
> Remember, email5 is a direct translation of email4, and email4 only
> handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-
> -ride-fine-we'll-pass-them-along. So if you want to put non-ASCII
> data into a message you have to encode it properly to ASCII in
> exactly the same way that you did in email4:
But if you do it right, then it will still work in a version that just
encodes non-ASCII characters in UTF-8 with the appropriate CTE. Since
you'll never be passing it non-ASCII characters, it's already ASCII
and UTF-8, and no CTE will be needed.
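
A rough sketch of that behavior, using `email.header.Header` from the standard library. The helper name `encode_header_value` is hypothetical and not part of the email package: ASCII text passes through unchanged, and anything else becomes a UTF-8 RFC 2047 encoded word.

```python
from email.header import Header, decode_header


def encode_header_value(text):
    """Hypothetical helper: return text as-is if ASCII, else RFC 2047 encode it."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII present: emit a UTF-8 encoded word (pure ASCII on the wire).
        return Header(text, "utf-8").encode()
    # Already ASCII (and therefore also valid UTF-8): no encoding needed.
    return text


plain = encode_header_value("postal")          # correctly prepared ASCII input
encoded = encode_header_value("p\u00f6stal")   # non-ASCII input

print(plain)
print(encoded)
print(decode_header(encoded))
```

Code that already encodes its headers correctly only ever takes the ASCII branch, so its output would not change.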
> Yes, exactly. I need to fix the patch to recode using, say,
> quoted-printable in that case.
It really should check for proportions of non-ASCII. QP would be
horrible for Japanese or Chinese.
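
To illustrate the proportion check, here is a hedged sketch. The function `choose_cte` and its 1/6 threshold are assumptions made for this example, not anything in the patch. The threshold is the rough break-even point: QP costs about 3 characters per non-ASCII byte and 1 per ASCII byte, while base64 costs about 4/3 per byte, so 1 + 2r = 4/3 gives r = 1/6 (ignoring line wrapping and characters such as `=` that QP must also escape).

```python
import base64
import quopri


def choose_cte(data: bytes) -> str:
    """Hypothetical heuristic: pick a CTE from the share of non-ASCII bytes."""
    high = sum(1 for b in data if b >= 0x80)
    if high == 0:
        return "7bit"
    # Above roughly 1/6 non-ASCII bytes, base64 is the more compact choice.
    return "base64" if high / len(data) > 1 / 6 else "quoted-printable"


# "Japanese text" in Japanese: every UTF-8 byte is non-ASCII.
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8".encode("utf-8")
# Mostly ASCII with a single non-ASCII character.
mostly_ascii = "p\u00f6stal is mostly ASCII text".encode("utf-8")

qp_len = len(quopri.encodestring(japanese))
b64_len = len(base64.b64encode(japanese))

print(choose_cte(japanese), qp_len, b64_len)
print(choose_cte(mostly_ascii))
```

For the Japanese sample the QP form is much longer than the base64 form, and it is unreadable as well, which is the objection above.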
> DecodedGenerator could still produce the unicode, though, which is
> what I believe we want. (Although that raises the question of
> whether DecodedGenerator should also decode the RFC2047 encoded
> headers....but that raises a backward compatibility issue).
Can't really help you there. While I would want the RFC 2047 headers
decoded if I were writing new code (which is generally the case for
me), I haven't really wrapped my head around the issues of porting old
code using Python 2 str to Python 3 str here. My intuition says "no
problem" (there won't be any MIME-words so the app won't try to decode
them), but I'm not real sure of that. ;-)
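
For reference, a small sketch of what decoding RFC 2047 headers looks like with the existing `email.header` functions. The wrapper name `decode_mime_words` is hypothetical. A header with no MIME-words comes back unchanged, which is the case the "no problem" intuition relies on:

```python
from email.header import decode_header, make_header


def decode_mime_words(value):
    """Hypothetical wrapper: decode any RFC 2047 encoded words in a header value."""
    return str(make_header(decode_header(value)))


decoded = decode_mime_words("=?UTF-8?Q?p=C3=B6stal?=")  # contains a MIME-word
untouched = decode_mime_words("postal")                  # plain ASCII header

print(decoded)
print(untouched)
```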