Re: [Python-Dev] email package status in 3.X

[copied to pydev from email-sig because of the broader scope]
Well, it looks like I've stumbled onto the "other shoe" on this issue--that the email package's problems are also apparently behind the fact that CGI binary file uploads don't work in 3.1 (http://bugs.python.org/issue4953). Yikes.
I trust that people realize this is a show-stopper for broader Python 3.X adoption. Why 3.0 was rolled out anyhow is beyond me; it seems that it would have been better if Python developers had gotten their own code to work with 3.X, before expecting the world at large to do so.
FWIW, after rewriting Programming Python for 3.1, 3.x still feels a lot like a beta to me, almost 2 years after its release. How did this happen? Maybe nobody is using 3.X enough to care, but I have a feeling that issues like this are part of the reason why.
No offense to people who obviously put in an incredible amount of work on 3.X. As someone who remembers 0.X, though, it's hard not to find the current situation a bit disappointing.
--Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)
-----Original Message-----
From: lutz@rmi.net
To: "R. David Murray" <rdmurray@bitdance.com>
Subject: Re: email package status in 3.X
Date: Sun, 13 Jun 2010 15:30:06 -0000
Come to think of it, here was another oddness I just recalled: this may have been reported already, but header decoding returns mixed types depending upon the structure of the header. Converting to a str for display isn't too difficult to handle, but this seems a bit inconsistent and contrary to Python's type neutrality:
from email.header import decode_header
S1 = 'Man where did you get that assistant?'
S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='
# str: don't decode()
decode_header(S1)
[('Man where did you get that assistant?', None)]
# bytes: do decode()
decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]
# bytes: do decode(), using raw-unicode-escape applied in package
decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]
I can work around this with the following code, but it feels a bit too tightly coupled to the package's internal details (further evidence that email.* can be made to work as is today, even if it may be seen as less than ideal aesthetically):
parts = email.header.decode_header(rawheader)
decoded = []
for (part, enc) in parts:                      # for all substrings
    if enc == None:                            # part unencoded?
        if not isinstance(part, bytes):        # str: full hdr unencoded
            decoded += [part]                  # else do unicode decode
        else:
            decoded += [part.decode('raw-unicode-escape')]
    else:
        decoded += [part.decode(enc)]
return ' '.join(decoded)
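As a self-contained variant of the workaround above, here is a minimal sketch that keys on the part's type rather than on the charset. The name header_to_str is hypothetical and not part of the email package, and the whitespace normalisation is an assumption added because releases may differ on whether unencoded fragments keep their surrounding spaces.

```python
from email.header import decode_header


def header_to_str(rawheader):
    """Decode a raw header into one str, whatever part types come back.

    Hypothetical helper for illustration; not part of the email package.
    """
    decoded = []
    for part, enc in decode_header(rawheader):
        if isinstance(part, bytes):
            # Either an encoded word (enc is set) or an unencoded
            # fragment of a mixed header (enc is None).
            decoded.append(part.decode(enc or 'raw-unicode-escape'))
        else:
            # Fully unencoded header: decode_header hands back a str.
            decoded.append(part)
    # Join the parts, then collapse runs of whitespace so the result does
    # not depend on whether fragments kept their surrounding spaces.
    return ' '.join(' '.join(decoded).split())


S1 = 'Man where did you get that assistant?'
S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='
for header in (S1, S2, S3):
    print(header_to_str(header))
```

A charset name that Python does not recognise would still raise LookupError here; a real client would probably want to catch that and fall back to a default.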
Thanks, --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)
-----Original Message-----
From: lutz@rmi.net
To: "R. David Murray" <rdmurray@bitdance.com>
Subject: Re: email package status in 3.X
Date: Sat, 12 Jun 2010 16:52:32 -0000
Hi David,
All sounds good, and thanks again for all your work on this.
I appreciate the difficulties of moving this package to 3.X in a backward-compatible way. My suggestions stem from the fact that it does work as is today, albeit in a less than ideal way.
That, and I'm seeing that Python 3.X in general is still having a great deal of trouble gaining traction in the "real world" almost 2 years after its release, and I'd hate to see further disincentives for people to migrate. This is a bigger issue than both the email package and this thread, of course.
3) Type-dependent text part encoding
... So, in the next releases of Python all MIMEText input should be string, and it will fail if you pass bytes. I consider this as email previously not living up to its published API, but do you think I should hack in a way for it to accept bytes too, for backward compatibility in the 3 line?
Decoding can probably be safely delegated to package clients. Typical email clients will probably have str for display of the main text. They may wish to read attachments in binary mode, but can always read in text mode instead or decode manually, because they need a known encoding to send the part correctly (my client has to ask or use configurations in some cases).
B/W compatibility probably isn't a concern; I suspect that my temporary workaround will still work with your patch anyhow, and this code didn't work at all for some encodings before.
There are some additional cases that now require decoding per mail headers today due to the str/bytes split, but these are just a normal artifact of supporting Unicode character sets in general, and seem like issues for package clients to resolve (e.g., the bytes returned for decoded payloads in 3.X didn't play well with existing str-based text processing code written for 2.X).
I'm not following you here. Can you give me some more specific examples? Even if these "normal artifacts" must remain with the current API, I'd like to make things as easy as practical when using the new API.
This was just a general statement about things in my own code that didn't jibe with the 3.X string model. For instance, line wrapping logic assumed str; tkinter text widgets do much better rendering str than the bytes fetched for decoded payloads; and my Pyedit text editor component had to be overhauled to handle display/edit/save of payloads of arbitrary encodings. If I remember any more specific issues with the email package itself, I'll forward your way.
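To illustrate the kind of adjustment described above, here is a minimal sketch of turning the bytes returned for a decoded payload into str before handing it to str-based code such as line wrapping. The helper name payload_text and the latin-1 fallback are assumptions for illustration, not something taken from PyMailGUI.

```python
import textwrap
from email.mime.text import MIMEText


def payload_text(part, fallback='latin-1'):
    """Return a text part's payload as str (hypothetical helper)."""
    raw = part.get_payload(decode=True)          # bytes after transfer decoding
    charset = part.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeError):
        # Unknown or wrong charset in the headers: fall back rather than fail.
        return raw.decode(fallback, errors='replace')


part = MIMEText('h\u00e9llo w\u00f6rld, this is a long line of body text', _charset='utf-8')
text = payload_text(part)
print(textwrap.fill(text, width=20))             # str-based wrapping now works
```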
I'll watch for an opportunity to get the book's new PyMailGUI client code to you as a candidate test case, but please ping me about it later if I haven't acted on this. It works well, but largely because of all the work that went into the email package underlying it.
Thanks, --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)
-----Original Message-----
From: "R. David Murray" <rdmurray@bitdance.com>
To: lutz@rmi.net
Subject: Re: email package status in 3.X
Date: Thu, 10 Jun 2010 10:18:48 -0400
On Thu, 10 Jun 2010 09:21:52 -0400, lutz@rmi.net wrote:
In other words, some of my concern may have been a bit premature. I hope that in the future we'll either strive for compatibility or keep the current version around; it's a lot of very useful code.
The plan is to have a compatibility layer that will accept calls based on the old API and forward appropriately to the new API. So far I'm thinking I can succeed in doing this in a fairly straightforward manner, but I won't know for sure until I get some more pieces in place.
In fact, I recommend that any new email package be named distinctly,
I'm going to avoid that if I can (though the PyPI package will be named email6 when we publish it for public testing). If, however, it turns out that I can't correctly support both the old and the new API, then I'll have to do that.
and that the current package be retained for a number of releases to come. After all the breakages that 3.X introduced in general, doing the same to any email-based code seems a bit too much, especially given that the current package is largely functional as is. To me, after having just used it extensively, fixing its few issues seems a better approach than starting from scratch.
Well, the thing is, as you found, existing 2.x code needs to be fixed to correctly handle the distinction between strings and bytes no matter what. The goal is to make it easier to write correct programs, while providing the compatibility layer to make porting smoother. But I doubt that any non-trivial 2.x email program will port without significant changes, even if the compatibility layer is close to 100% compatible with the current Python3 email package, simply because the previous conflation of text and bytes must be untangled in order to work correctly in Python3, and email involves lots of transitions between text and bytes.
As for "starting from scratch", it is true that the current plan involves considerable changes in the recommended API (in the direction of greater flexibility and power), but I'm hoping that significant portions of the code will carry forward with minor changes, and that this will make it easier to support the old API.
As far as other issues, the things I found are described below my signature. I don't know what the utf-8 issue is that you refer to; I'm able to parse and send with this encoding as is without problems (both payloads and headers), but I'm probably not using the interfaces you fixed, and this may be the same as one of the items listed.
It is, see below.
Another thought: it might be useful to use the book's email client as a sort of test case for the package; it's much more rigorous in the new edition because it now has to be, given 3.X's Unicode model (it's about 4,900 lines of code, though not all is email-related). I'd be happy to donate the code as soon as I find out what the copyright will be this time around; it will be at O'Reilly's site this Fall in any event.
That would be great. I am planning to write my own sample app to demonstrate the new API, but if I can use yours to test the compatibility layer that will help a lot, since I otherwise have no Python3 email application to test against unless I port something from Python2.
Major issues I found...
------------------------------------------------------------------
1) Str required for parsing, but bytes returned from poplib
The initial decode from bytes to str of full mail text; in retrospect, probably not a major issue, since original email standards called for ASCII. An 8-bit encoding like Latin-1 is probably sufficient for most conforming mails. For the book, I try a set of different encodings, beginning with an optional configuration module setting, then ascii, latin-1, and utf-8; this is probably overkill, but a GUI has to be defensive.
This works (mostly) for conforming email, but some important Python email applications need to deal with non-conforming email. That's where the inability to parse bytes directly really causes problems.
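The fallback sequence described in item 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: decode_mail_bytes and its configured parameter are hypothetical names, not the book's code, and the candidate order simply mirrors the one given above.

```python
import email


def decode_mail_bytes(raw, configured=None):
    """Decode full mail text fetched as bytes, trying encodings in order.

    Hypothetical helper: 'configured' stands in for an optional
    configuration-module setting. Returns (text, encoding_used).
    """
    candidates = ([configured] if configured else []) + ['ascii', 'latin-1', 'utf-8']
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Not reached while latin-1 is in the list, because latin-1 maps every
    # byte value; this also means the utf-8 entry after it is never tried.
    raise UnicodeError('no candidate encoding matched')


text, used = decode_mail_bytes(b'Subject: hi\r\n\r\nbody text')
message = email.message_from_string(text)        # the parser wants str
print(used, message['Subject'])
```

Because the Latin-1 step always succeeds, non-conforming mail is decoded without error but possibly into the wrong characters, which is one way to read the concern raised in the reply above.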
2) Binary attachments encoding
The binary attachments byte-to-str issue that you've just fixed. As I mentioned, I worked around this by passing in a custom encoder that calls the original and runs an extra decode step. Here's what my fix looked like in the book; your patch may do better, and I will minimally add a note about the 3.1.3 and 3.2 fix for this:
Yeah, our patch was a lot simpler since we could fix the encoding inside the loop producing the encoded lines :)
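For readers who want a picture of the approach described in item 2, here is a minimal sketch of a custom encoder that calls the stock one and then runs an extra decode step. It is not the book's code, which is not reproduced in this message; encode_base64_as_str is a hypothetical name, and on releases that already contain the fix the extra step is simply a no-op.

```python
from email import encoders
from email.mime.application import MIMEApplication


def encode_base64_as_str(msg):
    """Base64-encode the payload, then make sure it is stored as str.

    Hypothetical wrapper for illustration only.
    """
    encoders.encode_base64(msg)                  # the original encoder
    payload = msg.get_payload()
    if isinstance(payload, bytes):
        # Extra decode step: base64 text is pure ASCII by construction.
        msg.set_payload(payload.decode('ascii'))


attachment = MIMEApplication(b'\x00\x01\xff', _encoder=encode_base64_as_str)
print(attachment.as_string())
```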
3) Type-dependent text part encoding
There's a str/bytes confusion issue related to Unicode encodings in text payload generation: some encodings require the payload to be str, but others expect bytes. Unfortunately, this means that clients need to know how the package will react to the encoding that is used, and special-case based upon that.
This was the UTF-8 bug I fixed. I shouldn't have called it "the UTF-8 bug", because it applies equally to the other charsets that use base64, as you note. I called it that because UTF-8 was where the problem was noticed and is mentioned in the title of the bug report.
I had a suspicion that the quoted-printable encoding wasn't being done correctly either, so to hear that it is working for you is good news. There may still be bugs to find there, though.
So, in the next releases of Python all MIMEText input should be string, and it will fail if you pass bytes. I consider this as email previously not living up to its published API, but do you think I should hack in a way for it to accept bytes too, for backward compatibility in the 3 line?
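One way a client could follow the "MIMEText input should be string" rule regardless of how its text arrives is sketched below. The helper make_text_part is a hypothetical name, and decoding bytes input with the same charset that will be declared on the part is an assumption made for this example.

```python
from email.mime.text import MIMEText


def make_text_part(text, charset='utf-8'):
    """Build a text/plain part, always passing str to MIMEText.

    Hypothetical helper: bytes input is decoded first, assuming it is
    encoded in the charset that will be declared on the part.
    """
    if isinstance(text, bytes):
        text = text.decode(charset)
    return MIMEText(text, _subtype='plain', _charset=charset)


part = make_text_part('caf\u00e9 au lait')
print(part.get_content_charset())
print(part.get_payload(decode=True).decode('utf-8'))
```

With this shape the client no longer needs to know which transfer encoding the package picks for a given charset.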
There are some additional cases that now require decoding per mail headers today due to the str/bytes split, but these are just a normal artifact of supporting Unicode character sets in general, and seem like issues for package clients to resolve (e.g., the bytes returned for decoded payloads in 3.X didn't play well with existing str-based text processing code written for 2.X).
I'm not following you here. Can you give me some more specific examples? Even if these "normal artifacts" must remain with the current API, I'd like to make things as easy as practical when using the new API.
Thanks for all your feedback!
--David

On Thu, Jun 17, 2010 at 6:48 AM, <lutz@rmi.net> wrote:
I trust that people realize this is a show-stopper for broader Python 3.X adoption. Why 3.0 was rolled out anyhow is beyond me; it seems that it would have been better if Python developers had gotten their own code to work with 3.X, before expecting the world at large to do so.
FWIW, after rewriting Programming Python for 3.1, 3.x still feels a lot like a beta to me, almost 2 years after its release. How did this happen? Maybe nobody is using 3.X enough to care, but I have a feeling that issues like this are part of the reason why.
No offense to people who obviously put in an incredible amount of work on 3.X. As someone who remembers 0.X, though, it's hard not to find the current situation a bit disappointing.
Agreed, but the binary/text distinction in 2.x (or rather, the lack thereof) makes the unicode handling situation so hopelessly confused that there is a lot of 2.x code (including in the standard library) that silently mixes the two, often without really testing the consequences (as clearly happened here).
3.x was rolled out anyway because the vast majority of it works. Obviously people affected by the problems specific to the email package and any other binary vs text parsing problems that are still lingering are out of luck at the moment, but leaving 3.x sitting on a shelf indefinitely would hardly have inspired anyone to clean it up.
My personal perspective is that a lot of that code was likely already broken in hard to detect ways when dealing with mixed encodings - releasing 3.x just made the associated errors significantly easier to detect.
If we end up being able to add your email client code to the standard library's unit test suite, that should help the situation immensely.
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
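A small illustration of the point about errors becoming easier to detect: in Python 3, code that mixes text and bytes fails immediately with TypeError instead of silently producing a mixed result. The function names here are hypothetical and exist only for this sketch.

```python
def build_header_line(name, value):
    """Join a header name and value; both are expected to be str."""
    return name + ': ' + value


def mixing_is_rejected():
    """Return True if combining str and bytes raises TypeError."""
    try:
        build_header_line('Subject', b'hello')   # bytes slipped in by mistake
    except TypeError:
        return True
    return False


print(mixing_is_rejected())
print(build_header_line('Subject', b'hello'.decode('ascii')))
```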

Nick Coghlan <ncoghlan@gmail.com> wrote:
My personal perspective is that a lot of that code was likely already broken in hard to detect ways when dealing with mixed encodings - releasing 3.x just made the associated errors significantly easier to detect.
I have to agree with this, and not just about encodings. I think much of the stdlib code dealing with all aspects of HTTP (urllib and the http package which now includes cgi) is kind of shaky. And it affects (infects) other parts of the stdlib, too; sockets are hacked to support the read-after-close paradigm that httplib uses, for instance. Which means that SSL and other socket-using code also has to support it, etc. Some of this was cleaned up in the move to 3.x, but more work needs to be done. Kudos to the folks working on httplib2 (http://code.google.com/p/httplib2/) and WSGI.
There's a related meta-issue having to do with antique protocols. FTP, for instance, was designed when the Internet had only 19 nodes connected together with custom-built refrigerator-sized routers. A very early experiment in application protocols. It does a few odd things that we've since learned to be inefficient/unwise/unnecessary. Does it make sense that Python support every part of it? On the other hand, it was fairly static when the Python support was added (unlike HTTP, which was under very active development!) so that module is pretty robust.
Bill

2010/6/17 Bill Janssen <janssen@parc.com>:
There's a related meta-issue having to do with antique protocols.
Can I ask what meta-issue you are talking about, exactly?
FTP, for instance, was designed when the Internet had only 19 nodes connected together with custom-built refrigerator-sized routers. A very early experiment in application protocols. It does a few odd things that we've since learned to be inefficient/unwise/unnecessary. Does it make sense that Python support every part of it?
Since the FTP protocol is still quite widespread, I'd say it makes a lot of sense. That aside, what parts of urllib/http* are penalized because of FTP support?
--- Giampaolo
http://code.google.com/p/pyftpdlib
http://code.google.com/p/psutil

Giampaolo Rodolà <g.rodola@gmail.com> wrote:
2010/6/17 Bill Janssen <janssen@parc.com>:
There's a related meta-issue having to do with antique protocols.
Can I ask what meta-issue you are talking about, exactly?
Giampaolo, I believe that you and I have already discussed this on one of the FTP issues.
Bill

2010/6/18 Bill Janssen <janssen@parc.com>:
Giampaolo Rodolà <g.rodola@gmail.com> wrote:
2010/6/17 Bill Janssen <janssen@parc.com>:
There's a related meta-issue having to do with antique protocols.
Can I ask what meta-issue you are talking about, exactly?
Giampaolo, I believe that you and I have already discussed this on one of the FTP issues.
Bill
I only remember a discussion in which I was against removing OOB data support from asyncore in order to support certain parts of the FTP protocol using it, but that's all. I don't see how urllib or any other stdlib module is supposed to be penalized by the FTP protocol in any way.
--- Giampaolo
http://code.google.com/p/pyftpdlib
http://code.google.com/p/psutil

On Jun 16, 2010, at 08:48 PM, lutz@rmi.net wrote:
Well, it looks like I've stumbled onto the "other shoe" on this issue--that the email package's problems are also apparently behind the fact that CGI binary file uploads don't work in 3.1 (http://bugs.python.org/issue4953). Yikes.
I trust that people realize this is a show-stopper for broader Python 3.X adoption.
We know it, we have extensively discussed how to fix it, we have IMO a good design, and we even have someone willing and able to tackle the problem. We need to find a sufficient source of funding to enable him to do the work it will take, and so far that's been the biggest stumbling block. It will take a focused and determined effort to see this through, and it's obvious that volunteers cannot make it happen. I include myself in the latter category, as I've tried and failed at least twice to do it in my spare time. -Barry

On Thu, Jun 17, 2010 at 08:43, Barry Warsaw <barry@python.org> wrote:
On Jun 16, 2010, at 08:48 PM, lutz@rmi.net wrote:
Well, it looks like I've stumbled onto the "other shoe" on this issue--that the email package's problems are also apparently behind the fact that CGI binary file uploads don't work in 3.1 (http://bugs.python.org/issue4953). Yikes.
I trust that people realize this is a show-stopper for broader Python 3.X adoption.
We know it, we have extensively discussed how to fix it, we have IMO a good design, and we even have someone willing and able to tackle the problem. We need to find a sufficient source of funding to enable him to do the work it will take, and so far that's been the biggest stumbling block. It will take a focused and determined effort to see this through, and it's obvious that volunteers cannot make it happen. I include myself in the latter category, as I've tried and failed at least twice to do it in my spare time.
And in general I think this is the reason some modules have not transitioned as well as others: there are only so many of us. The stdlib passes its test suite, but obviously some unit tests do not cover enough of the code in the ways people need it covered.
As for using Python 3 for my code, I do and have since Python 3 became more-or-less usable. I just happen to not work with internet-related stuff in my day-to-day work.
Plus we have needed to maintain FOUR branches for a while. That is a nasty time sink when you are having to port bug fixes and such. It also means that python-dev has been focused on making sure Python 2.7 is a solid release instead of getting to focus on the stdlib in Python 3. This is a nasty chicken-and-egg issue; we could ignore Python 2 and focus on Python 3, but then the community would complain about us not supporting the transition from 2 to 3 better, but obviously focusing on 2 has led to 3 not getting enough TLC.
Once Python 2.7 is done and out the door the entire situation for Python 3 should start to improve as python-dev as a whole will have a chance to begin to focus solely on Python 3.

Barry Warsaw wrote:
On Jun 16, 2010, at 08:48 PM, lutz@rmi.net wrote:
Well, it looks like I've stumbled onto the "other shoe" on this issue--that the email package's problems are also apparently behind the fact that CGI binary file uploads don't work in 3.1 (http://bugs.python.org/issue4953). Yikes.
I trust that people realize this is a show-stopper for broader Python 3.X adoption.
We know it, we have extensively discussed how to fix it, we have IMO a good design, and we even have someone willing and able to tackle the problem. We need to find a sufficient source of funding to enable him to do the work it will take, and so far that's been the biggest stumbling block. It will take a focused and determined effort to see this through, and it's obvious that volunteers cannot make it happen. I include myself in the latter category, as I've tried and failed at least twice to do it in my spare time.
-Barry
Lest the readership think that the PSF is unaware of this issue, allow me to point out that we have already partially funded this effort, and are still offering R. David Murray some further matching funds if he can raise sponsorship to complete the effort (on which he has made a very promising start).
We are also attempting to enable tax-deductible fund raising to increase the likelihood of David's finding support. Perhaps we need to think about a broader campaign to increase the quality of the Python 3 libraries. I find it very annoying that the #python IRC group still has "Don't use Python 3" in its topic. They adamantly refuse to remove it until there is better library support, and they are the guys who see the issues day in, day out, so it is hard to argue with them (and I don't think an autocratic decision-making process would be appropriate).
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
See Python Video! http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" - Ian Dury, 1942-2000

David and his Google Summer of Code student, Shashwat Anand. You can read Shashwat's weekly progress updates at http://l0nwlf.in/ or subscribe to http://twitter.com/l0nwlf for more micro updates.
We have more than 30 paid students working on Python 3 tasks this year, most of them participating under the PSF umbrella but also a few with 3rd party projects such as Mercurial porting those various packages to Py3.
Given all this "on the horizon" work, I think the Py3 package situation will look a lot brighter by Python 3.2's release.
On Thu, Jun 17, 2010 at 10:32 PM, Steve Holden <steve@holdenweb.com> wrote:
Lest the readership think that the PSF is unaware of this issue, allow me to point out that we have already partially funded this effort, and are still offering R. David Murray some further matching funds if he can raise sponsorship to complete the effort (on which he has made a very promising start).
We are also attempting to enable tax-deductible fund raising to increase the likelihood of David's finding support.

On Jun 18, 2010, at 11:32 AM, Steve Holden wrote:
Lest the readership think that the PSF is unaware of this issue, allow me to point out that we have already partially funded this effort, and are still offering R. David Murray some further matching funds if he can raise sponsorship to complete the effort (on which he has made a very promising start).
Right, sorry, I didn't mean to imply the PSF isn't doing anything. More that we need a coordinated effort among all the companies and organizations that use Python to help fund Python 3 library development (and not just in the stdlib). I think the PSF is best suited to coordinating and managing those efforts, and through its tax-exempt status, collecting and distributing donations specifically targeted to Python 3 work.
-Barry

lutz@rmi.net writes:
FWIW, after rewriting Programming Python for 3.1, 3.x still feels a lot like a beta to me, almost 2 years after its release.
Email, of course, is a big wart. But guess what? Python 2's email module doesn't actually work! Sure, the program runs most of the time, but every program that depends on email must acquire inches of armor plate against all the things that can go wrong. You simply can't rely on it to DTRT except in a pre-MIME, pre-HTML, ASCII-only world. Although they're often addressing general problems, these hacks are *not* integrated back into the email module in most cases, but remain app-specific voodoo.
If you live in Kansas, sure, you can concentrate on dodging tornados and completely forget about Unicode and MIME and text/bogus content. For the rest of the world, though, the problem is not Python 3. It's STD 11 (which still points at RFC 822, dated 1982!). It's really inappropriate to point at the email module, whose developers are trying *not* to punt on conformance and robustness, when even the IETF can only "run in circles, scream and shout"!
Maybe there are other problems with Python 3 that deserve to be pointed at, but given the general scarcity of resources I think the email module developers are working on the right things. Unlike many other modules, email really needs to be rewritten from the ground (Python 3) up, because of the centrality of bytes/unicode confusion to all email problems. Python 3 completely changes the assumptions there; a Python 2-style email module really can't work properly. Then on top of that, today we know a lot more about handling issues like text/html content and MIME in general than when the Python 2 email module was designed. New problems have arisen over the period of Python 3 development, like "domain keys", which email doesn't handle out of the box AFAIK, but email for Python 3 should IMHO.
Should Python 3 have been held back until email was fixed? Dunno, but I personally am very glad it was not; where I have a choice, I always use Python 3 now, and have yet to run into a problem.
I expect that to change if I can find the time to get involved in email and Mailman 3 development, of course.<wink>
participants (9)
- Arc Riley
- Barry Warsaw
- Bill Janssen
- Brett Cannon
- Giampaolo Rodolà
- lutz@rmi.net
- Nick Coghlan
- Stephen J. Turnbull
- Steve Holden