[Python-Dev] email package status in 3.X

Wed Jun 16 22:48:49 CEST 2010

[copied to pydev from email-sig because of the broader scope]

Well, it looks like I've stumbled onto the "other shoe" on this
issue--that the email package's problems are also apparently 
behind the fact that CGI binary file uploads don't work in 3.1
(http://bugs.python.org/issue4953).  Yikes.

I trust that people realize this is a show-stopper for broader
Python 3.X adoption.  Why 3.0 was rolled out anyhow is beyond 
me; it seems that it would have been better if Python developers
had gotten their own code to work with 3.X, before expecting the 
world at large to do so.

FWIW, after rewriting Programming Python for 3.1, 3.x still feels
a lot like a beta to me, almost 2 years after its release.  How
did this happen?  Maybe nobody is using 3.X enough to care, but 
I have a feeling that issues like this are part of the reason why.

No offense to people who obviously put in an incredible amount of
work on 3.X.  As someone who remembers 0.X, though, it's hard not
to find the current situation a bit disappointing.

--Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)

> -----Original Message-----
> From: lutz at rmi.net
> To: "R. David Murray" <rdmurray at bitdance.com>
> Subject: Re: email package status in 3.X
> Date: Sun, 13 Jun 2010 15:30:06 -0000
> 
> Come to think of it, here was another oddness I just recalled: this 
> may have been reported already, but header decoding returns mixed types
> depending upon the structure of the header.  Converting to a str for 
> display isn't too difficult to handle, but this seems a bit inconsistent
> and contrary to Python's type neutrality:
> 
> >>> from email.header import decode_header
> >>> S1 = 'Man where did you get that assistant?'
> >>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
> >>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='
> 
> # str: don't decode()
> >>> decode_header(S1)
> [('Man where did you get that assistant?', None)]
> 
> # bytes: do decode()
> >>> decode_header(S2)
> [(b'Man where did you get that assistant?', 'utf-8')]
> 
> # bytes: do decode(), using raw-unicode-escape applied in package
> >>> decode_header(S3)
> [(b'Man where did you get that', None), (b'assistant?', 'utf-8')]
> 
> I can work around this with the following code, but it 
> feels a bit too tightly coupled to the package's internal details
> (further evidence that email.* can be made to work as is today, 
> even if it may be seen as less than ideal aesthetically):
> 
> parts = email.header.decode_header(rawheader)
> decoded = []
> for (part, enc) in parts:                      # for all substrings
>     if enc is None:                            # part unencoded?
>         if not isinstance(part, bytes):        # str: full hdr unencoded
>             decoded += [part]                  # else do unicode decode
>         else:
>             decoded += [part.decode('raw-unicode-escape')]
>     else:
>         decoded += [part.decode(enc)]
> return ' '.join(decoded)
> 
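A sketch of an alternative to the hand-written loop above, assuming a current Python 3 release: the standard library's `email.header.make_header` accepts the mixed str/bytes pairs that `decode_header` returns and yields a single `str`. The helper name `header_to_str` is made up for illustration, and behavior on the 3.1 releases discussed in this thread may differ.

```python
from email.header import decode_header, make_header

def header_to_str(rawheader):
    """Return a header value as str, whatever types decode_header yields."""
    # make_header accepts both str and bytes parts and decodes bytes
    # parts with the charset paired with them (ASCII when it is None).
    return str(make_header(decode_header(rawheader)))

S1 = 'Man where did you get that assistant?'
S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

for raw in (S1, S2, S3):
    print(header_to_str(raw))
```

This keeps client code away from the per-part type checks, at the cost of relying on the package to choose the spacing between parts.
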
> Thanks,
> --Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)
> 
> 
> > -----Original Message-----
> > From: lutz at rmi.net
> > To: "R. David Murray" <rdmurray at bitdance.com>
> > Subject: Re: email package status in 3.X
> > Date: Sat, 12 Jun 2010 16:52:32 -0000
> > 
> > Hi David,
> > 
> > All sounds good, and thanks again for all your work on this.
> > 
> > I appreciate the difficulties of moving this package to 3.X
> > in a backward-compatible way.  My suggestions stem from the fact 
> > that it does work as is today, albeit in a less than ideal way.
> > 
> > That, and I'm seeing that Python 3.X in general is still having
> > a great deal of trouble gaining traction in the "real world" 
> > almost 2 years after its release, and I'd hate to see further 
> > disincentives for people to migrate.  This is a bigger issue
> > than both the email package and this thread, of course.
> > 
> > > > 3) Type-dependent text part encoding
> > > > 
> > > ...
> > > So, in the next releases of Python all MIMEText input should be string,
> > > and it will fail if you pass bytes.  I consider this as email previously
> > > not living up to its published API, but do you think I should hack
> > > in a way for it to accept bytes too, for backward compatibility in the
> > > 3 line?
> > 
> > Decoding can probably be safely delegated to package clients.
> > Typical email clients will probably have str for display of the
> > main text.  They may wish to read attachments in binary mode, but
> > can always read in text mode instead or decode manually, because 
> > they need a known encoding to send the part correctly (my client 
> > has to ask or use configurations in some cases).
> > 
> > Backward compatibility probably isn't a concern; I suspect that my 
> > temporary workaround will still work with your patch anyhow, 
> > and this code didn't work at all for some encodings before.
> > 
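As a sketch of delegating decoding to the client, assuming a current Python 3 release: the client decodes bytes it read in binary mode with an encoding it already knows, then hands `MIMEText` a `str` plus the charset to use for the part. The sample bytes and the choice of Latin-1 and UTF-8 are illustrative assumptions.

```python
from email.mime.text import MIMEText

# Bytes as a client might read them from a file opened in binary mode;
# the client must already know (or ask for) their encoding.
raw = b'caf\xe9 au lait\n'
text = raw.decode('latin-1')            # client-side decode to str

# Pass str to MIMEText and name the charset used for the MIME part.
part = MIMEText(text, 'plain', 'utf-8')

print(part['Content-Type'])
print(part.get_payload(decode=True))    # payload bytes in the MIME charset
```
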
> > > > There are some additional cases that now require decoding per mail 
> > > > headers today due to the str/bytes split, but these are just a 
> > > > normal artifact of supporting Unicode character sets in general,
> > > > and seem like issues for package clients to resolve (e.g., the bytes 
> > > > returned for decoded payloads in 3.X didn't play well with existing 
> > > > str-based text processing code written for 2.X).
> > > 
> > > I'm not following you here.  Can you give me some more specific
> > > examples?  Even if these "normal artifacts" must remain with
> > > the current API, I'd like to make things as easy as practical when
> > > using the new API.
> > 
> > This was just a general statement about things in my own code that
> > didn't jibe with the 3.X string model.  For instance, line wrapping 
> > logic assumed str; tkinter text widgets do much better rendering str 
> > than the bytes fetched for decoded payloads; and my Pyedit text editor
> > component had to be overhauled to handle display/edit/save of payloads 
> > of arbitrary encodings.  If I remember any more specific issues with 
> > the email package itself, I'll forward your way.
> > 
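A minimal sketch of the client-side step described above, assuming a current Python 3 release: the bytes returned for a decoded payload are converted to `str` using the part's declared charset before being passed to str-based code such as line wrapping. The sample message is built inline only so the example is self-contained; the text and the wrap width are arbitrary.

```python
import email
import textwrap
from email.mime.text import MIMEText

# Build and re-parse a small message so the example is self-contained.
original = 'Señor, ' + 'this line is long enough to need wrapping. ' * 3
msg = email.message_from_string(
    MIMEText(original, 'plain', 'utf-8').as_string())

payload = msg.get_payload(decode=True)            # bytes after transfer decoding
charset = msg.get_content_charset() or 'ascii'    # charset declared by the part
text = payload.decode(charset, errors='replace')  # str for display code

print(textwrap.fill(text, width=40))
```
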
> > I'll watch for an opportunity to get the book's new PyMailGUI 
> > client code to you as a candidate test case, but please ping 
> > me about it later if I haven't acted on this.  It works well,
> > but largely because of all the work that went into the email 
> > package underlying it.
> > 
> > Thanks,
> > --Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)
> > 
> > 
> > > -----Original Message-----
> > > From: "R. David Murray" <rdmurray at bitdance.com>
> > > To: lutz at rmi.net
> > > Subject: Re: email package status in 3.X
> > > Date: Thu, 10 Jun 2010 10:18:48 -0400
> > > 
> > > On Thu, 10 Jun 2010 09:21:52 -0400, lutz at rmi.net wrote:
> > > > In other words, some of my concern may have been a bit premature.  
> > > > I hope that in the future we'll either strive for compatibility 
> > > > or keep the current version around; it's a lot of very useful code.
> > > 
> > > The plan is to have a compatibility layer that will accept calls based
> > > on the old API and forward appropriately to the new API.  So far I'm
> > > thinking I can succeed in doing this in a fairly straightforward manner,
> > > but I won't know for sure until I get some more pieces in place.
> > > 
> > > > In fact, I recommend that any new email package be named distinctly, 
> > > 
> > > I'm going to avoid that if I can (though the PyPI package will be
> > > named email6 when we publish it for public testing).  If, however,
> > > it turns out that I can't correctly support both the old and the
> > > new API, then I'll have to do that.
> > > 
> > > > and that the current package be retained for a number of releases to
> > > > come.  After all the breakages that 3.X introduced in general, doing
> > > > the same to any email-based code seems a bit too much, especially 
> > > > given that the current package is largely functional as is.  To me,
> > > > after having just used it extensively, fixing its few issues seems 
> > > > a better approach than starting from scratch.
> > > 
> > > Well, the thing is, as you found, existing 2.x code needs to be fixed to
> > > correctly handle the distinction between strings and bytes no matter what.
> > > The goal is to make it easier to write correct programs, while providing
> > > the compatibility layer to make porting smoother.  But I doubt that any
> > > non-trivial 2.x email program will port without significant changes,
> > > even if the compatibility layer is close to 100% compatible with the
> > > current Python3 email package, simply because the previous conflation
> > > of text and bytes must be untangled in order to work correctly in
> > > Python3, and email involves lots of transitions between text and bytes.
> > > 
> > > As for "starting from scratch", it is true that the current plan involves
> > > considerable changes in the recommended API (in the direction of greater
> > > flexibility and power), but I'm hoping that significant portions of the
> > > code will carry forward with minor changes, and that this will make it
> > > easier to support the old API.
> > > 
> > > > As far as other issues, the things I found are described below my
> > > > signature.  I don't know what the utf-8 issue is that you refer 
> > > > to; I'm able to parse and send with this encoding as is without 
> > > > problems (both payloads and headers), but I'm probably not using the
> > > > interfaces you fixed, and this may be the same as one of the items listed.
> > > 
> > > It is, see below.
> > > 
> > > > Another thought: it might be useful to use the book's email client 
> > > > as a sort of test case for the package; it's much more rigorous in 
> > > > the new edition because it now has to be given 3.X's Unicode model 
> > > > (it's about 4,900 lines of code, though not all is email-related).
> > > > I'd be happy to donate the code as soon as I find out what the 
> > > > copyright will be this time around; it will be at O'Reilly's site
> > > > this Fall in any event.
> > > 
> > > That would be great.  I am planning to write my own sample app to
> > > demonstrate the new API, but if I can use yours to test the compatibility
> > > layer that will help a lot, since I otherwise have no Python3 email
> > > application to test against unless I port something from Python2.
> > > 
> > > > Major issues I found...
> > > > ------------------------------------------------------------------
> > > > 1) Str required for parsing, but bytes returned from poplib
> > > > 
> > > > The initial decode from bytes to str of full mail text; in 
> > > > retrospect, probably not a major issue, since original email 
> > > > standards called for ASCII.  An 8-bit encoding like Latin-1 is
> > > > probably sufficient for most conforming mails.  For the book,
> > > > I try a set of different encodings, beginning with an optional
> > > > configuration module setting, then ascii, latin-1, and utf-8;
> > > > this is probably overkill, but a GUI has to be defensive.
> > > 
> > > This works (mostly) for conforming email, but some important Python email
> > > applications need to deal with non-conforming email.  That's where the
> > > inability to parse bytes directly really causes problems.
> > > 
> > > > 2) Binary attachments encoding
> > > > 
> > > > The binary attachments byte-to-str issue that you've just
> > > > fixed.  As I mentioned, I worked around this by passing in a 
> > > > custom encoder that calls the original and runs an extra decode
> > > > step.  Here's what my fix looked like in the book; your patch 
> > > > may do better, and I will minimally add a note about the 3.1.3
> > > > and 3.2 fix for this:
> > > 
> > > Yeah, our patch was a lot simpler since we could fix the encoding inside
> > > the loop producing the encoded lines :)
> > > 
> > > > 3) Type-dependent text part encoding
> > > > 
> > > > There's a str/bytes confusion issue related to Unicode encodings
> > > > in text payload generation: some encodings require the payload to
> > > > be str, but others expect bytes.  Unfortunately, this means that 
> > > > clients need to know how the package will react to the encoding 
> > > > that is used, and special-case based upon that.  
> > > 
> > > This was the UTF-8 bug I fixed.  I shouldn't have called it "the UTF-8
> > > bug", because it applies equally to the other charsets that use base64,
> > > as you note.  I called it that because UTF-8 was where the problem was
> > > noticed and is mentioned in the title of the bug report.
> > > 
> > > I had a suspicion that the quoted-printable encoding wasn't being done
> > > correctly either, so to hear that it is working for you is good news.
> > > There may still be bugs to find there, though.
> > > 
> > > So, in the next releases of Python all MIMEText input should be string,
> > > and it will fail if you pass bytes.  I consider this as email previously
> > > not living up to its published API, but do you think I should hack
> > > in a way for it to accept bytes too, for backward compatibility in the
> > > 3 line?
> > > 
> > > > There are some additional cases that now require decoding per mail 
> > > > headers today due to the str/bytes split, but these are just a 
> > > > normal artifact of supporting Unicode character sets in general,
> > > > > and seem like issues for package clients to resolve (e.g., the bytes 
> > > > returned for decoded payloads in 3.X didn't play well with existing 
> > > > str-based text processing code written for 2.X).
> > > 
> > > I'm not following you here.  Can you give me some more specific
> > > examples?  Even if these "normal artifacts" must remain with
> > > the current API, I'd like to make things as easy as practical when
> > > using the new API.
> > > 
> > > Thanks for all your feedback!
> > > 
> > > --David
> > > 
> > 
> > 
> > 
> > 
>