[Email-SIG] email package status in 3.X

Sun Jun 13 17:30:06 CEST 2010

Come to think of it, here was another oddness I just recalled: this 
may have been reported already, but header decoding returns mixed types
depending upon the structure of the header.  Converting to a str for 
display isn't too difficult to handle, but this seems a bit inconsistent
and contrary to Python's type neutrality:

>>> from email.header import decode_header
>>> S1 = 'Man where did you get that assistant?'
>>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
>>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

# str: don't decode()
>>> decode_header(S1)
[('Man where did you get that assistant?', None)]

# bytes: do decode()
>>> decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]

# bytes: do decode(), using raw-unicode-escape applied in package
>>> decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]

I can work around this with the following code, but it 
feels a bit too tightly coupled to the package's internal details
(further evidence that email.* can be made to work as is today, 
even if it may be seen as less than ideal aesthetically):

import email.header

def decode_header_to_str(rawheader):               # wrapper name is illustrative
    parts = email.header.decode_header(rawheader)
    decoded = []
    for (part, enc) in parts:                      # for all substrings
        if enc is None:                            # part unencoded?
            if not isinstance(part, bytes):        # str: full hdr unencoded
                decoded += [part]                  # else do unicode decode
            else:
                decoded += [part.decode('raw-unicode-escape')]
        else:
            decoded += [part.decode(enc)]
    return ' '.join(decoded)
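
A possibly less tightly coupled alternative, offered only as a sketch and not as the approach used above: the standard library's email.header.make_header can rebuild a Header object from the decode_header output, and calling str() on the result yields text. This assumes a Python 3 release in which make_header accepts the mixed str/bytes parts shown earlier; the helper name header_to_str is hypothetical.

```python
from email.header import decode_header, make_header

def header_to_str(rawheader):
    # Hypothetical helper: let the package recombine its own decoded parts,
    # so client code never inspects the str/bytes type of each part.
    return str(make_header(decode_header(rawheader)))

for raw in ('Man where did you get that assistant?',
            '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?=',
            'Man where did you get that =?UTF-8?Q?assistant=3F?='):
    print(header_to_str(raw))
```

Whitespace between the unencoded and encoded pieces may differ from the ' '.join() result depending on the Python version, so normalizing spaces before comparing the two approaches is advisable.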

Thanks,
--Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)

> -----Original Message-----
> From: lutz at rmi.net
> To: "R. David Murray" <rdmurray at bitdance.com>
> Subject: Re: email package status in 3.X
> Date: Sat, 12 Jun 2010 16:52:32 -0000
> 
> Hi David,
> 
> All sounds good, and thanks again for all your work on this.
> 
> I appreciate the difficulties of moving this package to 3.X
> in a backward-compatible way.  My suggestions stem from the fact 
> that it does work as is today, albeit in a less than ideal way.
> 
> That, and I'm seeing that Python 3.X in general is still having
> a great deal of trouble gaining traction in the "real world" 
> almost 2 years after its release, and I'd hate to see further 
> disincentives for people to migrate.  This is a bigger issue
> than both the email package and this thread, of course.
> 
> > > 3) Type-dependent text part encoding
> > > 
> > ...
> > So, in the next releases of Python all MIMEText input should be string,
> > and it will fail if you pass bytes.  I consider this as email previously
> > not living up to its published API, but do you think I should hack
> > in a way for it to accept bytes too, for backward compatibility in the
> > 3 line?
> 
> Decoding can probably be safely delegated to package clients.
> Typical email clients will probably have str for display of the
> main text.  They may wish to read attachments in binary mode, but
> can always read in text mode instead or decode manually, because 
> they need a known encoding to send the part correctly (my client 
> has to ask or use configurations in some cases).
> 
> Backward compatibility probably isn't a concern; I suspect that my 
> temporary workaround will still work with your patch anyhow, 
> and this code didn't work at all for some encodings before.
> 
> > > There are some additional cases that now require decoding per mail 
> > > headers today due to the str/bytes split, but these are just a 
> > > normal artifact of supporting Unicode character sets in general,
> > > > and seem like issues for package clients to resolve (e.g., the bytes 
> > > returned for decoded payloads in 3.X didn't play well with existing 
> > > str-based text processing code written for 2.X).
> > 
> > I'm not following you here.  Can you give me some more specific
> > examples?  Even if these "normal artifacts" must remain with
> > the current API, I'd like to make things as easy as practical when
> > using the new API.
> 
> This was just a general statement about things in my own code that
> didn't jibe with the 3.X string model.  For instance, line wrapping 
> logic assumed str; tkinter text widgets do much better rendering str 
> than the bytes fetched for decoded payloads; and my Pyedit text editor
> component had to be overhauled to handle display/edit/save of payloads 
> of arbitrary encodings.  If I remember any more specific issues with 
> the email package itself, I'll forward your way.
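
To illustrate the bytes-versus-str point above, here is a minimal sketch of converting a decoded payload to str before handing it to str-based code such as line wrapping or a text widget. The sample message is made up for the example, and falling back to 'ascii' with errors='replace' is an assumption, not something the thread prescribes.

```python
import base64
import email

# Build a small made-up message whose body is base64-encoded UTF-8 text.
body = base64.b64encode('Café au lait'.encode('utf-8')).decode('ascii')
raw = ('Content-Type: text/plain; charset="utf-8"\n'
       'Content-Transfer-Encoding: base64\n'
       '\n' + body + '\n')

msg = email.message_from_string(raw)
payload = msg.get_payload(decode=True)            # bytes after base64 decoding
charset = msg.get_content_charset() or 'ascii'    # declared charset, if any
text = payload.decode(charset, errors='replace')  # str for display or wrapping
print(type(payload).__name__, type(text).__name__)
```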
> 
> I'll watch for an opportunity to get the book's new PyMailGUI 
> client code to you as a candidate test case, but please ping 
> me about it later if I haven't acted on this.  It works well,
> but largely because of all the work that went into the email 
> package underlying it.
> 
> Thanks,
> --Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)
> 
> 
> > -----Original Message-----
> > From: "R. David Murray" <rdmurray at bitdance.com>
> > To: lutz at rmi.net
> > Subject: Re: email package status in 3.X
> > Date: Thu, 10 Jun 2010 10:18:48 -0400
> > 
> > On Thu, 10 Jun 2010 09:21:52 -0400, lutz at rmi.net wrote:
> > > In other words, some of my concern may have been a bit premature.  
> > > I hope that in the future we'll either strive for compatibility 
> > > or keep the current version around; it's a lot of very useful code.
> > 
> > The plan is to have a compatibility layer that will accept calls based
> > on the old API and forward appropriately to the new API.  So far I'm
> > thinking I can succeed in doing this in a fairly straightforward manner,
> > but I won't know for sure until I get some more pieces in place.
> > 
> > > In fact, I recommend that any new email package be named distinctly, 
> > 
> > I'm going to avoid that if I can (though the PyPI package will be
> > named email6 when we publish it for public testing).  If, however,
> > it turns out that I can't correctly support both the old and the
> > new API, then I'll have to do that.
> > 
> > > and that the current package be retained for a number of releases to
> > > come.  After all the breakages that 3.X introduced in general, doing
> > > the same to any email-based code seems a bit too much, especially 
> > > given that the current package is largely functional as is.  To me,
> > > after having just used it extensively, fixing its few issues seems 
> > > a better approach than starting from scratch.
> > 
> > Well, the thing is, as you found, existing 2.x code needs to be fixed to
> > correctly handle the distinction between strings and bytes no matter what.
> > The goal is to make it easier to write correct programs, while providing
> > the compatibility layer to make porting smoother.  But I doubt that any
> > non-trivial 2.x email program will port without significant changes,
> > even if the compatibility layer is close to 100% compatible with the
> > current Python3 email package, simply because the previous conflation
> > of text and bytes must be untangled in order to work correctly in
> > Python3, and email involves lots of transitions between text and bytes.
> > 
> > As for "starting from scratch", it is true that the current plan involves
> > considerable changes in the recommended API (in the direction of greater
> > flexibility and power), but I'm hoping that significant portions of the
> > code will carry forward with minor changes, and that this will make it
> > easier to support the old API.
> > 
> > > As far as other issues, the things I found are described below my
> > > signature.  I don't know what the utf-8 issue is that you refer 
> > > to; I'm able to parse and send with this encoding as is without 
> > > problems (both payloads and headers), but I'm probably not using the
> > > interfaces you fixed, and this may be the same as one of the items listed.
> > 
> > It is, see below.
> > 
> > > Another thought: it might be useful to use the book's email client 
> > > as a sort of test case for the package; it's much more rigorous in 
> > > the new edition because it now has to be given 3.X's Unicode model 
> > > (it's about 4,900 lines of code, though not all is email-related).
> > > I'd be happy to donate the code as soon as I find out what the 
> > > copyright will be this time around; it will be at O'Reilly's site
> > > this Fall in any event.
> > 
> > That would be great.  I am planning to write my own sample app to
> > demonstrate the new API, but if I can use yours to test the compatibility
> > layer that will help a lot, since I otherwise have no Python3 email
> > application to test against unless I port something from Python2.
> > 
> > > Major issues I found...
> > > ------------------------------------------------------------------
> > > 1) Str required for parsing, but bytes returned from poplib
> > > 
> > > The initial decode from bytes to str of full mail text; in 
> > > retrospect, probably not a major issue, since original email 
> > > standards called for ASCII.  An 8-bit encoding like Latin-1 is
> > > probably sufficient for most conforming mails.  For the book,
> > > I try a set of different encodings, beginning with an optional
> > > configuration module setting, then ascii, latin-1, and utf-8;
> > > this is probably overkill, but a GUI has to be defensive.
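
The fallback order described in item 1 could be sketched roughly as follows. This is not the book's code; the function name decode_full_text and its configured parameter are hypothetical, standing in for the optional configuration module setting.

```python
def decode_full_text(raw, configured=None):
    """Decode raw mail bytes to str, trying candidate encodings in order."""
    # Hypothetical order taken from the description: an optional configured
    # encoding first, then ascii, latin-1, and utf-8.
    candidates = ([configured] if configured else []) + ['ascii', 'latin-1', 'utf-8']
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeError, LookupError):   # bad bytes or unknown codec name
            continue
    raise ValueError('no candidate encoding could decode the message')

print(decode_full_text(b'Subject: test\r\n\r\nplain ASCII body'))
```

Note that latin-1 maps every byte value to a character, so it never fails; in this order the utf-8 attempt is reached only if latin-1 is removed or moved later in the list.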
> > 
> > This works (mostly) for conforming email, but some important Python email
> > applications need to deal with non-conforming email.  That's where the
> > inability to parse bytes directly really causes problems.
> > 
> > > 2) Binary attachments encoding
> > > 
> > > The binary attachments byte-to-str issue that you've just
> > > fixed.  As I mentioned, I worked around this by passing in a 
> > > custom encoder that calls the original and runs an extra decode
> > > step.  Here's what my fix looked like in the book; your patch 
> > > may do better, and I will minimally add a note about the 3.1.3
> > > and 3.2 fix for this:
> > 
> > Yeah, our patch was a lot simpler since we could fix the encoding inside
> > the loop producing the encoded lines :)
> > 
> > > 3) Type-dependent text part encoding
> > > 
> > > There's a str/bytes confusion issue related to Unicode encodings
> > > in text payload generation: some encodings require the payload to
> > > be str, but others expect bytes.  Unfortunately, this means that 
> > > clients need to know how the package will react to the encoding 
> > > that is used, and special-case based upon that.  
> > 
> > This was the UTF-8 bug I fixed.  I shouldn't have called it "the UTF-8
> > bug", because it applies equally to the other charsets that use base64,
> > as you note.  I called it that because UTF-8 was where the problem was
> > noticed and is mentioned in the title of the bug report.
> > 
> > I had a suspicion that the quoted-printable encoding wasn't being done
> > correctly either, so to hear that it is working for you is good news.
> > There may still be bugs to find there, though.
> > 
> > So, in the next releases of Python all MIMEText input should be string,
> > and it will fail if you pass bytes.  I consider this as email previously
> > not living up to its published API, but do you think I should hack
> > in a way for it to accept bytes too, for backward compatibility in the
> > 3 line?
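
As a rough illustration of the "all MIMEText input should be string" rule, the sketch below passes str text for two charsets and lets the package choose the transfer encoding. The sample text is made up, and the behavior shown assumes a Python 3 release that includes the fix discussed here.

```python
from email.mime.text import MIMEText

text = 'Café au lait'   # always str, regardless of the target charset
for charset in ('utf-8', 'latin-1'):
    part = MIMEText(text, _subtype='plain', _charset=charset)
    # Round-trip: undo the transfer encoding, then decode with the same charset.
    roundtrip = part.get_payload(decode=True).decode(charset)
    print(charset, part['Content-Transfer-Encoding'], roundtrip == text)
```

The client never needs to special-case on how a given charset is transfer-encoded; it only needs to know the charset when decoding the bytes that come back.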
> > 
> > > There are some additional cases that now require decoding per mail 
> > > headers today due to the str/bytes split, but these are just a 
> > > normal artifact of supporting Unicode character sets in general,
> > > and seem like issues for package clients to resolve (e.g., the bytes 
> > > returned for decoded payloads in 3.X didn't play well with existing 
> > > str-based text processing code written for 2.X).
> > 
> > I'm not following you here.  Can you give me some more specific
> > examples?  Even if these "normal artifacts" must remain with
> > the current API, I'd like to make things as easy as practical when
> > using the new API.
> > 
> > Thanks for all your feedback!
> > 
> > --David
> > 