[Email-SIG] email package status in 3.X
lutz at rmi.net
Sun Jun 13 17:30:06 CEST 2010
Come to think of it, here was another oddness I just recalled: this
may have been reported already, but header decoding returns mixed types
depending upon the structure of the header. Converting to a str for
display isn't too difficult to handle, but this seems a bit inconsistent
and contrary to Python's type neutrality:
>>> from email.header import decode_header
>>> S1 = 'Man where did you get that assistant?'
>>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
>>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='
# str: don't decode()
>>> decode_header(S1)
[('Man where did you get that assistant?', None)]
# bytes: do decode()
>>> decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]
# bytes: do decode(), using raw-unicode-escape applied in package
>>> decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]
I can work around this with the following code, but it
feels a bit too tightly coupled to the package's internal details
(further evidence that email.* can be made to work as is today,
even if it may be seen as less than ideal aesthetically):
import email.header

def decode_header_text(rawheader):                 # enclosing def assumed; name is illustrative
    parts = email.header.decode_header(rawheader)
    decoded = []
    for (part, enc) in parts:                      # for all substrings
        if enc is None:                            # part unencoded?
            if not isinstance(part, bytes):        # str: full hdr unencoded
                decoded += [part]
            else:                                  # else do unicode decode
                decoded += [part.decode('raw-unicode-escape')]
        else:
            decoded += [part.decode(enc)]
    return ' '.join(decoded)
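
For comparison, here is a minimal sketch of a less tightly coupled alternative. It assumes a recent Python 3 release, where email.header.make_header accepts the list returned by decode_header (str and bytes parts alike) and str() of the resulting Header gives text. The helper name header_to_str is made up for illustration, and the 3.1-era package discussed in this thread may behave differently.

```python
from email.header import decode_header, make_header

def header_to_str(rawheader):
    # decode_header may return str or bytes parts; make_header takes both,
    # so the caller never has to inspect the part types itself.
    return str(make_header(decode_header(rawheader)))

S1 = 'Man where did you get that assistant?'
S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

for raw in (S1, S2, S3):
    print(repr(header_to_str(raw)))
```

The design point is only that type handling stays inside the package; whitespace between adjacent parts is decided by Header itself rather than by a fixed ' '.join.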
Thanks,
--Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)
> -----Original Message-----
> From: lutz at rmi.net
> To: "R. David Murray" <rdmurray at bitdance.com>
> Subject: Re: email package status in 3.X
> Date: Sat, 12 Jun 2010 16:52:32 -0000
>
> Hi David,
>
> All sounds good, and thanks again for all your work on this.
>
> I appreciate the difficulties of moving this package to 3.X
> in a backward-compatible way. My suggestions stem from the fact
> that it does work as is today, albeit in a less than ideal way.
>
> That, and I'm seeing that Python 3.X in general is still having
> a great deal of trouble gaining traction in the "real world"
> almost 2 years after its release, and I'd hate to see further
> disincentives for people to migrate. This is a bigger issue
> than both the email package and this thread, of course.
>
> > > 3) Type-dependent text part encoding
> > >
> > ...
> > So, in the next releases of Python all MIMEText input should be string,
> > and it will fail if you pass bytes. I consider this as email previously
> > not living up to its published API, but do you think I should hack
> > in a way for it to accept bytes too, for backward compatibility in the
> > 3 line?
>
> Decoding can probably be safely delegated to package clients.
> Typical email clients will probably have str for display of the
> main text. They may wish to read attachments in binary mode, but
> can always read in text mode instead or decode manually, because
> they need a known encoding to send the part correctly (my client
> has to ask or use configurations in some cases).
>
> B/W compatibility probably isn't a concern; I suspect that my
> temporary workaround will still work with your patch anyhow,
> and this code didn't work at all for some encodings before.
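
A minimal sketch of what delegating the decode to the client could look like, assuming MIMEText is always handed a str plus an explicit charset. The helper make_text_part and its utf-8 default are assumptions for illustration, not part of the package.

```python
from email.mime.text import MIMEText

def make_text_part(body, charset='utf-8'):
    # Hypothetical client-side helper: decode bytes up front so that
    # MIMEText only ever receives str, whatever the caller holds.
    if isinstance(body, bytes):
        body = body.decode(charset)
    return MIMEText(body, 'plain', charset)

part = make_text_part('caf\u00e9 au lait')
print(part['Content-Type'])
print(part.get_payload(decode=True))   # bytes in the declared charset
```

A caller that read an attachment in binary mode passes the bytes together with the encoding it already needs to know for sending; a caller that read in text mode passes the str directly.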
>
> > > There are some additional cases that now require decoding per mail
> > > headers today due to the str/bytes split, but these are just a
> > > normal artifact of supporting Unicode character sets in general,
> > > and seem like issues for package clients to resolve (e.g., the bytes
> > > returned for decoded payloads in 3.X didn't play well with existing
> > > str-based text processing code written for 2.X).
> >
> > I'm not following you here. Can you give me some more specific
> > examples? Even if these "normal artifacts" must remain with
> > the current API, I'd like to make things as easy as practical when
> > using the new API.
>
> This was just a general statement about things in my own code that
> didn't jibe with the 3.X string model. For instance, line wrapping
> logic assumed str; tkinter text widgets do much better rendering str
> than the bytes fetched for decoded payloads; and my Pyedit text editor
> component had to be overhauled to handle display/edit/save of payloads
> of arbitrary encodings. If I remember any more specific issues with
> the email package itself, I'll forward your way.
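
To illustrate the bytes-versus-str point for decoded payloads, a rough sketch of turning a non-multipart part's payload into str before handing it to str-based code (line wrapping, a text widget). The helper payload_text and its latin-1 fallback are assumptions for illustration.

```python
import email

def payload_text(part, fallback='latin-1'):
    # get_payload(decode=True) undoes base64/quoted-printable and returns
    # bytes for a non-multipart part; it does not apply the charset.
    raw = part.get_payload(decode=True)
    charset = part.get_content_charset() or fallback  # fallback is an assumption
    return raw.decode(charset, errors='replace')

msg = email.message_from_string(
    'Content-Type: text/plain; charset="utf-8"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'Y2Fmw6k=\n')
print(payload_text(msg))
```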
>
> I'll watch for an opportunity to get the book's new PyMailGUI
> client code to you as a candidate test case, but please ping
> me about it later if I haven't acted on this. It works well,
> but largely because of all the work that went into the email
> package underlying it.
>
> Thanks,
> --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)
>
>
> > -----Original Message-----
> > From: "R. David Murray" <rdmurray at bitdance.com>
> > To: lutz at rmi.net
> > Subject: Re: email package status in 3.X
> > Date: Thu, 10 Jun 2010 10:18:48 -0400
> >
> > On Thu, 10 Jun 2010 09:21:52 -0400, lutz at rmi.net wrote:
> > > In other words, some of my concern may have been a bit premature.
> > > I hope that in the future we'll either strive for compatibility
> > > or keep the current version around; it's a lot of very useful code.
> >
> > The plan is to have a compatibility layer that will accept calls based
> > on the old API and forward appropriately to the new API. So far I'm
> > thinking I can succeed in doing this in a fairly straightforward manner,
> > but I won't know for sure until I get some more pieces in place.
> >
> > > In fact, I recommend that any new email package be named distinctly,
> >
> > I'm going to avoid that if I can (though the PyPI package will be
> > named email6 when we publish it for public testing). If, however,
> > it turns out that I can't correctly support both the old and the
> > new API, then I'll have to do that.
> >
> > > and that the current package be retained for a number of releases to
> > > come. After all the breakages that 3.X introduced in general, doing
> > > the same to any email-based code seems a bit too much, especially
> > > given that the current package is largely functional as is. To me,
> > > after having just used it extensively, fixing its few issues seems
> > > a better approach than starting from scratch.
> >
> > Well, the thing is, as you found, existing 2.x code needs to be fixed to
> > correctly handle the distinction between strings and bytes no matter what.
> > The goal is to make it easier to write correct programs, while providing
> > the compatibility layer to make porting smoother. But I doubt that any
> > non-trivial 2.x email program will port without significant changes,
> > even if the compatibility layer is close to 100% compatible with the
> > current Python3 email package, simply because the previous conflation
> > of text and bytes must be untangled in order to work correctly in
> > Python3, and email involves lots of transitions between text and bytes.
> >
> > As for "starting from scratch", it is true that the current plan involves
> > considerable changes in the recommended API (in the direction of greater
> > flexibility and power), but I'm hoping that significant portions of the
> > code will carry forward with minor changes, and that this will make it
> > easier to support the old API.
> >
> > > As far as other issues, the things I found are described below my
> > > signature. I don't know what the utf-8 issue is that you refer
> > > to; I'm able to parse and send with this encoding as is without
> > > problems (both payloads and headers), but I'm probably not using the
> > > interfaces you fixed, and this may be the same as one of the items listed.
> >
> > It is, see below.
> >
> > > Another thought: it might be useful to use the book's email client
> > > as a sort of test case for the package; it's much more rigorous in
> > > the new edition because it now has to be given 3.X's Unicode model
> > > (it's about 4,900 lines of code, though not all is email-related).
> > > I'd be happy to donate the code as soon as I find out what the
> > > copyright will be this time around; it will be at O'Reilly's site
> > > this Fall in any event.
> >
> > That would be great. I am planning to write my own sample app to
> > demonstrate the new API, but if I can use yours to test the compatibility
> > layer that will help a lot, since I otherwise have no Python3 email
> > application to test against unless I port something from Python2.
> >
> > > Major issues I found...
> > > ------------------------------------------------------------------
> > > 1) Str required for parsing, but bytes returned from poplib
> > >
> > > The initial decode from bytes to str of full mail text; in
> > > retrospect, probably not a major issue, since original email
> > > standards called for ASCII. An 8-bit encoding like Latin-1 is
> > > probably sufficient for most conforming mails. For the book,
> > > I try a set of different encodings, beginning with an optional
> > > configuration module setting, then ascii, latin-1, and utf-8;
> > > this is probably overkill, but a GUI has to be defensive.
> >
> > This works (mostly) for conforming email, but some important Python email
> > applications need to deal with non-conforming email. That's where the
> > inability to parse bytes directly really causes problems.
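
The try-several-encodings approach from item 1 can be sketched roughly as below. The function name decode_mail_bytes and the configured argument are made up for illustration. Note one deliberate change from the order given above: latin-1 maps every possible byte and so never fails, which means anything listed after it is never tried; this sketch therefore tries utf-8 before latin-1.

```python
def decode_mail_bytes(raw, configured=None):
    # Candidate encodings, most specific first. latin-1 goes last because
    # it accepts any byte sequence and therefore always succeeds.
    candidates = [configured] if configured else []
    candidates += ['ascii', 'utf-8', 'latin-1']
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue   # wrong guess or unknown codec name: try the next one

text, used = decode_mail_bytes(b'Subject: hello\r\n\r\nbody\r\n')
print(used)
```

This only guesses an encoding for conforming mail; it does nothing for the non-conforming mail mentioned in the reply, where the bytes may not be text in any single encoding.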
> >
> > > 2) Binary attachments encoding
> > >
> > > The binary attachments byte-to-str issue that you've just
> > > fixed. As I mentioned, I worked around this by passing in a
> > > custom encoder that calls the original and runs an extra decode
> > > step. Here's what my fix looked like in the book; your patch
> > > may do better, and I will minimally add a note about the 3.1.3
> > > and 3.2 fix for this:
> >
> > Yeah, our patch was a lot simpler since we could fix the encoding inside
> > the loop producing the encoded lines :)
> >
> > > 3) Type-dependent text part encoding
> > >
> > > There's a str/bytes confusion issue related to Unicode encodings
> > > in text payload generation: some encodings require the payload to
> > > be str, but others expect bytes. Unfortunately, this means that
> > > clients need to know how the package will react to the encoding
> > > that is used, and special-case based upon that.
> >
> > This was the UTF-8 bug I fixed. I shouldn't have called it "the UTF-8
> > bug", because it applies equally to the other charsets that use base64,
> > as you note. I called it that because UTF-8 was where the problem was
> > noticed and is mentioned in the title of the bug report.
> >
> > I had a suspicion that the quoted-printable encoding wasn't being done
> > correctly either, so to hear that it is working for you is good news.
> > There may still be bugs to find there, though.
> >
> > So, in the next releases of Python all MIMEText input should be string,
> > and it will fail if you pass bytes. I consider this as email previously
> > not living up to its published API, but do you think I should hack
> > in a way for it to accept bytes too, for backward compatibility in the
> > 3 line?
> >
> > > There are some additional cases that now require decoding per mail
> > > headers today due to the str/bytes split, but these are just a
> > > normal artifact of supporting Unicode character sets in general,
> > > and seem like issues for package clients to resolve (e.g., the bytes
> > > returned for decoded payloads in 3.X didn't play well with existing
> > > str-based text processing code written for 2.X).
> >
> > I'm not following you here. Can you give me some more specific
> > examples? Even if these "normal artifacts" must remain with
> > the current API, I'd like to make things as easy as practical when
> > using the new API.
> >
> > Thanks for all your feedback!
> >
> > --David
> >
>
>
>
>