[Python-Dev] Patch making the current email package (mostly) support bytes

Thu Oct 7 17:15:18 CEST 2010

On Thu, 07 Oct 2010 15:00:04 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> R. David Murray writes:
> 
>  > > But that's not interesting; you did that with Python 3.  We want to
>  > Of course I did it with Python3.  It's the Python3 email codebase
>  > I'm working with (and have to work *around*).
> 
> Sure.  My point is that it has nothing to do with the expectations of
> people trying to upgrade their apps to Python 3, and meeting those
> expectations is an important requirement of the specification of
> email5, right?

Well, not necessarily, no.  Python3 broke backward compatibility.
*Some* changes are going to have to be made in user code to make it
work with email5.  Where we can minimize those changes we should,
but it isn't a requirement, no.  With my patch, the minimization will
be message_from_string --> message_from_bytes, message_from_file -->
message_from_binary_file, and in some cases Generator --> BytesGenerator,
for those programs that need to deal with wire format data that is not
7bit clean.  Programs that only *generate* emails should need few
if any changes, but that is already true (that's the half of email
that is working :).
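
A minimal sketch of what the bytes-oriented path looks like from user
code, assuming the names above (message_from_bytes and BytesGenerator,
which is how they ended up in the Python 3 email package).  The sample
message and the variable names are made up for illustration.

```python
import email
from io import BytesIO
from email.generator import BytesGenerator

# Wire-format input that is not 7bit clean (a UTF-8 body sent as 8bit).
raw = (
    b"From: sender@example.com\n"
    b"Subject: test\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"p\xc3\xb6stal\n"
)

# Previously: email.message_from_string(text)
msg = email.message_from_bytes(raw)

# Previously: Generator(StringIO()).flatten(msg)
buf = BytesIO()
BytesGenerator(buf).flatten(msg)
wire = buf.getvalue()  # bytes, suitable for writing back to the wire
```

Code that only reads ASCII headers such as msg['Subject'] should not
need to change beyond the two renamed entry points.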

> Actually, in context we were not talking about a random character that
> came in from outside, we were talking about U+FFFD that *we*
> generated, and *know* that it's the only non-ASCII character in the
> string because we replaced all the others with it.

Ah, so that *was* what you were suggesting.

> Of course the best we can do with 'From: =?UNKNOWN?Q?p=C3=B6stal' or
> 'From: p\xc3\xb6stal' on input is to save the encoded or raw bytes
> representation and spit it back out on output.

Yes.  And I haven't actually dealt with what to do with non-ascii
characters or RFC2047 unknown-8bit characters when decoding
headers in email6.  In issue 6302 we are talking about adding a
decode_header_to_string method for email5 where the same issue arises,
and so we'll need to make a decision soon.  Presumably we'll use U+FFFD
to replace them (along with registering defects in email6).
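
For illustration only, here is how the two candidate substitution
characters differ at the codec level.  This is plain str/bytes behaviour,
not anything in the email package; the sample bytes are an assumption.

```python
# A header value containing one byte that is not valid ASCII.
raw = b"p\xf6stal"

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER when decoding.
with_fffd = raw.decode("ascii", errors="replace")

# The '?' variant keeps the result pure ASCII.
with_qmark = with_fffd.replace("\ufffd", "?")
```

Either way the original byte is gone, which is why neither choice
allows the header to be regenerated unchanged.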

> The MIME-charset = UNKNOWN dodge might be a better way of handling
> this.  The str is all ASCII, so won't raise exceptions unless the app
> itself objects to MIME encoded-words for some reason.  OTOH, the
> presence of encoded words will be a red flag to any human viewer, and
> after processing with .flatten(), the receiver is likely to DTRT (from
> the receiving human's point of view, per that human's configuration).

That is a very interesting idea.  It is the *right* thing to do, since it
would mean that a message parsed as bytes could be generated via Generator
and passed to, say, smtplib without losing any information.  However,
it's not exactly trivial to implement, since issues of runs of characters
and line re-wrapping need to be dealt with.  Perhaps Header can be
made to handle bytes in order to do this; I'll have to look into it.
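
A rough sketch of the idea, assuming a hypothetical helper
(encode_unknown_8bit) and the charset label 'unknown-8bit'; neither is
taken from the patch.  It deliberately ignores the hard parts mentioned
above (runs of characters, the encoded-word length limit, and line
re-wrapping) and only shows that the bytes survive a round trip through
an all-ASCII str.

```python
from email.header import decode_header


def encode_unknown_8bit(raw: bytes) -> str:
    """Wrap raw bytes in a single RFC 2047 Q-encoded word (hypothetical helper)."""
    out = []
    for b in raw:
        if b < 128 and chr(b).isalnum():
            out.append(chr(b))        # ASCII letters and digits pass through
        elif b == 0x20:
            out.append("_")           # Q encoding writes a space as underscore
        else:
            out.append("=%02X" % b)   # everything else is hex-escaped
    return "=?unknown-8bit?q?" + "".join(out) + "?="


encoded = encode_unknown_8bit(b"p\xc3\xb6stal")

# decode_header returns (bytes, charset) pairs, so the original bytes
# can be recovered without knowing what charset they were in.
decoded = decode_header(encoded)
```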

>  > So you are suggesting that I should use U+FFFD encoded as UTF-8
>  > rather than '?' as the substitution character?  But earlier you said
>  > that people would probably rather not be forced to deal with Unicode
>  > just because there are invalid bytes in the message.  So that's
>  > probably not what you meant.
> 
> "Suggest" != "recommend".  Talking to a wider base of users and
> developers, you might or might not find that to be a good idea.  I
> don't think the 800 million or so Chinese coming online in the next
> decade will much care whether you use U+FFFD or '?'.  The Japanese
> would prefer U+2639 WHITE FROWNING FACE or U+270C VICTORY HAND, no
> doubt ("crassly cute" is much beloved here).  Americans will likely
> prefer '?', as they probably have correspondents with legacy systems
> that won't like UTF-8 or perhaps don't have a font to display U+FFFD.

For the moment I think I'll stick with '?', with the idea of "fixing
that bug" by using the unknown charset trick at a later stage.

>  > Presumably you are suggesting that email5 be smart enough to turn my
>  > example into properly UTF-8/CTE encoded text.
> 
> No, in general that's undecidable without asking the originator,
> although humans can often make a good guess.  But not always: Japanese
> are fond of "four-character compound words", and I once found an
> 8-byte sequence (four 2-byte characters) that is idiomatic in both
> Shift JIS and EUC-JP.  Even a dictionary lookup can't determine the
> intended encoding for that sequence.
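
To make the ambiguity concrete, here is a small sketch with a
hypothetical helper (plausible_charsets) that reports every candidate
codec under which a byte string decodes without error.  The sample byte
and the candidate list are assumptions chosen for illustration; the same
check applied to Shift JIS and EUC-JP is what fails to disambiguate the
case described above.

```python
def plausible_charsets(raw: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate codecs that can decode raw without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# 0xE9 is not valid UTF-8 on its own, but is a valid character in both
# Latin-1 and Windows-1252, so decoding alone cannot pick between them.
matches = plausible_charsets(b"\xe9", ["utf-8", "latin-1", "cp1252"])
```

More than one match means the bytes alone do not identify the encoding.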

I was talking about unicode input, though, where you do know (modulo
the language differences that unicode hasn't yet sorted out).

> I'm only saying that any Unicode email-N generates itself can be
> properly encoded.

Agreed.

>  > But *that* problem is what email6 is trying to address.  It just
>  > doesn't look practical to address it directly in the email5 code
>  > base, because the email4 codebase that email5 inherits does not
>  > provide the correct distinction between bytes and text.  email5 is
>  > parsing the input stream *as if* it were ASCII-only CTE text.
> 
> I don't see how this is different from email6.  Just because email6 is
> trying to DTRT doesn't mean the spammers will, and even Emacs MUA
> developers occasionally screw this up in new products.  So email-N has
> to handle input streams that are *supposed* to be entirely ASCII except
> for message bodies that are properly marked as 8bit or binary CTE, but
> occasionally will not conform.

Right, but I was talking about my python3 example, where I was using
the email5 parser to (unsuccessfully) parse unicode.  *That's* the thing
email5 can't really handle, but email6 will be able to.

>  > Extending it to actually handle unicode input is a whole different
>  > kettle of sushi[*].
> 
> But this is not your problem in email5 AFAICS.

Right, but I thought you were suggesting it was.  My mistake.

>  > [*] And I've had an argument with someone who thinks email should
>  > *not* be extended to handle unicode messages with non-ASCII
>  > content, on the grounds that they aren't really email.
> 
> That's total nonsense.  Don't argue with people like that, educate
> them, and if that fails, ignore them.  There's good reason for not
> extending email5, ie, email4 didn't do it.  But that has nothing to do
> with what email "really is".

[ snip good supporting text ]

> In practice, email undoubtedly has clients that want to manipulate
> bytes directly.  I can't blame them, but the RFCs have nothing to say
> about that, really.  RFC 822 and its family (including MIME) are about
> representing human media as octet streams compatible with protocols
> like RFC 821, and in Python the human medium for representing text is
> str.  The result of bytes manipulations should be "as if" the original
> stream was decoded, manipulated, and reencoded.  So direct bytes
> manipulation is an optimization.  The RFCs don't provide for it at
> all, AFAICS.
> 
> The same thing is true of URIs, except that RFC 3986 makes it fully
> explicit that URIs are conceptually text, not octets.  Again, there
> are many important use cases for bytes manipulation of URIs, but this
> is an optimization.

Thank you very much for this piece of perspective.  I hadn't thought
about it that clearly before, but what you say makes perfect sense to me,
and is in fact the implicit perspective I've been working from when
working on the email6 stuff.

--
R. David Murray                                      www.bitdance.com