[Python-3000] email libraries: use byte or unicode strings?
Glenn Linderman
v+python at g.nevcal.com
Thu Nov 6 06:27:32 CET 2008
On approximately 11/5/2008 4:24 PM, came the following characters from
the keyboard of Andrew McNamara:
>> But I'm not at all clear on what you mean by a round-trip through the
>> email module. Let me see... if you are creating an email, you (1)
>> should encode it properly (2) a round-trip is mostly meaningless, unless
>> you send it to yourself. So you probably mean email that is received,
>> and that you want to send on. In this case, there is already a
>> composed/encoded form of the email in hand; it could simply be sent as
>> is without decoding or re-encoding. That would be quite a clean round-trip!
>
> Imagine a mail proxy of some sort (SMTP or a list manager like Mailman) -
> you want to be able to parse a message, maybe make some minor changes
> (such as adding a "Received:" header, or stripping out illegal MIME types)
> and then emit something that differs from the original in only the ways
> that you specified.
Sure. Add header, delete header APIs would suffice for this. The APIs
could accept Unicode, but do bytes manipulations.
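As a rough sketch of what such APIs might look like (the names
prepend_header and delete_header are hypothetical, not part of the email
module, and CRLF line endings are assumed), the arguments are str but
the message itself is only ever touched as bytes:

```python
def prepend_header(raw: bytes, name: str, value: str) -> bytes:
    """Return raw with 'name: value' added as the first header line."""
    # Header names and already-encoded header values must be ASCII;
    # encode() raises UnicodeEncodeError if they are not.
    line = f"{name}: {value}".encode("ascii") + b"\r\n"
    return line + raw


def delete_header(raw: bytes, name: str) -> bytes:
    """Drop every header called name, including folded continuation lines."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    prefix = name.encode("ascii").lower() + b":"
    kept, skipping = [], False
    for line in head.split(b"\r\n"):
        if line[:1] in (b" ", b"\t"):
            # Continuation line: shares the fate of the header it belongs to.
            if skipping:
                continue
        else:
            skipping = line.lower().startswith(prefix)
            if skipping:
                continue
        kept.append(line)
    # Everything not explicitly removed is returned byte-for-byte.
    return b"\r\n".join(kept) + sep + body
```

Nothing outside the added or removed lines is decoded or re-encoded, so
the output differs from the input only in the ways that were asked for.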
> Another example - imagine what a mail transport agent does with bounces:
> it wraps them in a MIME wrapper, but otherwise changes the structure
> as little as possible (because that would make later analysis of the
> bounce problematic).
So they usually truncate the size too, to 10K or less. Enough to get
all the headers. Some only send headers back. So it is no problem. A
"retrieve headers in binary from message" API, followed by "add this
chunk of binary as a MIME part" to the new bounce message under
construction. The first could be replaced by "retrieve message as
bytes" and "substr", as an alternative. So yes, some bytes APIs are
necessary for binary MIME parts and the whole message (as I mentioned
before), and there may be a few other special cases. But mostly, just
Unicode.
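A minimal sketch of those two steps with today's email.mime classes
follows. The helpers original_headers and build_bounce are hypothetical
names, CRLF line endings are assumed, and application/octet-stream is
used only to keep the example short (a real bounce would more likely
label the part text/rfc822-headers):

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def original_headers(raw: bytes) -> bytes:
    """Return the header block of raw, untouched, as bytes."""
    head, _sep, _body = raw.partition(b"\r\n\r\n")
    return head + b"\r\n"


def build_bounce(raw: bytes, explanation: str) -> MIMEMultipart:
    """Wrap the original headers, as a binary part, in a new message."""
    bounce = MIMEMultipart("mixed")
    bounce["Subject"] = "Undeliverable mail"
    # Human-readable part: ordinary Unicode text.
    bounce.attach(MIMEText(explanation, "plain", "utf-8"))
    # Binary part: MIMEApplication base64-encodes the bytes it is given,
    # so the original header bytes survive whatever they contain.
    bounce.attach(MIMEApplication(original_headers(raw)))
    return bounce
```

The text part goes through a Unicode interface and the attached chunk
through a bytes interface, which is the split described above.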
>> Notice that I said _nothing_ about the underlying processing in my
>> comments, only the API. I fully agree that some, perhaps most, of the
>> underlying processing has to be aware of bytes, and use and manipulate
>> bytes.
>
> The bytes API has to be accessible - there are many contexts in which
> you need to work at this level.
Maybe. I named a couple, you've named another, maybe there are a few
more. The only reason not to have a full bytes API is just the effort
to support it... if that can reasonably be avoided, why not? But I
doubt there are a lot of cases that _must_ be handled as bytes, and so
if we can identify the ones that indeed, must be, and supply them, the
rest can be Unicode.
>> Indeed, the headers must be ASCII, and once encoded, the header body is
>> also.
>
> Except when they're not. It's not uncommon in mail handling to get a
> valid message that doesn't conform to the specs (not just spam). You can
> either throw your hands up in the air and declare it irredeemably broken,
> or do your best to extract meaning from it. Invariably, it's the CEO's
> best mate who sent the malformed message, so you process it or find a
> new job.
This is where you use the Latin-1 conversion. Don't throw an error when
it doesn't conform, but don't go to heroic efforts to provide bytes
alternatives... just convert the bytes to Unicode. Given the way the mail
RFCs are written, and the types of encodings used, the result is mostly
readable. And if it isn't encoded, it is even more readable.
>> And so it is quite possible to misinterpret the improperly encoded
>> headers as 8-bit octets that correspond to Unicode codepoints (the
>> so-called "Latin-1" conversion). For spam, that is certainly good
>> enough. And roundtripping it says that if APIs are not used to change
>> it, you use the original binary for that header.
>
> Certainly, this is one approach, and users of the email module in the py3k
> standard lib are essentially doing this now.
And so how much is it a problem? What are the effects of the problem?
Does providing a bytes API solve the problem, or simply punt it to the
user? If it simply punts it to the user, are there significant benefits
to the coder-user of obtaining the data as bytes, vs. obtaining it as
bytes transliterated by the Latin-1 conversion to Unicode? If there are
significant benefits to the coder-user, what are they?
>>> One solution is to provide two sets of classes - the underlying
>>> bytes-based one, and another unicode-based one, built on top of the
>>> bytes classes, that implements the same API, but that may fail due to
>>> encoding errors.
>> I think you meant "decoding" errors, there?
>
> Well, yes and no. I meant that the encoding was done incorrectly.
Sure. The encoding wasn't done correctly, or wasn't done at all. But
that causes problems for the decoder, on the receiving side.
>> I guess I'm not terribly concerned about the readability of improperly
>> encoded email messages, whether they are spam or ham.
>
> You may not be, but other users of the module are.
Sure, but if it isn't properly encoded, then it is most likely in an
ASCII superset, in which case the ASCII parts will be readable (at
least), and with a little human cleverness, the non-ASCII parts can be
intuited. I'm not suggesting making it worse than what it already is,
in bytes form; just to translate the bytes to Unicode codepoints so that
they can be returned on a Unicode interface. If you return them in
bytes, what would you do besides that? If you would guess at an
encoding, and do a different decode, that can be done on the Unicode
transliteration just as easily as it can on the bytes form.
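To illustrate that last point, here is a small sketch (the sample
header is made up) showing that the Latin-1 transliteration loses
nothing, so a later guess at the real encoding can start from the str
form:

```python
# An unlabeled 8-bit header; the sender actually used UTF-8.
raw = "Subject: Grüße aus Köln".encode("utf-8")

# Latin-1 maps each byte 0x00-0xFF to the code point with the same
# number, so this decode cannot fail and discards no information.
as_text = raw.decode("latin-1")
assert as_text.encode("latin-1") == raw

# A better guess made later is applied by going back through Latin-1.
reguessed = as_text.encode("latin-1").decode("utf-8")
```

The ASCII part ("Subject: Gr...") reads the same in as_text and in
reguessed; only the non-ASCII characters differ.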
>> For ham, the correspondent should be informed that there are problems
>> with their software, so that they can upgrade or reconfigure it.
>
> How do you determine the correspondent if you can't parse their e-mail? 8-)
Email addresses are pretty standardized in format. Especially the
Errors header and the From header. So I think the correspondent's email
address will be reasonably interpretable even if their name is not, and
the body of their message is not.
I'm not saying all is wonderful if they didn't properly encode their
message, but I think you are exaggerating the problem... you can write
back to the email address, even if you can't read the message.
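A quick sketch of that claim, using email.utils.parseaddr on a made-up
From value whose display name was sent as raw KOI8-R bytes with no
encoding declared:

```python
from email.utils import parseaddr

# Hypothetical header value: a KOI8-R display name sent without any
# MIME encoding, followed by an ordinary ASCII address.
raw_value = "Иван".encode("koi8-r") + b" <ivan@example.com>"

# Transliterate via Latin-1; the name comes out garbled, but the
# address is ASCII and is unaffected.
name, address = parseaddr(raw_value.decode("latin-1"))
```

Here the name is not readable as intended, while address is still
usable for writing back to the sender.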
>> (Not that I don't understand those encodings, but it is something that
>> certainly can and should be mostly hidden under the covers.)
>
> You're talking about a utopian state that Unicode strives but fails to achieve.
Messages that are properly encoded can certainly achieve the Utopian
state under the covers.
Messages that are not properly encoded can be assumed to be Latin-1, and
converted to Unicode. They may not be perfectly readable in that state,
but face it, non-Unicode email clients did exactly that, but used
Latin-1 bytes directly (or some other encoding). And if you think it
would be helpful to have the default conversion to Unicode use some code
page other than Latin-1, such as the currently configured code page,
that is a fine alternative... and again, is much what happens today when
people communicate without doing the proper encoding. Two people that
use the same code page can communicate in that code page, but
communicating with people that use other code pages is problematical.
So no, Unicode doesn't solve the problems with buggy software, but it
can be used without making the problem worse, so using it generally
makes for a more convenient API.
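One way such a default conversion could be sketched is below. The
function decode_lenient and its parameters are hypothetical, not an
existing email API; the fallback argument stands in for "Latin-1 or the
currently configured code page":

```python
from typing import Optional


def decode_lenient(raw: bytes, declared: Optional[str] = None,
                   fallback: str = "latin-1") -> str:
    """Decode raw with the declared charset if it works, else fall back."""
    if declared:
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that do not match the label.
            pass
    # With Latin-1 this can never fail; for other code pages, bytes that
    # have no mapping are replaced rather than raising an error.
    return raw.decode(fallback, errors="replace")
```

For example, decode_lenient(b"caf\xe9", "utf-8") falls through to the
fallback because the bytes are not valid UTF-8, while properly labeled
input is decoded with its declared charset.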
Think about the coder of the Python-based email client. Given the
alternatives to use the Unicode API or the bytes API, how are they going
to choose to use one or the other? Code the application twice, once
with each API? No way! Too much work!
So they'll use the Unicode API for text, and the bytes APIs for binary
attachments, because that is what is natural.
If improperly encoded messages are received, and appropriate
transliterations are made so that the bytes get converted (default code
page) or passed through (Latin-1 transformation), then the data may be
somewhat garbled for characters in the non-ASCII subset. But that is
not different than the handling done by any 8-bit email client, nor, I
suspect (a little uncertainty here) different than the handling done by
Python < 3.0 mail libraries.
So that is not Utopian; Utopia can only be reached by following
standards. But I don't see it as terrible; it is no worse than what
happens today when the standards are not followed.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking