[Python-3000] email libraries: use byte or unicode strings?
Glenn Linderman
v+python at g.nevcal.com
Thu Nov 6 00:39:55 CET 2008
On approximately 11/5/2008 2:59 PM, came the following characters from
the keyboard of Andrew McNamara:
>> I would find
>>
>> message[b'Subject'] = b'Hello'
>>
>> to be totally gross.
>>
>> While RFC Email is all ASCII, except if 8bit transfer is legal, there
>> are internal encodings provided that permit the expression of Unicode in
>> nearly any component of the email, except for header identifiers. But
>> there are never Unicode characters in the transfer, as they always get
>> encoded (there can be UTF-8 byte sequences, of course, if 8bit transfer
>> is legal; if it is not, then even UTF-8 byte sequences must be further
>> encoded).
>>
>> Depending on the level of email interface, there should be no interface
>> that cannot be expressed in terms of Unicode, plus an encoding to use
>> for the associated data. Even 8-bit binary can be translated into a
>> sequence of Unicode codepoints with the same numeric value, for example.
>
> One significant problem is that the email module is intended to be
> able to work with malformed e-mail without mangling it too badly. The
> malformed e-mail should also make a round-trip through the email module
> without being further mangled.
This is an interesting perspective... "stuff em" does come to mind :)
But I'm not at all clear on what you mean by a round-trip through the
email module. Let me see... if you are creating an email, then (1) you
should encode it properly, and (2) a round-trip is mostly meaningless, unless
you send it to yourself. So you probably mean email that is received,
and that you want to send on. In this case, there is already a
composed/encoded form of the email in hand; it could simply be sent as
is without decoding or re-encoding. That would be quite a clean round-trip!
> I think this requires the underlying processing to be all based on bytes,
Notice that I said _nothing_ about the underlying processing in my
comments, only the API. I fully agree that some, perhaps most, of the
underlying processing has to be aware of bytes, and use and manipulate
bytes.
> but doesn't preclude layers on top that parse the charset hints. The
> rules about encoding are strict, but not always followed. For instance,
> the headers *must* be ASCII (the header body can, however, be encoded -
> see rfc2047).
Indeed, the headers must be ASCII, and once encoded, the header body is
also.
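To illustrate that an RFC 2047 encoded header body is itself pure ASCII, here is a minimal sketch. It assumes the standard library's email.header module as it exists in current Python 3; the subject text is a made-up example, not something from this thread.

```python
from email.header import Header, decode_header

# Hypothetical subject containing non-ASCII characters ("Gruesse" with umlaut and sharp s).
subject = "Gr\u00fc\u00dfe"

# Header.encode() produces the RFC 2047 "encoded-word" form of the text.
encoded = Header(subject, charset="utf-8").encode()

# The encoded form must survive an ASCII encode; this raises if it does not.
wire_bytes = encoded.encode("ascii")

# decode_header() returns (bytes, charset) pairs for encoded words,
# so the original text can be recovered from the ASCII form.
raw, charset = decode_header(encoded)[0]
recovered = raw.decode(charset)
```

The point is only that Unicode lives at the API level while what goes over the wire stays ASCII.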
> Spammers often ignore this, and you might be inclined to
> say "stuff em'", but this would make the SpamBayes authors rather unhappy.
And so it is quite possible to misinterpret the improperly encoded
headers as 8-bit octets that correspond to Unicode codepoints (the
so-called "Latin-1" conversion). For spam, that is certainly good
enough. And round-tripping means that if the APIs are not used to change
a header, you emit the original binary for that header.
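A minimal sketch of the Latin-1 conversion described here, assuming a hypothetical raw header with undeclared 8-bit bytes. Every byte maps to the code point with the same numeric value, so the decode cannot fail and the original bytes can always be recovered.

```python
# Hypothetical malformed header: 8-bit bytes with no RFC 2047 encoding.
raw = b"Subject: caf\xe9 sp\xe9cial"

# Latin-1 maps byte N to code point N, so this never raises.
text = raw.decode("latin-1")

# Each code point has the same numeric value as the byte it came from.
same_values = [ord(ch) for ch in text] == list(raw)

# Re-encoding with Latin-1 restores the original bytes exactly.
restored = text.encode("latin-1")
```

If the bytes really were Latin-1 the text is readable; if not, it is a consistent mis-transliteration that still round-trips.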
> One solution is to provide two sets of classes - the underlying
> bytes-based one, and another unicode-based one, built on top of the
> bytes classes, that implements the same API, but that may fail due to
> encoding errors.
I think you meant "decoding" errors, there?
I guess I'm not terribly concerned about the readability of improperly
encoded email messages, whether they are spam or ham. For the purposes
of SpamBayes (which I assume is similar to spamassassin, only written in
Python), it doesn't matter if the data is readable, only that it is
recognizably similar. So a consistent mis-transliteration is as good as
a correct decoding.
For ham, the correspondent should be informed that there are problems
with their software, so that they can upgrade or reconfigure it. And a
mis-transliteration is likely the best that can be provided in that case
anyway... unless the mail API provides for ignoring the incoming
(incorrect or missing) encoding directives and using one provided by the
API, and the client can select a few until they stumble on one that
produces a readable result. But if the mis-transliteration is done
using the Latin-1 conversion to Unicode, the client, if it wants to do
that sort of heuristic analysis, can re-encode to Latin-1, and
then decode using some other encoding(s), independently of the mail APIs
providing such a facility.
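The client-side heuristic could look something like the following sketch. The function name `redecode` and the candidate list are hypothetical, chosen only for illustration; no such helper is assumed to exist in the mail APIs.

```python
def redecode(mistransliterated, candidates):
    """Undo a Latin-1 mis-transliteration and try other encodings.

    Returns a dict mapping each candidate encoding name to the decoded
    text, or to None when the bytes are not valid in that encoding.
    """
    # Recover the original bytes; lossless because Latin-1 is 1:1.
    raw = mistransliterated.encode("latin-1")
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical example: UTF-8 bytes that were decoded as Latin-1.
original_bytes = "Gr\u00fc\u00dfe".encode("utf-8")
garbled = original_bytes.decode("latin-1")

guesses = redecode(garbled, ["ascii", "utf-8", "cp1252"])
```

A client would then show the candidates to the user, or apply its own scoring, until one produces a readable result.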
I do hope to learn and use the Python mail APIs, and I was hoping to do
that in Python 3.0 (and am sorry, but not surprised, to hear that this
is an area of problems at present), and I was hoping that the interfaces
that would be presented by Python 3.0 mail APIs would be in terms of
Unicode, for the convenience of being abstracted away from the plethora
of encodings that are defined at the mail transport layer. (Not that I
don't understand those encodings, but it is something that certainly can
and should be mostly hidden under the covers.)
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking