[Python-3000] email libraries: use byte or unicode strings?

Thu Nov 6 01:24:04 CET 2008

>But I'm not at all clear on what you mean by a round-trip through the 
>email module.  Let me see... if you are creating an email, you (1) 
>should encode it properly (2) a round-trip is mostly meaningless, unless 
>you send it to yourself.  So you probably mean email that is received, 
>and that you want to send on.  In this case, there is already a 
>composed/encoded form of the email in hand; it could simply be sent as 
>is without decoding or re-encoding.  That would be quite a clean round-trip!

Imagine a mail proxy of some sort (SMTP or a list manager like Mailman) - 
you want to be able to parse a message, maybe make some minor changes
(such as adding a "Received:" header, or stripping out illegal MIME types)
and then emit something that differs from the original in only the ways
that you specified.
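
A minimal sketch of that proxy case, assuming the bytes-level entry points
the stdlib email package offers in later Python 3 releases
(message_from_bytes and Message.as_bytes). The sample message and the
"X-Proxy-Seen" header name are made up for illustration. For this small,
well-formed input the output should differ from the input only by the added
header; real-world mail with odd folding or line endings may not roundtrip
so cleanly.

```python
from email import message_from_bytes

# Hypothetical sample message; addresses and header name are illustrative.
RAW = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: hello\n"
    b"\n"
    b"Body text.\n"
)


def tag_message(raw: bytes) -> bytes:
    """Parse a message, add one header, and serialise it again."""
    msg = message_from_bytes(raw)   # default (compat32) policy
    msg["X-Proxy-Seen"] = "yes"     # appended after the existing headers
    return msg.as_bytes()


# Parse and re-emit without touching anything.
untouched = message_from_bytes(RAW).as_bytes()

# Parse, make one small change, re-emit.
tagged = tag_message(RAW)
```

The interesting property is the difference between RAW and tagged: a proxy
wants that difference to be exactly the change it asked for and nothing else.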

Another example - imagine what a mail transport agent does with bounces:
it wraps them in a MIME wrapper, but otherwise changes the structure
as little as possible (because that would make later analysis of the
bounce problematic).

>Notice that I said _nothing_ about the underlying processing in my 
>comments, only the API.  I fully agree that some, perhaps most, of the 
>underlying processing has to be aware of bytes, and use and manipulate 
>bytes.

The bytes API has to be accessible - there are many contexts in which
you need to work at this level.

>Indeed, the headers must be ASCII, and once encoded, the header body is 
>also.

Except when they're not. It's not uncommon in mail handling to get a
valid message that doesn't conform to the specs (not just spam). You can
either throw your hands up in the air and declare it irredeemably broken,
or do your best to extract meaning from it. Invariably, it's the CEO's
best mate who sent the malformed message, so you process it or find a
new job.

>And so it is quite possible to misinterpret the improperly encoded 
>headers as 8-bit octets that correspond to Unicode codepoints (the 
>so-called "Latin-1" conversion).  For spam, that is certainly good 
>enough.  And roundtripping it says that if APIs are not used to change 
>it, you use the original binary for that header.

Certainly, this is one approach, and users of the email module in the py3k
standard lib are essentially doing this now.
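
To make the "Latin-1" trick concrete, here is a small sketch (the function
names and the sample header are invented for illustration). It relies only on
the fact that Latin-1 maps each of the 256 byte values to the code point with
the same number, so decoding never fails and re-encoding gives back the
original bytes, even though the resulting text may be meaningless.

```python
def bytes_to_text(raw: bytes) -> str:
    """Map each byte to the code point with the same value."""
    return raw.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text, valid for code points 0..255 only."""
    return text.encode("latin-1")


# Hypothetical header with unlabelled 8-bit data in it.
raw_header = b"Subject: caf\xe9 \xff\x80 unlabelled 8-bit\r\n"

as_text = bytes_to_text(raw_header)
roundtripped = text_to_bytes(as_text)

# A strict decode of the same bytes fails, which is the case being worked
# around here.
try:
    raw_header.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
```

Note that text_to_bytes only works while the string stays within code points
0..255; once real Unicode text has been assigned through an API, the original
bytes are gone and a proper encoding step is needed.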

>> One solution is to provide two sets of classes - the underlying
>> bytes-based one, and another unicode-based one, built on top of the
>> bytes classes, that implements the same API, but that may fail due to
>> encoding errors.
>
>I think you meant "decoding" errors, there?

Well, yes and no. The failure shows up when decoding, but I meant that the
original encoding was done incorrectly.
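
A rough sketch of the two-layer idea, under the assumption that headers are
modelled as simple name/value pairs. BytesHeader and TextHeader are
hypothetical names for illustration, not classes in the email package. The
text layer is a view over the bytes layer: reading it can fail on badly
encoded input, while the bytes underneath stay available and unchanged.

```python
class BytesHeader:
    """Underlying layer: holds the header exactly as received."""

    def __init__(self, name: bytes, value: bytes):
        self.name = name
        self.value = value

    def serialize(self) -> bytes:
        return self.name + b": " + self.value + b"\r\n"


class TextHeader:
    """Unicode view built on a BytesHeader; decoding may fail."""

    def __init__(self, raw: BytesHeader, charset: str = "ascii"):
        self._raw = raw
        self._charset = charset

    @property
    def name(self) -> str:
        return self._raw.name.decode(self._charset)

    @property
    def value(self) -> str:
        # Raises UnicodeDecodeError if the sender encoded the value wrongly.
        return self._raw.value.decode(self._charset)

    def serialize(self) -> bytes:
        # Unmodified headers are emitted from the original bytes.
        return self._raw.serialize()


good = TextHeader(BytesHeader(b"Subject", b"hello"))
bad = TextHeader(BytesHeader(b"Subject", b"caf\xe9"))

try:
    bad.value
    bad_value_failed = False
except UnicodeDecodeError:
    bad_value_failed = True
```

A caller that hits the error on the text layer can still drop down to the
bytes layer and pass the header through untouched, or attempt its own
best-effort interpretation.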

>I guess I'm not terribly concerned about the readability of improperly 
>encoded email messages, whether they are spam or ham.

You may not be, but other users of the module are.

>For ham, the correspondent should be informed that there are problems 
>with their software, so that they can upgrade or reconfigure it.

How do you determine the correspondent if you can't parse their e-mail? 8-)

>(Not that I don't understand those encodings, but it is something that
>certainly can and should be mostly hidden under the covers.)

You're talking about a utopian state that Unicode strives for but fails to
achieve.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/