[Python-3000] email libraries: use byte or unicode strings?

Thu Nov 6 06:27:32 CET 2008

On approximately 11/5/2008 4:24 PM, came the following characters from 
the keyboard of Andrew McNamara:
>> But I'm not at all clear on what you mean by a round-trip through the 
>> email module.  Let me see... if you are creating an email, you (1) 
>> should encode it properly (2) a round-trip is mostly meaningless, unless 
>> you send it to yourself.  So you probably mean email that is received, 
>> and that you want to send on.  In this case, there is already a 
>> composed/encoded form of the email in hand; it could simply be sent as 
>> is without decoding or re-encoding.  That would be quite a clean round-trip!
> 
> Imagine a mail proxy of some sort (SMTP or a list manager like Mailman) - 
> you want to be able to parse a message, maybe make some minor changes
> (such as adding a "Received:" header, or stripping out illegal MIME types)
> and then emit something that differs from the original in only the ways
> that you specified.

Sure.  Add header, delete header APIs would suffice for this.  The APIs 
could accept Unicode, but do bytes manipulations.
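
A rough sketch of what such add/delete calls might look like, assuming the message_from_bytes() and as_bytes() calls found in later Python 3 email packages (they were not available when this was written). The function name and the header names in the usage line are hypothetical.

```python
import email
from email import policy


def add_and_strip(raw: bytes, name: str, value: str, drop: str) -> bytes:
    """Parse raw message bytes, delete one header, add another, re-emit bytes."""
    # The compat32 policy keeps the parsed header lines close to their
    # original form instead of reinterpreting them.
    msg = email.message_from_bytes(raw, policy=policy.compat32)
    del msg[drop]          # silently does nothing if the header is absent
    msg[name] = value      # str in the API; appended after existing headers
    return msg.as_bytes()  # bytes on the wire


raw = b"From: a@example.com\nX-Junk: 1\nSubject: hi\n\nbody text\n"
out = add_and_strip(raw, "X-Proxy", "seen", "X-Junk")
```

The caller passes str for the header name and value, while the message itself goes in and comes out as bytes.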

> Another example - imagine what a mail transport agent does with bounces:
> it wraps them in a MIME wrapper, but otherwise changes the structure
> as little as possible (because that would make later analysis of the
> bounce problematic).

So they usually truncate the size too, to 10K or less.  Enough to get 
all the headers.  Some only send headers back.  So it is no problem.  A 
"retrieve headers in binary from message" API, followed by "add this 
chunk of binary as a MIME part" to the new bounce message under 
construction.  The first could be replaced by "retrieve message as 
bytes" and "substr", as an alternative.  So yes, some bytes APIs are 
necessary for binary MIME parts and the whole message (as I mentioned 
before), and there may be a few other special cases.  But mostly, just 
Unicode.
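
As an illustration only, the two bounce steps could be sketched like this, assuming the EmailMessage class of later Python 3 releases. The helper names, the 10K default, the subject text, and the attachment filename are all made up for the example.

```python
from email.message import EmailMessage


def header_block(raw: bytes, limit: int = 10 * 1024) -> bytes:
    """Return the raw header bytes of a message, truncated to `limit` bytes."""
    # The headers end at the first blank line, whichever line ending is used.
    found = [i for i in (raw.find(b"\r\n\r\n"), raw.find(b"\n\n")) if i != -1]
    end = min(found) if found else len(raw)
    return raw[:end][:limit]


def build_bounce(original: bytes) -> EmailMessage:
    """Wrap the untouched header bytes of `original` in a new message."""
    bounce = EmailMessage()
    bounce["Subject"] = "Undelivered mail returned to sender"
    bounce.set_content("Delivery failed; the original headers are attached.")
    # The header chunk travels as opaque binary, so it is never decoded.
    bounce.add_attachment(
        header_block(original),
        maintype="application",
        subtype="octet-stream",
        filename="headers.bin",
    )
    return bounce


original = b"From: a@example.com\r\nSubject: caf\xe9\r\n\r\nbody\r\n"
bounce = build_bounce(original)
```

The improperly encoded Subject byte survives because nothing in this path ever tries to interpret it.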

>> Notice that I said _nothing_ about the underlying processing in my 
>> comments, only the API.  I fully agree that some, perhaps most, of the 
>> underlying processing has to be aware of bytes, and use and manipulate 
>> bytes.
> 
> The bytes API has to be accessible - there are many contexts in which
> you need to work at this level.

Maybe.  I named a couple, you've named another, maybe there are a few 
more.  The only reason not to have a full bytes API is just the effort 
to support it... if that can reasonably be avoided, why not?  But I 
doubt there are a lot of cases that _must_ be handled as bytes, and so 
if we can identify the ones that indeed, must be, and supply them, the 
rest can be Unicode.

>> Indeed, the headers must be ASCII, and once encoded, the header body is 
>> also.
> 
> Except when they're not. It's not uncommon in mail handling to get a
> valid message that doesn't conform to the specs (not just spam). You can
> either throw your hands up in the air and declare it irredeemably broken,
> or do your best to extract meaning from it. Invariably, it's the CEO's
> best mate who sent the malformed message, so you process it or find a
> new job.

This is where you use the Latin-1 conversion.  Don't throw an error when 
it doesn't conform, but don't go to heroic efforts to provide bytes 
alternatives... just convert the bytes to Unicode, and the way the mail 
RFCs are written, and the types of encodings used, it is mostly 
readable.  And if it isn't encoded, it is even more readable.
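
To make the Latin-1 conversion concrete, here is a small sketch; the sample header bytes are invented for illustration. Latin-1 maps every byte value 0..255 to the code point with the same number, so the decode cannot fail and loses nothing.

```python
# An unencoded 8-bit header, which the mail RFCs do not allow.
raw = b"Subject: R\xe9sum\xe9 for the caf\xe9"

# A strict ASCII decode rejects it ...
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ... but the Latin-1 conversion always succeeds,
text = raw.decode("latin-1")

# and encoding back reproduces the original bytes exactly.
round_tripped = text.encode("latin-1")
```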

>> And so it is quite possible to misinterpret the improperly encoded 
>> headers as 8-bit octets that correspond to Unicode codepoints (the 
>> so-called "Latin-1" conversion).  For spam, that is certainly good 
>> enough.  And roundtripping it says that if APIs are not used to change 
>> it, you use the original binary for that header.
> 
> Certainly, this is one approach, and users of the email module in the py3k
> standard lib are essentially doing this now.

And so how much is it a problem?  What are the effects of the problem? 
Does providing a bytes API solve the problem, or simply punt it to the 
user?  If it simply punts it to the user, are there significant benefits 
to the coder-user of obtaining the data as bytes, vs. obtaining it as 
bytes transliterated by the Latin-1 conversion to Unicode?  If there are 
significant benefits to the coder-user, what are they?

>>> One solution is to provide two sets of classes - the underlying
>>> bytes-based one, and another unicode-based one, built on top of the
>>> bytes classes, that implements the same API, but that may fail due to
>>> encoding errors.
>> I think you meant "decoding" errors, there?
> 
> Well, yes and no. I meant that the encoding was done incorrectly.

Sure.  The encoding wasn't done correctly, or wasn't done at all.  But 
that causes problems for the decoder, on the receiving side.

>> I guess I'm not terribly concerned about the readability of improperly 
>> encoded email messages, whether they are spam or ham.
> 
> You may not be, but other users of the module are.

Sure, but if it isn't properly encoded, then it is most likely in some 
ASCII superset, in which case the ASCII parts will be readable (at 
least), and with a little human cleverness, the non-ASCII parts can be 
intuited.  I'm not suggesting making it worse than what it already is 
in bytes form, just translating the bytes to Unicode codepoints so that 
they can be returned on a Unicode interface.  If you return them in 
bytes, what would you do besides that?  If you would guess at an 
encoding, and do a different decode, that can be done on the Unicode 
transliteration just as easily as it can on the bytes form.
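
A sketch of that last point, with a hypothetical helper name: given only the Latin-1 transliteration, a caller who wants to guess at another encoding can still do so, because the original bytes are recoverable from the string.

```python
def redecode(transliterated: str, guess: str) -> str:
    """Re-decode a Latin-1 transliteration using a guessed charset.

    Assumes `transliterated` came from bytes.decode("latin-1"), so every
    character is below U+0100. Falls back to the transliteration itself
    when the guess does not fit the bytes.
    """
    original_bytes = transliterated.encode("latin-1")
    try:
        return original_bytes.decode(guess)
    except UnicodeDecodeError:
        return transliterated


# A sender emitted UTF-8 without labelling it; the Unicode API handed
# back the Latin-1 transliteration of those bytes.
wire = "na\u00efve caf\u00e9".encode("utf-8")
handed_back = wire.decode("latin-1")
recovered = redecode(handed_back, "utf-8")
```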

>> For ham, the correspondent should be informed that there are problems 
>> with their software, so that they can upgrade or reconfigure it.
> 
> How do you determine the correspondent if you can't parse their e-mail? 8-)

Email addresses are pretty standardized in format.  Especially the 
Errors header and the From header.  So I think the correspondent's email 
address will be reasonably interpretable even if their name is not, and 
the body of their message is not.

I'm not saying all is wonderful if they didn't properly encode their 
message, but I think you are exaggerating the problem... you can write 
back to the email address, even if you can't read the message.
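
For example (a sketch; the function name and the sample header are made up), the address can be pulled out with email.utils.parseaddr even when the display name was sent as unencoded 8-bit text:

```python
from email.utils import parseaddr


def reply_address(raw_header_value: bytes) -> str:
    """Extract the address part of a From-style header value."""
    # Transliterate first so the parser sees a str; the display name may
    # come out garbled, but the ASCII address is not affected.
    text = raw_header_value.decode("latin-1")
    _name, addr = parseaddr(text)
    return addr


addr = reply_address(b"Jos\xe9 Garc\xeda <jose@example.com>")
```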

>> (Not that I don't understand those encodings, but it is something that
>> certainly can and should be mostly hidden under the covers.)
> 
> You're talking about a utopian state that Unicode strives but fails to achieve.

Messages that are properly encoded can certainly achieve the Utopian 
state under the covers.

Messages that are not properly encoded can be assumed to be Latin-1, and 
converted to Unicode.  They may not be perfectly readable in that state, 
but face it, non-Unicode email clients did exactly that, but used 
Latin-1 bytes directly (or some other encoding).  And if you think it 
would be helpful to have the default conversion to Unicode use some code 
page other than Latin-1, such as the currently configured code page, 
that is a fine alternative... and again, is much what happens today when 
people communicate without doing the proper encoding.  Two people that 
use the same code page can communicate in that code page, but 
communicating with people that use other code pages is problematical.
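
The code page point can be sketched as follows. The helper name is hypothetical, and locale.getpreferredencoding() stands in here for "the currently configured code page", which is an assumption about how a client would look it up.

```python
import locale


def decode_unlabelled(raw: bytes, codepage=None) -> str:
    """Decode bytes that arrived without a usable charset label."""
    if codepage is None:
        # Stand-in for the locally configured code page.
        codepage = locale.getpreferredencoding(False)
    # errors="replace" keeps the call from raising on unmappable bytes.
    return raw.decode(codepage, errors="replace")


# The sender typed Cyrillic text on a cp1251 system and did not encode it.
sent = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")

same_page = decode_unlabelled(sent, "cp1251")    # receiver shares the page
other_page = decode_unlabelled(sent, "latin-1")  # receiver does not
```

The receiver on a different code page gets garbled text, but the bytes are still all there, so nothing is worse than it was in bytes form.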

So no, Unicode doesn't solve the problems with buggy software, but it 
can be used without making the problem worse, so using it generally 
makes for a more convenient API.

Think about the coder of the Python-based email client.  Given the 
alternatives to use the Unicode API or the bytes API, how are they going 
to choose to use one or the other?  Code the application twice, once 
with each API?  No way!  Too much work!

So they'll use the Unicode API for text, and the bytes APIs for binary 
attachments, because that is what is natural.
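
A sketch of that split, assuming the policy-based API of later Python 3 email packages, where get_content() hands back str for text parts and bytes for binary parts. The function name and the sample message are invented for the example.

```python
import email
from email import policy
from email.message import EmailMessage


def part_contents(raw: bytes):
    """Return (content type, content) for every non-container part."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    return [
        (part.get_content_type(), part.get_content())
        for part in msg.walk()
        if not part.is_multipart()
    ]


# Build a small properly encoded message to feed the parser.
outgoing = EmailMessage()
outgoing["Subject"] = "logo"
outgoing.set_content("Voil\u00e0, the logo.")
outgoing.add_attachment(
    b"\x89PNG\r\n\x1a\n", maintype="image", subtype="png", filename="logo.png"
)

parts = part_contents(outgoing.as_bytes())
```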

If improperly encoded messages are received, and appropriate 
transliterations are made so that the bytes get converted (default code 
page) or passed through (Latin-1 transformation), then the data may be 
somewhat garbled for characters in the non-ASCII subset.  But that is 
not different than the handling done by any 8-bit email client, nor, I 
suspect (a little uncertainty here) different than the handling done by 
Python < 3.0 mail libraries.

So that is not Utopian; Utopia can only be reached by following 
standards.  But I don't see it as terrible; it is no worse than what 
happens today when the standards are not followed.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking