[Email-SIG] fixing the current email module

Thu Oct 8 03:10:24 CEST 2009

On Oct 3, 2009, at 11:41 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> So the basic model is: accept strings or bytes at the edges,
>> process everything internally as bytes, output strings and bytes at
>> the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

So, I've made at least two abortive attempts at updating the email  
package to Python 3, once using bytes internally and another time  
using strings internally.  Neither one was completely satisfying (to  
say the least).  I've also heard convincing arguments from folks in  
the Python community in both camps: "using anything other than strings  
internally is insane; no, using anything other than bytes internally  
is insane."

As for the internal representational format, I'll amend my previous  
statement and say that I'll keep an open mind, but one thing that  
seems very clear is that we have to be able to accept strings and  
bytes at the incoming edges, and produce strings and bytes at the  
outgoing edges.  In a later message, Stephen outlines some excellent  
use cases, to which I'll follow up when I get there.  But I think he  
generally hits the nail on the head and proves that we'll have both  
types at the edges.  That makes for very interesting API design!

There's "internal" and then there's the low-level representation that  
the model exposes.  Here I have more confidence that we need to make  
things much more consistent.  The trick is to do that while still  
making things convenient.

For example, we currently represent header values as 8-bit strings or  
Header instances. The latter can contain triples of the individual  
chunks, e.g. (content, language, charset).  I think we need to represent  
header values as instances in all cases because type checking is  
error-prone, but even then, it makes for difficult API choices.   
Still, if the fundamental atom of header values in the model is the  
Header, and we define both byte and string APIs for headers, then the  
internal representation matters less since only the email package  
implementers need to care.
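
To make the idea concrete, here is a minimal sketch of a header atom that
offers both a string API and a bytes API.  The names Chunk, HeaderValue,
from_str, from_bytes, as_str and as_bytes are all hypothetical; this is
not the existing email.header.Header API.  It also emits raw encoded
bytes only and makes no attempt at RFC 2047 encoded-words.

```python
from typing import NamedTuple, Optional


class Chunk(NamedTuple):
    """One piece of a header value (hypothetical layout)."""
    content: str             # decoded text of this piece
    language: Optional[str]  # language tag, if known
    charset: str             # charset to use when producing bytes


class HeaderValue:
    """Hypothetical model atom for a header value."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    @classmethod
    def from_str(cls, text, charset="us-ascii", language=None):
        # Range check at the edge: fail now if the text cannot be
        # encoded in the declared charset.
        text.encode(charset)
        return cls([Chunk(text, language, charset)])

    @classmethod
    def from_bytes(cls, raw, charset="us-ascii", language=None):
        # The caller must say how the bytes are to be interpreted.
        return cls([Chunk(raw.decode(charset), language, charset)])

    def as_str(self):
        return "".join(c.content for c in self._chunks)

    def as_bytes(self):
        # Each chunk is encoded with the charset it carries.
        return b"".join(c.content.encode(c.charset) for c in self._chunks)
```

Callers only ever see HeaderValue instances, so whether the chunks hold
text or bytes internally stays an implementation detail of the class.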

But note that even in this limited case, neither bytes nor strings  
really works.  The internal representation is that triple (and in the  
current model an implicit triple where charset=us-ascii).  So  
internally the charset is carried along for the ride, as it must be.   
If the internal representation were just strings or bytes, we wouldn't  
know how to generate the other format, at least not idempotently (or  
as close as we can get).
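
A small illustration of why the charset has to travel with the content,
using only the built-in codecs.  The sample text and variable names are
made up for the example.

```python
# Illustration only: the same text maps to different bytes per charset.
text = "caf\u00e9"

latin1_bytes = text.encode("iso-8859-1")  # one byte for the accented letter
utf8_bytes = text.encode("utf-8")         # two bytes for the accented letter

# Bytes alone are ambiguous: decoding with the wrong charset yields
# different text without raising any error.
misread = utf8_bytes.decode("iso-8859-1")

# Carrying the charset alongside the content makes the round trip exact.
content, charset = latin1_bytes, "iso-8859-1"
round_tripped = content.decode(charset).encode(charset)
```

With only the string there is no way to tell which of the two byte
sequences the original message used, and with only the bytes there is no
way to tell which text was meant.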

Just to ramble a little longer, it's been argued that we should give  
up on idempotency, but I'm not convinced.  I think people want to see  
an email message they throw into the system come out the other end as  
closely as possible (well, /exactly/ for well-formed messages).

-Barry
