[Email-SIG] fixing the current email module
barry at python.org
Thu Oct 8 03:10:24 CEST 2009
On Oct 3, 2009, at 11:41 AM, Stephen J. Turnbull wrote:
> Barry Warsaw writes:
>> So the basic model is: accept strings or bytes at the edges,
>> process everything internally as bytes, output strings and bytes at
>> the edges.
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.
So, I've taken at least two abortive attempts at updating the email
package to Python 3, once using bytes internally and another time
using strings internally. Neither one was completely satisfying (to
say the least). I've also heard convincing arguments from folks in
the Python community in both camps: "using anything other than strings
internally is insane; no, using anything other than bytes internally
is insane."
As for the internal representational format, I'll amend my previous
statement and say that I'll keep an open mind, but one thing that
seems very clear is that we have to be able to accept strings and
bytes at the incoming edges, and produce strings and bytes at the
outgoing edges. In a future message, Stephen outlines some excellent
use cases, to which I'll follow up when I get there. But I think he
generally hits the nail on the head and proves that we'll have both
types at the edges. That makes for very interesting API design!
There's "internal" and then there's the low-level representation that
the model exposes. Here I have more confidence that we need to make
things much more consistent. The trick is to do that while still
making things convenient.
For example, we currently represent header values as 8-bit strings or
Header instances. The latter can contain triples of the individual
chunks, e.g. (content, language, charset). I think we need to represent
header values as instances in all cases because the type checking is
error prone, but even then, it makes for difficult API choices.
Still, if the fundamental atom of header values in the model is the
Header, and we define both byte and string APIs for headers, then the
internal representation matters less since only the email package
implementers need to care.
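
To make the "Header as the fundamental atom, with both byte and string
APIs" idea concrete, here is a minimal sketch. The HeaderValue class and
its method names are hypothetical, not a proposed API; it leans on
email.header.Header, decode_header and make_header as they exist in the
standard library, and assumes the byte form is the ASCII-only RFC 2047
wire encoding.

```python
from email.header import Header, decode_header, make_header


class HeaderValue:
    """Hypothetical wrapper: one header atom exposing two views."""

    def __init__(self, text, charset="utf-8"):
        # The charset is stored alongside the text inside the Header.
        self._header = Header(text, charset=charset)

    @classmethod
    def from_bytes(cls, raw):
        # Assumption: raw is the RFC 2047 wire form, so it is pure ASCII.
        obj = cls.__new__(cls)
        obj._header = make_header(decode_header(raw.decode("ascii")))
        return obj

    def as_string(self):
        # Human-readable (decoded) view.
        return str(self._header)

    def as_bytes(self):
        # Wire view: RFC 2047 encoded words, ASCII only.
        return self._header.encode().encode("ascii")


value = HeaderValue("Grüße")
wire = value.as_bytes()
restored = HeaderValue.from_bytes(wire).as_string()
```

Callers only ever touch as_string() and as_bytes(); whether the object
stores bytes, strings, or chunks internally stays an implementation
detail.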
But note that even in this limited case, neither bytes nor strings
really works. The internal representation is that triple (and in the
current model an implicit triple where charset=us-ascii). So
internally the charset is carried along for the ride, as it must be.
If the internal representation were just strings or bytes, we wouldn't
know how to generate the other format, at least not idempotently (or
as close as we can get).
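
A small sketch of why the charset has to ride along. The Chunk type and
the helper names below are made up for illustration; the point is only
that the same bytes mean different strings under different charsets, so
bytes alone (or strings alone) cannot regenerate the other form.

```python
from typing import NamedTuple, Optional


class Chunk(NamedTuple):
    """Hypothetical (content, language, charset) triple."""
    content: bytes
    language: Optional[str]
    charset: str


def to_text(chunk):
    # The string view needs the charset to interpret the bytes.
    return chunk.content.decode(chunk.charset)


def to_bytes(text, charset):
    # The bytes view needs the charset to get back to the same bytes.
    return text.encode(charset)


raw = b"Gr\xfc\xdfe"

# Same bytes, two different strings, depending on the assumed charset.
as_latin1 = raw.decode("iso-8859-1")
as_cp1251 = raw.decode("cp1251")

# With the charset carried along, the round trip is exact.
chunk = Chunk(raw, None, "iso-8859-1")
round_tripped = to_bytes(to_text(chunk), chunk.charset)
```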
Just to ramble a little longer, it's been argued that we should give
up on idempotency, but I'm not convinced. I think people want to see
an email message they throw into the system come out the other end as
closely as possible (well, /exactly/ for well-formed messages).
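
The property being asked for can be written down as a simple check. The
sketch below assumes a present-day Python 3 standard library
(email.message_from_bytes and Message.as_bytes, which postdate this
message) and a deliberately simple, well-formed message; it illustrates
the goal rather than proving it for arbitrary input.

```python
import email

# A simple, well-formed message (sample addresses are placeholders).
raw = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: round trip\n"
    b"\n"
    b"Hello, world.\n"
)

# Parse bytes at the incoming edge, emit bytes at the outgoing edge.
msg = email.message_from_bytes(raw)
flattened = msg.as_bytes()

# Idempotency in the sense used above: what went in comes back out.
unchanged = flattened == raw
```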