[Python-Dev] Polymorphic best practices [was: (Not) delaying the 3.2 release]

Fri Sep 17 00:05:12 CEST 2010

On Thu, 16 Sep 2010 16:51:58 -0400
"R. David Murray" <rdmurray at bitdance.com> wrote:
>
> What do we store in the model?  We could say that the model is always
> text.  But then we lose information about the original bytes message,
> and we can't reproduce it.  For various reasons (mailman being a big one),
> this is not acceptable.  So we could say that the model is always bytes.
> But we want access to (for example) the header values as text, so header
> lookup should take string keys and return string values[2].

Why can't you have both in a single class? If you create the class
using a bytes source (a raw message sent by SMTP, for example), the
class automatically parses and decodes it to unicode strings; if you
create the class using a unicode source (the text body of the e-mail
message and the list of recipients, for example), the class
automatically creates the bytes representation.

(of course all processing can be done lazily for performance reasons)
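
To make the idea concrete, here is a minimal sketch of such a class.
Everything in it is hypothetical (the `Message` name, `from_bytes`,
`from_text`, `body`, `as_bytes`); it is not the email package's API, it
ignores MIME structure entirely, and it assumes the charset is given
rather than read from the headers:

```python
class Message:
    """Hypothetical single class holding both views of a message."""

    def __init__(self, raw=None, body=None, charset="utf-8"):
        self._raw = raw          # bytes: whole wire-format message, or None
        self._body = body        # str: decoded text body, or None
        self._charset = charset  # assumption: supplied by the caller

    @classmethod
    def from_bytes(cls, raw, charset="utf-8"):
        # Source is a raw message (e.g. received over SMTP).
        return cls(raw=raw, charset=charset)

    @classmethod
    def from_text(cls, body, charset="utf-8"):
        # Source is the text body of a message to be assembled.
        return cls(body=body, charset=charset)

    @property
    def body(self):
        # Decoded lazily, on first access only.
        if self._body is None:
            _headers, _sep, payload = self._raw.partition(b"\r\n\r\n")
            self._body = payload.decode(self._charset, "surrogateescape")
        return self._body

    def as_bytes(self):
        # Assembled lazily; a message built from bytes is returned untouched.
        if self._raw is None:
            header = "Content-Type: text/plain; charset={0}\r\n\r\n".format(
                self._charset)
            self._raw = header.encode("ascii") + self._body.encode(
                self._charset, "surrogateescape")
        return self._raw
```

A message created from bytes keeps its original bytes for reproduction,
while `body` gives text access; a message created from text only builds
its bytes form when asked.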

> What about email files on disk?  They could be bytes, or they could be,
> effectively, text (for example, utf-8 encoded). 

Such a file can be one of two things:
- the raw encoding of a whole message (including headers, etc.), then
  it should be fed as a bytes object
- the single text body of a hypothetical message, then it should be fed
  as a unicode object

I don't see any possible middle ground.

> On disk, using utf-8,
> one might store the text representation of the message, rather than
> the wire-format (ASCII encoded) version.  We might want to write such
> messages from scratch.

But then the user knows the encoding (by "user" I mean what/whoever
calls the email API) and can pass it to the email package.

What I'm having an issue with is that you are talking about a bytes
representation and a unicode representation of a message. But they
aren't representations of the same thing:
- if it's a bytes representation, it will be the whole, raw message
  including envelope / headers (also, MIME sections etc.)
- if it's a unicode representation, it will only be a section of the
  message decodable as such (a text/plain MIME section, for example;
  or a decoded header value; or even a single e-mail address part of a
  decoded header)

So, there doesn't seem to be any reason for having both a BytesMessage
and a UnicodeMessage at the same abstraction level. They represent
different things at different abstraction levels. I don't
see any potential for confusion: raw assembled e-mail message = bytes;
decoded text section of a message = unicode.

As for the problem of potential "bogus" raw e-mail data
(e.g., undecodable headers), well, I guess the library has to make a
choice between purity and practicality, or perhaps let the user choose
themselves, for example through a `strict` flag. If `strict` is true,
raise an error as soon as a non-decodable byte appears in a header; if
`strict` is false, decode it through a default (encoding, errors)
convention which can be overridden by the user (a sensible possibility
being "utf-8, surrogateescape" to allow for lossless round-tripping).
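
A rough sketch of what such a flag could look like; the function name
`decode_header_bytes` and its parameters are made up for illustration
and are not part of the email package:

```python
def decode_header_bytes(raw, strict=False,
                        encoding="utf-8", errors="surrogateescape"):
    """Decode raw header bytes to str (hypothetical helper).

    strict=True:  raise UnicodeDecodeError on the first undecodable byte.
    strict=False: use the (encoding, errors) convention, which the
                  caller can override.
    """
    if strict:
        return raw.decode(encoding, "strict")
    return raw.decode(encoding, errors)


raw = b"From: bogus\xffname"

# Lenient mode: the bad byte survives as a lone surrogate and
# encodes back to the original bytes.
text = decode_header_bytes(raw)
roundtrip = text.encode("utf-8", "surrogateescape")

# Strict mode: the same input is rejected.
try:
    decode_header_bytes(raw, strict=True)
    rejected = False
except UnicodeDecodeError:
    rejected = True
```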

> As I said above, we could insist that files on
> disk be in wire-format, and for many applications that would work fine,
> but I think people would get mad at us if we didn't support text files[3].

Again, this simply seems to be two different abstraction levels:
pre-generated raw email messages including headers, or a single text
waiting to be embedded in an actual e-mail.

> Anyway, what polymorphism means in email is that if you put in bytes,
> you get a BytesMessage, if you put in strings you get a StringMessage,
> and if you want the other one you convert.

And then you have two separate worlds, even though the same concepts
ultimately underlie both. A library accepting BytesMessage will fail
when a program hands it a StringMessage, and vice versa. That
doesn't sound very practical.

> [1] Now that surrogateescape exists, one might suppose that strings
> could be used as an 8bit channel, but that only works if you don't need
> to *parse* the non-ASCII data, just transmit it.

Well, you can in fact parse it. Not only that, but it round-trips if
you unparse it again:

>>> header_bytes = b"From: bogus\xFFname <someone at python.com>"
>>> name, value = header_bytes.decode("utf-8", "surrogateescape").split(":", 1)
>>> name
'From'
>>> value
' bogus\udcffname <someone at python.com>'
>>> "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")
b'From: bogus\xffname <someone at python.com>'

In the end, what I would call a polymorphic best practice is "try to
avoid bytes/str polymorphism if your domain is well-defined
enough" (which I admit URLs aren't necessarily; but there's no
question a single text/XXX e-mail section is text, and a whole
assembled e-mail message is bytes).

Regards

Antoine.