[Python-Dev] Polymorphic best practices [was: (Not) delaying the 3.2 release]

Fri Sep 17 04:51:04 CEST 2010

On Fri, 17 Sep 2010 00:05:12 +0200, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thu, 16 Sep 2010 16:51:58 -0400
> "R. David Murray" <rdmurray at bitdance.com> wrote:
> >
> > What do we store in the model?  We could say that the model is always
> > text.  But then we lose information about the original bytes message,
> > and we can't reproduce it.  For various reasons (mailman being a big one),
> > this is not acceptable.  So we could say that the model is always bytes.
> > But we want access to (for example) the header values as text, so header
> > lookup should take string keys and return string values[2].
> 
> Why can't you have both in a single class? If you create the class
> using a bytes source (a raw message sent by SMTP, for example), the
> class automatically parses and decodes it to unicode strings; if you
> create the class using a unicode source (the text body of the e-mail
> message and the list of recipients, for example), the class
> automatically creates the bytes representation.
> 
> (of course all processing can be done lazily for performance reasons)

Certainly we could do that.  There are methods, though, whose
implementation is the same except for the detail of whether they are
processing bytes or string, so the dual class structure allows that
implementation to be shared.  So even if we changed the API to be single
class, I might well retain the dual class implementation under the
hood.  I'd have to explore which looked better when the time came.
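
As a rough illustration of that kind of sharing (the class names are
hypothetical and this is not the email6 code), the logic can live once in
a base class while each subclass supplies only the type-specific constant:

```python
class _BaseHeaderParser:
    """Shared logic; subclasses supply the type-specific separator."""

    colon = None  # set to b":" or ":" by the subclasses below

    def parse_headers(self, data):
        # Simplified on purpose: folded (continued) header lines are
        # not handled in this sketch.
        headers = []
        for line in data.splitlines():
            if not line:  # a blank line ends the header block
                break
            name, _, value = line.partition(self.colon)
            headers.append((name.strip(), value.strip()))
        return headers


class BytesHeaderParser(_BaseHeaderParser):
    colon = b":"


class StringHeaderParser(_BaseHeaderParser):
    colon = ":"
```

Each parser returns values of the same type it was given; the method body
is written once.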

> > What about email files on disk?  They could be bytes, or they could be,
> > effectively, text (for example, utf-8 encoded). 
> 
> Such a file can be two things:
> - the raw encoding of a whole message (including headers, etc.), then
>   it should be fed as a bytes object
> - the single text body of a hypothetical message, then it should be fed
>   as a unicode object
> I don't see any possible middle-ground.

It's not a middle ground, but as I discussed in my response to Glyph,
it could be a series of headers and a body in, say, utf-8 where the
application wants to treat them as unicode, not bytes (i.e., *not*
an email).  Python2 supports this use case, albeit with the same
"works most of the time" as it does with other non-ascii edge cases.

> > On disk, using utf-8,
> > one might store the text representation of the message, rather than
> > the wire-format (ASCII encoded) version.  We might want to write such
> > messages from scratch.
> 
> But then the user knows the encoding (by "user" I mean what/whoever
> calls the email API) and mentions it to the email package.

Yes?  And then?  The email package still has to parse the file, and it
can't use its normal parse-the-RFC-data parser because the file could
contain *legitimate* non-ASCII header data.  So there has to be a separate
parser for this case that will convert the non-ASCII data into RFC2047
encoded data.  At that point you have two parsers that share a bunch of
code...and my current implementation lets the input to the second parser
be text, which is the natural representation of that data, the one the
user or application writer is going to expect.  I *could* implement it
as a variant bytes parser, and have the application call the variant
parser with encoded bytes, but why?  What's the benefit?  If the API
takes text, it is *obvious* that non-ascii data is allowed and is going
to get wire-encoded.  If it takes bytes....there is more mental overhead
in figuring out which bytes-parser interface one should call, depending
on whether one has "wire format" data or encoded non-ascii data.  I can
just imagine someone using the bytes-that-need-transfer-encoding to try
to parse a file containing RFC encoded data that he knows is stored in a
utf-8 encoded file, because that's the interface that accepts an encoding
parameter.  And then the RFC2047 encoded words wouldn't get decoded.

Overall it seems simpler to me that text file == pass text to the text
parser, RFC-encoded bytes data == pass bytes data to the bytes parser.
This also separates opening the file correctly (specify the encoding on
open) from encoding the data as you prefer (encoding specified to the
email package when telling it to encode to wire format).
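
To make the text-in, wire-format-out direction concrete, here is a small
sketch that uses the existing email.header module as a stand-in for the
text parser discussed above (the sample value and variable names are
made up for illustration).  A non-ASCII header value supplied as text
becomes an RFC2047 encoded word, and decoding that word recovers the text:

```python
from email.header import Header, decode_header, make_header

# Text as an application would hold it after reading a utf-8 text file.
text_value = "Café Müller"

# Encode to wire format: the result is an ASCII-only RFC2047 encoded word.
wire_value = Header(text_value, charset="utf-8").encode()

# Decode the wire format back to text.
decoded = str(make_header(decode_header(wire_value)))
```

The encoding is chosen at the point of wire-encoding, independently of
whatever encoding was used to open the file.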

> What I'm having an issue with is that you are talking about a bytes
> representation and a unicode representation of a message. But they
> aren't representations of the same things:
> - if it's a bytes representation, it will be the whole, raw message
>   including envelope / headers (also, MIME sections etc.)
> - if it's a unicode representation, it will only be a section of the
>   message decodable as such (a text/plain MIME section, for example;
>   or a decoded header value; or even a single e-mail address part of a
>   decoded header)

Conceptually, a BytesMessage is a model of the entire message with all
the parts encoded in RFC wire-format.  When you access pieces of it,
you get the RFC encoded byte strings.  Conceptually a StringMessage
is a model of the entire message with all the parts decoded as far
as possible.  This means that header values are unicode, and jpeg
images are...jpeg images.  When you access pieces of it, you get the
most useful kind of object for manipulating that piece in a program.
(So perhaps StringMessage is a bad label).

This split is about making the API simple, in my mind.  But as I
said to Glyph, perhaps I am wrong about this being the simpler/easier
to understand API.

> So, there doesn't seem to be any reason for having both a BytesMessage
> and a UnicodeMessage at the same abstraction level. They are both
> representing different things at different abstraction levels. I don't
> see any potential for confusion: raw assembled e-mail message = bytes;
> decoded text section of a message = unicode.

Perhaps my explanation above helps clarify this?  They are only at the
same level of abstraction in the sense that encoded byte strings and
decoded unicode strings are at the same abstraction level.  That is,
they aren't.

> As for the problem of potential "bogus" raw e-mail data
> (e.g., undecodable headers), well, I guess the library has to make a
> choice between purity and practicality, or perhaps let the user choose
> themselves. For example, through a `strict` flag. If `strict` is true,
> raise an error as soon as a non-decodable byte appears in a header, if
> `strict` is false, decode it through a default (encoding, errors)
> convention which can be overridden by the user (a sensible possibility
> being "utf-8, surrogateescape" to allow for lossless round-tripping).

Yes, there will be a strict setting, and error reporting, and optional
control over the default encoding and error handler.
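
A minimal sketch of how such a setting could behave.  The function name
and its parameters are hypothetical, and treating "strict" as "pure ASCII
only" is an assumption made for the example, not a description of the
eventual email6 API:

```python
def decode_header_bytes(raw, strict=False,
                        encoding="utf-8", errors="surrogateescape"):
    """Decode raw header bytes to text (hypothetical helper)."""
    if strict:
        # Assumption: strict mode rejects anything that is not ASCII,
        # raising UnicodeDecodeError at the first offending byte.
        return raw.decode("ascii", "strict")
    # Lenient mode: the default (encoding, errors) pair smuggles
    # undecodable bytes through as lone surrogates, so that encoding
    # with the same pair reproduces the original bytes.
    return raw.decode(encoding, errors)


raw = b"bogus\xffname"
lenient = decode_header_bytes(raw)
round_tripped = lenient.encode("utf-8", "surrogateescape")
```

With the defaults, the undecodable byte survives a decode/encode round
trip; with strict=True the same input raises UnicodeDecodeError.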

> > As I said above, we could insist that files on
> > disk be in wire-format, and for many applications that would work fine,
> > but I think people would get mad at us if we didn't support text files[3].
> 
> Again, this simply seems to be two different abstraction levels:
> pre-generated raw email messages including headers, or a single text
> waiting to be embedded in an actual e-mail.

But why not a text (unicode/utf8) representation of a message on disk,
including headers?  Why should that not be supported?  (If it were
a lot of extra work to support it, I'd drop it.  But it isn't.)

> > Anyway, what polymorphism means in email is that if you put in bytes,
> > you get a BytesMessage, if you put in strings you get a StringMessage,
> > and if you want the other one you convert.
> 
> And then you have two separate worlds while ultimately the same
> concepts are underlying. A library accepting BytesMessage will crash
> when a program wants to give a StringMessage and vice-versa. That
> doesn't sound very practical.

Yes, and a library accepting bytes will crash when a program wants
to give it a string.  So?  That's how Python3 works.  Unless, of
course, the application decides to be polymorphic :)
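
For illustration, a small sketch of both behaviours; the function name is
hypothetical.  Mixing the two types fails, while a function that picks its
constants according to the type of its input works on either:

```python
# Mixing bytes and str is an error in Python3.
mixing_error = None
try:
    b"From: someone".partition(":")
except TypeError as exc:
    mixing_error = exc


def split_header(line):
    """Split 'name: value', returning the same type that was passed in."""
    colon = b":" if isinstance(line, bytes) else ":"
    name, _, value = line.partition(colon)
    return name, value.strip()
```

Calling split_header with bytes yields bytes and calling it with str
yields str; the caller, not the function, decides which world it lives in.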

> > [1] Now that surrogateescape exists, one might suppose that strings
> > could be used as an 8bit channel, but that only works if you don't need
> > to *parse* the non-ASCII data, just transmit it.
> 
> Well, you can parse it, precisely. Not only, but it round-trips if you
> unparse it again:
> 
> >>> header_bytes = b"From: bogus\xFFname <someone at python.com>"
> >>> name, value = header_bytes.decode("utf-8", "surrogateescape").split(":")
> >>> name
> 'From'
> >>> value
> ' bogus\udcffname <someone at python.com>'
> >>> "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")
> b'From: bogus\xffname <someone at python.com>'

Well, yes, the email module could accept surrogateescape decoded 8bit
input, and encode it to bytes internally (remember that non-header 8bit
data can be valid data).  But that only changes how the bytes get into
the email module, it doesn't change anything else.  I don't see it as
an improvement over just accepting bytes.

On the other hand, that might be a way to make the current API work
at least a little bit better with 8bit input data.  I'll have to think
about that...

> In the end, what I would call a polymorphic best practice is "try to
> avoid bytes/str polymorphism if your domain is well-defined
> enough" (which I admit URLs aren't necessarily; but there's no
> question a single text/XXX e-mail section is text, and a whole
> assembled e-mail message is bytes).

Hmm.  Yes, we have strayed quite far from the original question into
the broader motivations behind the current email6 API.  Having gone
through this discussion, I realize now that the design isn't really
polymorphic in the strict sense.  Instead it is about handling both
text and bytes use cases with the simplest API I could come up with.
Rather than polymorphism in the email6 interface, what we really have
is a higher-abstraction-level equivalent of the bytes/string split made
by Python3.

And that makes sense to me as a direct outcome of that Python3 split,
and I suspect it is something that may need to be replicated elsewhere in
the stdlib (if I'm right!), such as in an as-yet-nonexistent IRI module.

--David