[Python-Dev] Polymorphic best practices [was: (Not) delaying the 3.2 release]

Thu Sep 16 22:51:58 CEST 2010

On Thu, 16 Sep 2010 17:40:53 +0200, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thu, 16 Sep 2010 11:30:12 -0400
> "R. David Murray" <rdmurray at bitdance.com> wrote:
> > 
> > And then BaseHeader uses self.lit.colon, etc, when manipulating strings.
> > It also has to use slice notation rather than indexing when looking at
> > individual characters, which is a PITA but not terrible.
> > 
> > I'm not saying this is the best approach, since this is all experimental
> > code at the moment, but it is *an* approach....
> 
> Out of curiosity, can you explain why polymorphism is needed for
> e-mail? I would assume that headers are bytes until they are parsed, at
> which point they become a pair of unicode strings (one for the header
> name and one for its value).

Currently email accepts strings as input, and produces strings as output.

It needs to also accept bytes as input, and emit bytes as output, because
unicode can only be used as a 7-bit clean data transmission channel,
and that's too restrictive for many email applications (many of which
need to deal with "dirty" (non-RFC conformant) 8bit data). [1]

Backward compatibility says "case closed".

If we were designing from scratch, we could insist that input to the
parser is always bytes, and when the model is serialized it always
produces bytes.  It is possible that one could live with that, but I
don't think it is optimal.

Given a message, there are many times you want to serialize it as text
(for example, for presentation in a UI).  You could provide alternate
serialization methods to get text out on demand....but then what if
someone wants to push that text representation back in to email to
rebuild a model of the message?  So now we have both a bytes parser
and a string parser.

What do we store in the model?  We could say that the model is always
text.  But then we lose information about the original bytes message,
and we can't reproduce it.  For various reasons (mailman being a big one),
this is not acceptable.  So we could say that the model is always bytes.
But we want access to (for example) the header values as text, so header
lookup should take string keys and return string values[2].  But for
certain types of processing, particularly examination of "dirty",
non-RFC conforming input data, you need to be able to access the raw
bytes data.

What about email files on disk?  They could be bytes, or they could be,
effectively, text (for example, utf-8 encoded).  On disk, using utf-8,
one might store the text representation of the message, rather than
the wire-format (ASCII encoded) version.  We might want to write such
messages from scratch.  As I said above, we could insist that files on
disk be in wire-format, and for many applications that would work fine,
but I think people would get mad at us if we didn't support text files[3].

So, after much discussion, what we arrived at (so far!) is a model
that mimics the Python3 split between bytes and strings.  If you
start with bytes input, you end up with a BytesMessage object.
If you start with string input to the parser, you end up with a
StringMessage.  If you have a BytesMessage and you want to do
something with the text version of the message, you decode it:

    print(mymsg.decode())

If the message is RFC conformant, the message contains all the information
needed to decode it correctly.  If it's not conformant, email does the
best it can and registers defects for the non-conformant bits (or,
optionally, email6 will raise errors when the policy is set to strict).
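
The difference between registering defects and raising them can be
sketched with the email package that ships with current Python 3.  This
is not the BytesMessage API described here; the sample message and the
variable names are made up for the illustration.

```python
import email
import email.errors
import email.policy

# A message that claims to be multipart but never defines a boundary.
raw = b"Content-Type: multipart/mixed\r\n\r\nnot really multipart\r\n"

# Lenient parsing: the parser does the best it can and records defects.
msg = email.message_from_bytes(raw, policy=email.policy.default)
defect_types = [type(d) for d in msg.defects]

# Strict parsing: the same problem is raised as an exception instead.
strict = email.policy.default.clone(raise_on_defect=True)
try:
    email.message_from_bytes(raw, policy=strict)
    raised = None
except email.errors.MessageDefect as exc:
    raised = exc
```

In the lenient case the message object is still usable and the caller
decides what to do about the recorded defects.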

If you have a StringMessage and you want to use it where wire-format is
needed, you encode it:

    outmsg = mymsg.encode()
    smtpserver.sendmail(
        bytes(outmsg['from']),
        [bytes(x) for x in itertools.chain(
            outmsg['to'], outmsg['cc'], outmsg['bcc'])],
        outmsg.serialize(policy=email.policy.SMTP))

Encoding uses the utf-8 character set by default, but this can be modified
by changing the policy.  The trick for gathering the list of addresses is
how I *think* that part of the API is going to work:  iterating the object
that models an address header gives you a list of address objects, and
converting one of those to a bytes string gives you the wire-format byte
string representing a single address.  Also note that this is the new API;
in compatibility mode (which is controlled by the policy) you'd get the
old behavior of just getting the string representation of the whole header
back (but then you'd have to parse it to turn it into a list of addresses).

The point here is that because we've encoded the message to a
BytesMessage, what we get when we turn the pieces into a bytes string
are the wire-format byte strings that are required for transmission;
for example, non-ASCII characters will be encoded according to
the policy and then RFC2047 transfer encoded as needed.
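
As a small illustration of "encoded according to the policy and then
RFC2047 transfer encoded", here is a sketch using email.header from the
existing standard library rather than the email6 API.  The display name
is invented and utf-8 is assumed as the character set.

```python
from email.header import Header, decode_header

display_name = "Björn Müller"  # made-up name with non-ASCII characters

# Encode the text as utf-8, then wrap it in an RFC 2047 encoded-word so
# that the result contains only ASCII characters.
encoded = Header(display_name, charset="utf-8").encode()
wire = encoded.encode("ascii")  # the bytes that would go on the wire

# Decoding reverses both steps and recovers the original text.
roundtrip = "".join(
    chunk.decode(charset) for chunk, charset in decode_header(encoded))
```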

At this point you may notice there's a problem with the example above.
We actually need to decode each of those byte strings using the ASCII
codec before passing them as arguments to smtplib, since smtplib in
Python3 expects string arguments.  If smtplib were polymorphic we
could pass in the bytes strings directly.  In that case if a string
were passed in instead, smtplib could call some utility routines from
email to encode the text into bytes using the RFC2047 conventions.
As it stands now, there's no *easy* way for a user program to construct a
list of addresses that require RFC2047 encoding and pass it to smtplib.
(This last item is just as much a problem in Python2, by the way.)
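
The ASCII decoding step can be sketched as below.  The helper name and
the addresses are hypothetical; the only assumption is that the input
is already wire-format, so anything non-ASCII has been RFC2047 encoded.

```python
def to_smtplib_address(wire_address: bytes) -> str:
    """Decode a wire-format address to the str that smtplib expects.

    Raises UnicodeDecodeError if the input contains 8bit data, which
    means it was not valid wire-format to begin with.
    """
    return wire_address.decode("ascii")


sender = to_smtplib_address(b"alice@example.com")
recipients = [
    to_smtplib_address(addr)
    for addr in (b"bob@example.com",
                 b"=?utf-8?q?Bj=C3=B6rn?= <bjorn@example.com>")
]
```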

This is probably not the right thing to do, though, because that isn't
the kind of polymorphism we're talking about.  When accepting input to
sendmail, smtplib is always bytes out, so having it accept both bytes and
strings as input is probably wrong[4].  Especially since the message body
needs to be passed in in wire-format, because smtplib should not have to
know how to convert text into wire-format...that's the email module's job.

Instead smtplib could take a Message object as input, and do that
serialize call itself.  In which case it could also figure out the
addresses by itself, and/or accept email address objects for the from
and to parameters.
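
A rough sketch of "figure out the addresses by itself", using the
existing email.utils.getaddresses helper.  The function name
envelope_addresses and the sample message are made up, and taking the
first From address as the sender is a simplifying assumption.

```python
from email import message_from_string
from email.utils import getaddresses


def envelope_addresses(msg):
    """Return (sender, recipients) taken from the message headers."""
    senders = getaddresses(msg.get_all("from", []))
    recipients = getaddresses(
        msg.get_all("to", [])
        + msg.get_all("cc", [])
        + msg.get_all("bcc", []))
    sender = senders[0][1] if senders else None
    return sender, [addr for _name, addr in recipients]


msg = message_from_string(
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>, carol@example.com\n"
    "Cc: dave@example.com\n"
    "\n"
    "Hello\n")
sender, recipients = envelope_addresses(msg)
```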

You can see what a can of worms this stuff is :)  This is what I meant
about carefully examining the API contract before blindly providing
polymorphism.  For email, a wire-format bytes string *contains* encoding
information, and you have to stay aware of that as you redesign the
bytes/string interface.

Anyway, what polymorphism means in email is that if you put in bytes,
you get a BytesMessage, if you put in strings you get a StringMessage,
and if you want the other one you convert.
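
In toy form, that contract looks something like the following.  The
class bodies and the parse function are placeholders invented for this
sketch; a real conversion would follow the message's own encoding
information and the policy rather than a single codec.

```python
class BytesMessage:
    """Toy stand-in: holds wire-format bytes."""

    def __init__(self, data: bytes):
        self.data = data

    def decode(self, charset: str = "utf-8") -> "StringMessage":
        return StringMessage(self.data.decode(charset))


class StringMessage:
    """Toy stand-in: holds fully decoded text."""

    def __init__(self, text: str):
        self.text = text

    def encode(self, charset: str = "utf-8") -> BytesMessage:
        return BytesMessage(self.text.encode(charset))


def parse(source):
    """Return the message type that matches the input type."""
    if isinstance(source, bytes):
        return BytesMessage(source)
    if isinstance(source, str):
        return StringMessage(source)
    raise TypeError("expected bytes or str, got %s" % type(source).__name__)
```

The input type alone selects the model type, and moving to the other
type is always an explicit decode() or encode() call.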

I'm giving consideration to additional polymorphism, such as having the
use of a key of a particular type return a value of that type.  That is,
looking up the subject by the key 'subject' would get you a StringHeader
regardless of whether you were looking it up in a BytesMessage or a
StringMessage.  But I'm still thinking about whether or not that is a
good idea; I need to write up some more example code to convince myself
one way or another.  The sendmail example above is an example on the
"no" side:  you'll note that in that example the natural thing to do was
to use string keys, but get bytes out.
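
The key-typed lookup under consideration might look roughly like this.
The Headers class is purely hypothetical, and plain str and bytes
values stand in for StringHeader and BytesHeader objects.

```python
class Headers:
    """Toy header store that keeps the raw bytes for each header."""

    def __init__(self, raw_headers):
        # raw_headers maps bytes header names to bytes header values.
        self._raw = {name.lower(): value
                     for name, value in raw_headers.items()}

    def __getitem__(self, key):
        # The type of the key decides the type of the result.
        if isinstance(key, bytes):
            return self._raw[key.lower()]
        if isinstance(key, str):
            raw = self._raw[key.lower().encode("ascii")]
            return raw.decode("ascii")
        raise TypeError("header names must be bytes or str")


headers = Headers({b"Subject": b"Hello"})
as_text = headers["subject"]    # str key, str value
as_bytes = headers[b"subject"]  # bytes key, bytes value
```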

Well, that was probably more than you wanted to know or read, but
hopefully it will give some perspective on what's involved here.

Feedback on any of this is welcome.  I've got a hole in my schedule next
week that I'm planning on filling with email6 work, so any feedback will
be grist for the mill.  Anyone interested should also sign up for
the email-sig mailing list and provide feedback when I start posting
there again (which, as I said, should be next week).

--
R. David Murray                                      www.bitdance.com

[1] Now that surrogateescape exists, one might suppose that strings
could be used as an 8bit channel, but that only works if you don't need
to *parse* the non-ASCII data, just transmit it.  email needs to parse it.
In theory email6 could decode the bytes using surrogateescape and
then process, but the infrastructure to handle that still looks like
what is described above, so it makes more sense to accept bytes
directly.
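
A short sketch of the round trip that surrogateescape provides.  The
header line is a made-up sample with one latin-1 byte in it.

```python
raw = b"Subject: caf\xe9\r\n"  # 0xe9 is not valid ASCII

# The undecodable byte is smuggled through as a lone surrogate.
text = raw.decode("ascii", errors="surrogateescape")
has_surrogate = "\udce9" in text

# Encoding with the same error handler restores the original bytes.
restored = text.encode("ascii", errors="surrogateescape")
```

The bytes survive transmission this way, but the surrogate carries no
information about which character was meant, so parsing it is still a
problem.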

[2] actually they return StringHeader objects, but the principle is
the same.

[3] note that you can also have 7bit clean wire format messages stored
on disk as text.  These can be read as text, but my current thought is
that you must give them to email as bytes (that is, encode them using the
ASCII codec).  For email6 in the current design, bytes means wire format,
and text/string means fully decoded.

[4] unless the strings consist only of 7bit clean wire-format ASCII
characters, as is required now.  So what we will probably end up with
is smtplib.sendmail accepting both bytes and strings for backward
compatibility, but string input must continue to be (the equivalent of)
the ASCII decode of 7bit clean wire-format data.