[Python-Dev] Polymorphic best practices [was: (Not) delaying the 3.2 release]

R. David Murray rdmurray at bitdance.com
Fri Sep 17 03:34:26 CEST 2010

On Thu, 16 Sep 2010 18:11:30 -0400, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
> On Sep 16, 2010, at 4:51 PM, R. David Murray wrote:
> > Given a message, there are many times you want to serialize it as text
> > (for example, for presentation in a UI).  You could provide alternate
> > serialization methods to get text out on demand....but then what if
> > someone wants to push that text representation back in to email to
> > rebuild a model of the message?
> You tell them "too bad, make some bytes out of that text."  Leave it up
> to the application.  Period, the end, it's not the library's job.  If
> you pushed the text out to a 'view message source' UI representation,
> then the vicissitudes of the system clipboard and other encoding and
> decoding things may corrupt it in inscrutable ways.  You can't fix it. 
> Don't try.

Say we start with this bytes input:

    To: Glyph Lefkowitz <glyph at twistedmatrix.com>
    From: "R. David Murray" <rdmurray at bitdance.com>
    Subject: =?utf-8?q?p=C3=B6stal?=

    A simple message.

Part of the responsibility of the email module is to provide that
in text form on demand, so the application gets:

    To: Glyph Lefkowitz <glyph at twistedmatrix.com>
    From: "R. David Murray" <rdmurray at bitdance.com>
    Subject: pöstal

    A simple message.

Now the application allows the user to do some manipulation of that, and
we have:

    To: "R. David Murray" <rdmurray at bitdance.com>
    From: Glyph Lefkowitz <glyph at twistedmatrix.com>
    Subject: Re: pöstal

    A simple reply.

How does the application "make some bytes out of that text" before passing
it back to email?  The application certainly shouldn't have to know how
to do RFC2047 encoding; that's one of the jobs of the email module.
If the application just encodes the above as UTF8, then it also has to
be calling an email API that knows it is getting bytes input that has
not been transfer-encoded, and needs to be told the encoding, so that
it can do the correct transfer encoding.  In that case why not have the
API be "pass in the text", with an optional override for the default
utf-8 encoding that email will otherwise use?
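
As a rough illustration of the work the application should not have to
do by hand, here is a minimal sketch of the decode, edit, re-encode round
trip for the Subject header.  It uses the present-day standard library
email.header helpers, not the email6 API under discussion, and the
variable names are made up for the example.

```python
from email.header import Header, decode_header, make_header

# RFC 2047 encoded-word form of the subject as it appears on the wire.
# In UTF-8, "ö" is the two bytes C3 B6.
wire_subject = "=?utf-8?q?p=C3=B6stal?="

# Wire form -> text: this is what the application wants to display.
text_subject = str(make_header(decode_header(wire_subject)))

# The user edits the text; no knowledge of RFC 2047 is needed here.
reply_subject = "Re: " + text_subject

# Text -> wire form: the library, not the application, picks the
# transfer encoding once it has been told the charset.
wire_reply = Header(reply_subject, "utf-8").encode()

# Decoding the result again recovers the edited text.
round_tripped = str(make_header(decode_header(wire_reply)))
```

The only thing the application supplies is text plus, optionally, a
charset; the encoded-word syntax never leaks into application code.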

Perhaps some of the disconnect here with Antoine (and Jean-Paul, on IRC)
is that the email-sig feels that the format of data handled by the email
module (rfcx822-style headers, perhaps with a body, perhaps including MIME
attachments) is of much wider utility than just handling email, and that
since the email module already has to be very liberal in what it accepts,
it isn't much of a stretch to have it handle those use cases as well (and
in Python2 it does, in the same 'most of the time' way it handles other
non-ASCII byte issues).  In that context, it seems perfectly reasonable to
expect it to parse string (unicode) headers containing non-ascii data.
In such use cases there might be no reason to encode to email RFC
wire-format, and thus an encode-to-bytes-and-tell-me-the-encoding
interface wouldn't serve the use case particularly well because the
application wouldn't want the RFC2047 encoding in the file version of
the data.

We could conceivably drop those use cases if it simplified the API and
implementation, but right now it doesn't feel like it does.  Further,
Python2 serves these use cases, because you can read the non-ascii
data and process it as binary data and it would all just work (most of
the time).  So such use cases probably do exist out in the wild (but
no, we don't have any specific pointers, though I myself was working
on such an app once that never got to production).  If Python3 email
parses only bytes, then it could serve the use case in somewhat the
same way as Python2: the application would encode the data as, say,
utf8 and pass it to the 'wire format bytes' input interface, which would
then register a defect but otherwise pass the data along to the 'wire'
(the file in this case).  On read it would again register a defect, and
the application could pull the data out using the 'give me the wire-bytes'
interface and decode it itself.

But this feels yucky to me, like a regression to Python2's conflation
of bytes and text.  This type of application really wants to work with
unicode, not to have to futz with bytes.

> > So now we have both a bytes parser and a string parser.
> Why do so many messages on this subject take this for granted?  It's
> wrong for the email module just like it's wrong for every other package.
> There are plenty of other (better) ways to deal with this problem.  Let
> the application decide how to fudge the encoding of the characters back
> into bytes that can be parsed.  "In the face of ambiguity, refuse the
> temptation to guess" and all that.  The application has more of an idea
> of what's going on than the library here, so let it make encoding
> decisions.
> Put another way, there's nothing wrong with having a text parser, as
> long as it just encodes the text according to some known encoding and
> then parses the bytes :).

See above for why I don't think that serves all the use cases for text.

Perhaps another difference is that in my mind *as an application
developer*, the "real" email message consists of unicode headers and
message bodies, with attachments that are sometimes binary, and that
the wire-format is this formalized encoding we have to use to be able
to send it from place to place.  In that mental model it seems to make
perfect sense to have a StringMessage that I have to encode to transmit,
and a BytesMessage that I receive and have to decode to work with.
Just like I decode generic bytes strings that I get from outside my
program and encode my text strings to emit them.  In this email design,
I'm just doing the encode/decode at a higher level of abstraction.

So, forget about the implementation.  What's a better object model/API
for the email package to use?  Keep in mind that at all levels of the
model there are applications that need to access the bytes representation,
and applications that need to access the string representation.  I came
up with the two-class API because it seemed simplest from a user point
of view: you take in bytes input and get a BytesMessage, which you
either manipulate or convert to a StringMessage and then manipulate,
depending on your application, or vice versa.  The alternative seems
to be to have two methods for almost every API call, one that accepts or
returns string and another that accepts or returns bytes.
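
A minimal sketch of that single-class alternative, with every accessor
doubled, might look like the following.  The class and all of its method
names are hypothetical, chosen only to show the shape of such an API.

```python
from email.header import Header


class Message:
    """Hypothetical single-class model: text is stored internally and
    every accessor comes in a string flavour and a bytes flavour."""

    def __init__(self, subject: str, body: str):
        self._subject = subject
        self._body = body

    def subject_as_string(self) -> str:
        return self._subject

    def subject_as_bytes(self, charset: str = "utf-8") -> bytes:
        # RFC 2047 encode on demand for the wire.
        return Header(self._subject, charset).encode().encode("ascii")

    def body_as_string(self) -> str:
        return self._body

    def body_as_bytes(self, charset: str = "utf-8") -> bytes:
        return self._body.encode(charset)


msg = Message("Re: pöstal", "A simple reply.\n")
shown_to_user = msg.subject_as_string()
sent_on_wire = msg.subject_as_bytes()
```

Even in this toy version the method count doubles for each attribute,
which is the cost being weighed against the two-class design.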

Perhaps others think that the latter is better, but the email-sig
liked my idea, so that's what the current code base implements :)

> > So, after much discussion, what we arrived at (so far!) is a model
> > that mimics the Python3 split between bytes and strings.  If you
> > start with bytes input, you end up with a BytesMessage object.
> > If you start with string input to the parser, you end up with a
> > StringMessage.
> That may be a handy way to deal with some grotty internal implementation
> details, but having a 'decode()' method is broken.  The thing I care

Why is having a decode method broken?

> about, as a consumer of this API, is that there is a clearly defined
> "Message" interface, which gives me a uniform-looking place where I can
> ask for either characters (if I'm displaying them to the user) or bytes
> (if I'm putting them on the wire).  I don't particularly care where
> those bytes came from.  I don't care what decoding tricks were necessary
> to produce the characters.

Exactly.  But how does having Bytes and String message objects not
provide this?  decode and encode hide all those grotty details from the
higher level application.

If you are worried that at some point in your application you might
not know if you have a StringMessage or a BytesMessage, well, that is
equivalent to having a point in your application where you might have
a string object or you might have a bytes object.  Which is to say,
if you end up there, then there is something wrong with your design.

> Now, it may be worthwhile to have specific normalization /
> debrokenifying methods which deal with specific types of corrupt data
> from the wire; encoding-guessing, replacement-character insertion or
> whatever else are fine things to try.  It may also be helpful to keep
> around a list of errors in the message, for inspection.  But as we know,
> there are lots of ways that MIME data can go bad other than encoding, so
> that's just one variety of error that we might want to keep around.

Yes.  email6 intends to extend the already existing error recovery
and diagnostics that the email module currently provides.

> (Looking at later messages as I'm about to post this, I think this all
> sounds pretty similar to Antoine's suggestions, with respect to keeping
> the implementation within a single class, and not having
> BytesMessage/UnicodeMessage at the same abstraction level.)

Forget about the implementation; let's just talk about the API.  The
two-class design came out of *API* thoughts; the implementation came second.

If I'm understanding you correctly, you'd prefer to have only one type of
Message object and one type of Header object visible at the API level.
Then, if you want to present the message to the user 'cat' fashion
you'd do:

    for line in mymsg.serialize_as_string():
        print(line, end='')

while when writing it to smtplib.SMTP.sendmail you'd do:

    smtpserver.sendmail(
        mymsg['from'].as_bytes(),
        [x.as_bytes() for x in itertools.chain(
            mymsg['to'], mymsg['cc'], mymsg['bcc'])],
        mymsg.serialize_as_bytes())

(I'm again ignoring the deficiencies of the current smtplib API.)  I can
see the appeal of that in that you don't have to think about whether the
object is bytes or string based at that point in your code.  You just
put your data type desire into the method name.  But it strikes me as
mostly being extra typing.  Kind of like having all strings in Python
represented internally as a bytes/encoding tuple, and having to name
the type you want in a method call every time you use one.

The two cases are not exactly parallel, yet I think they are parallel
enough that we're not completely crazy in what we are proposing.
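
The bytes/encoding-tuple analogy can be pictured with a small
hypothetical class.  Nothing like TupleString exists in Python or in any
email6 proposal; it is only a sketch of what "put your data type desire
into the method name" would mean for ordinary strings.

```python
class TupleString:
    """Hypothetical string stored as a (bytes, encoding) pair."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding

    def as_string(self) -> str:
        # Ask for characters, e.g. to show to a user.
        return self.data.decode(self.encoding)

    def as_bytes(self, encoding=None) -> bytes:
        # Ask for bytes, e.g. to write to a socket; transcode only if a
        # different encoding is requested.
        if encoding is None or encoding == self.encoding:
            return self.data
        return self.as_string().encode(encoding)


subject = TupleString("pöstal".encode("utf-8"), "utf-8")
shown = subject.as_string()            # every display needs a method call
sent = subject.as_bytes()              # and so does every write
sent_latin1 = subject.as_bytes("latin-1")
```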

But I *am* open to being convinced otherwise.  If everyone hates the
BytesMessage/StringMessage API design, then that should certainly not
be what we implement in email.

R. David Murray                                      www.bitdance.com
