[Email-SIG] Thoughts on the general API, and the Header API.

Sat Feb 20 06:50:38 CET 2010

On Fri, 19 Feb 2010 21:23:52 -0500, Barry Warsaw <barry at python.org> wrote:
> On Jan 25, 2010, at 03:10 PM, R. David Murray wrote:
> The one thing that I think is unwieldy is the signature of the serialize() and
> deserialize() methods.  I've been thinking about "policy" objects that can be
> used to control formatting and I think that perhaps substituting an API like
> this might work:
> 
> serialize(policy=None)
> deserialize(policy=None)

I love the idea of policy objects.  I'm clear on what they do for
serialization.  What do you visualize them doing for deserialization
(parsing)?

> I think this could be interesting for supporting output of the same message
> tree to different destinations.  E.g. if the message is being output directly
> to an SMTP server, you'd stick a policy object on there that had the RFC 5321
> required EOL, but you'd have a different policy object for output to a web
> server.

Yes, this was my intent in providing the newline and max_line_length
parameters, but a policy object is a much cleaner way to do that.
Especially since we can then provide premade policy objects to support
common output scenarios such as SMTP and HTTP.
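
As a rough sketch of the idea: everything named here (Policy,
SMTP_POLICY, HTTP_POLICY, and the module-level serialize function) is
hypothetical and only illustrates how the newline and max_line_length
parameters could move into one object. Line folding is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical bundle of formatting settings; not an existing API."""
    linesep: str = "\n"
    # None means "never fold"; folding itself is not implemented here.
    max_line_length: Optional[int] = 78


# Premade policies for common destinations (illustrative names).
SMTP_POLICY = Policy(linesep="\r\n")  # RFC 5321 requires CRLF
HTTP_POLICY = Policy(linesep="\r\n", max_line_length=None)


def serialize(headers, body, policy=None):
    """Render (name, value) header pairs and a body using the policy."""
    policy = policy or Policy()
    lines = [f"{name}: {value}" for name, value in headers]
    lines.append("")  # blank line separating headers from body
    lines.extend(body.splitlines())
    return policy.linesep.join(lines) + policy.linesep


wire = serialize([("Subject", "hi")], "body", policy=SMTP_POLICY)
local = serialize([("Subject", "hi")], "body")
```

The same message data yields CRLF line endings under SMTP_POLICY and
bare LF under the default policy, without changing the call signature.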

> >(Encoding or decoding a Message would cause the Message to recursively
> >encode or decode its subparts.  This means you are making a complete
> >new copy of the Message in memory.  If you don't want to do that you
> >can walk the Message and convert it piece by piece (we could provide a
> >generator that does this).)
> 
> It sounds like there's overlap between the encoding/decoding API and the
> serialize/deserialize API.  Are you thinking along those lines?  Differences
> in signature could be papered over with the policy objects.

No, I'm thinking of encode/decode as exactly parallel to encode/decode
on string/bytes.  In my prototype API, for example, StringHeader
values are unicode, and do *not* contain any RFC 2047 encoded words.
Decoding a BytesHeader decodes the RFC 2047 stuff.  Conversely, encoding
a StringHeader does the RFC 2047 encoding (using whatever charset you
specify or utf-8 by default).  (This means you lose the ability to piece
together headers from bits in different charsets, but what is the actual
use case for that?  And in any case, there will be a way to get at the
underlying header-translation machinery to do it if you really need to.)
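
A minimal sketch of that parallel, assuming toy StringHeader and
BytesHeader classes (these are stand-ins, not the prototype code in
the repository). The RFC 2047 work is delegated to the existing
email.header helpers, and raw header bytes are assumed to be ASCII.

```python
from email.header import Header, decode_header, make_header


class StringHeader:
    """Toy stand-in: value is str and holds no encoded words."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def encode(self, charset="utf-8"):
        # Header.encode() produces RFC 2047 encoded words as ASCII text.
        wire = Header(self.value, charset).encode()
        return BytesHeader(self.name, wire.encode("ascii"))


class BytesHeader:
    """Toy stand-in: value is bytes as found on the wire."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def decode(self):
        # Assumption: the raw header bytes are pure ASCII.
        text = self.value.decode("ascii")
        decoded = str(make_header(decode_header(text)))
        return StringHeader(self.name, decoded)


subject = BytesHeader("Subject", b"=?utf-8?q?Caf=C3=A9?=")
decoded_value = subject.decode().value
round_trip = StringHeader("Subject", "Grüße").encode().decode().value
```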

Serializing a StringHeader, in my design, produces *text* not bytes.
This is to support the use case of using the email package to manipulate
generic 'name:value // body' formatted data in unicode form (presumably
utf-8 on disk).

To get something that is RFC compliant, you have to encode the StringMessage
object (and thus the headers) to a BytesMessage object, and then
serialize that.  (That's where the incremental encoder may be needed.)

The advantage of doing it this way is we support all possible combinations
of input and output format via two strictly parallel interfaces and
their encode/decode methods.

Hmm.  It occurs to me now that another possible way to do this would be to
put the output data format into the policy object.  Then you could
serialize a StringMessage object, and it would know to do the string
to bytes conversion as it went along doing the serialization.
I don't think that would eliminate the need for encode/decode methods:
first, that's what serialize would use when converting for output,
and second, you will sometimes want to manipulate, e.g., individual
header values, and it seems like the natural way to do that is something like
this:

    mybytesmessage['subject'].decode().value

You don't want to serialize using a to-string policy object, because
what you want is the decoded value, and you can't do

    mybytesmessage['subject'].value.decode()

because a plain bytes decode doesn't do the RFC 2047 decoding....

Hmm.  Here's a thought: could we write an RFC 2047 codec?  Then we
could use that second, more Python-intuitive form like this:

    mybytesmessage['subject'].value.decode('mimeheader')

Well, looking at that I'm not sure it's better :(
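
For what it's worth, such a codec can be sketched with the standard
codecs registry. The codec name 'mimeheader' comes from the example
above; the helper functions are hypothetical, they assume ASCII input
bytes, and they always encode with utf-8.

```python
import codecs
from email.header import Header, decode_header, make_header


def _mimeheader_decode(data, errors="strict"):
    # Assumption: header bytes are ASCII with RFC 2047 encoded words.
    text = bytes(data).decode("ascii", errors)
    return str(make_header(decode_header(text))), len(data)


def _mimeheader_encode(text, errors="strict"):
    # Always uses utf-8 here; a real codec would need a charset knob.
    wire = Header(text, "utf-8").encode()
    return wire.encode("ascii"), len(text)


def _search(name):
    # Codec search functions return None for names they don't handle.
    if name == "mimeheader":
        return codecs.CodecInfo(
            name="mimeheader",
            encode=_mimeheader_encode,
            decode=_mimeheader_decode,
        )
    return None


codecs.register(_search)

subject_text = b"=?utf-8?q?Caf=C3=A9?=".decode("mimeheader")
subject_back = "Grüße".encode("mimeheader").decode("mimeheader")
```

One limitation shows up right away: the codec interface has no place
to pass the charset, which the encode() method form handles naturally.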

> >Subclasses of these classes for structured headers would have additional
> >methods that would return either specialized object types (datetimes,
> >address objects) or bytes/strings, and these may or may not exist in
> >both Bytes and String forms (that depends on the use cases, I think).
> 
> Is it crackful to think about the policy object also containing a MIME type
> registry for conversion to the specialized object types?

Oooh.  I *like* that idea.  I dislike global registries.  Like Glenn
says, this could make a lot of things safer threading-wise, and
certainly makes things more flexible.  I was worrying that there
might be a case of a complex app needing the registry to have
different states in different parts of the app, and while I don't
have an actual use-case in mind, this would make that a non-problem.
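
A small sketch of a registry that lives on the policy rather than in a
module global. The Policy class and its convert method are
hypothetical, and it is keyed on header names for brevity; only
email.utils.parsedate_to_datetime is an existing stdlib function.

```python
from email.utils import parsedate_to_datetime


class Policy:
    """Hypothetical policy carrying its own conversion registry."""

    def __init__(self, converters=None):
        # Copy the mapping so policies never share mutable state.
        self.converters = dict(converters or {})

    def convert(self, name, value):
        """Return a specialized object if this policy knows the header."""
        converter = self.converters.get(name.lower())
        return converter(value) if converter else value


# Two parts of one app can hold registries in different states.
structured = Policy({"date": parsedate_to_datetime})
plain = Policy()

raw_date = "Sat, 20 Feb 2010 06:50:38 +0100"
as_datetime = structured.convert("Date", raw_date)
as_string = plain.convert("Date", raw_date)
```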

> >So, those are my thoughts, and I'm sure I haven't thought of all the
> >corner cases.  The biggest question is, does it seem like this general
> >scheme is worth pursuing?
> 
> Definitely!  I think it's a great idea.

Thanks.  The repository (lp:python-email6) contains the beginnings
of the implementation of the StringHeader and BytesHeader classes.
I'm currently working on fleshing out the part where it says "this
is a temporary hack, need to handle folding encoded words", which is,
needless to say, a bit complicated... I may set that aside for a bit and
work on the policy object stuff.  Though I also need to put a bunch more
tests into the test database...

--David