[Email-SIG] Thoughts on the general API, and the Header API.
v+python at g.nevcal.com
Tue Jan 26 01:55:15 CET 2010
On approximately 1/25/2010 12:10 PM, came the following characters from
the keyboard of R. David Murray:
> So, those are my thoughts, and I'm sure I haven't thought of all the
> corner cases. The biggest question is, does it seem like this general
> scheme is worth pursuing?
Moving your last question to the front, yes. And of course, we do need
to think through most of the corner cases before absolutely committing
to this approach. But it sounds viable, and avoids an awful lot of
duplicate APIs, and would allow simple email clients to be written
primarily or even fully in bytes or primarily or even fully in strings.
A simple email client that is written fully in strings would "simply"
reject/bounce messages that cannot be decoded to strings. This is
simple; it works for 100% properly encoded messages; in an environment
where a client is coded to process messages from some generator, once
they are both debugged to the extent of generating messages that can be
consumed, then all is well, and no messages would be rejected. This
would not be an appropriate model for a general email server. Still, I'd
like to see a popular mailing list submission client that would bounce
messages that are improperly formed -- forcing contributors to use RFC
conformant clients, and thus encouraging the fixing of those clients that
are not RFC conformant -- but I'm not going to hold my breath.
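A minimal sketch of such a strings-only client, assuming the whole message must decode strictly as ASCII. The names RejectedMessage and decode_or_bounce are hypothetical, chosen only for illustration; they are not part of any proposed API.

```python
class RejectedMessage(Exception):
    """Hypothetical error: the message cannot be represented as text."""


def decode_or_bounce(raw: bytes, charset: str = "ascii") -> str:
    """Return the message as text, or raise RejectedMessage.

    Strict decoding is the whole point: a strings-only client does not
    try to repair bad input, it refuses it.
    """
    try:
        return raw.decode(charset, errors="strict")
    except UnicodeDecodeError as exc:
        # Report where decoding failed so the sender can be told why.
        raise RejectedMessage(f"undecodable byte at offset {exc.start}") from exc
```

A caller would catch RejectedMessage and generate the bounce; properly encoded messages pass through untouched.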
I think there can be enough power in an API designed in this manner to
allow the full nitty-gritty access as required.
I have some questions and concerns; I haven't thought through all of
them; perhaps some of them are corner cases, if so, they are corner
cases that are particularly interesting to me.
> OK, so we've agreed that we need to handle bytes and text at pretty
> much all API levels, and that the "original data" that informs the data
> structure can be either bytes or text. We want to be able to recover
> that original data, especially in the bytes case, but arguably also in
> the text case.
> Then there's also the issue of transforming a message once we have it in
> a data structure, and the consequent issue of what it means to serialize
> the resulting modified message. (This last comes up in a very specific
> way in issues 968430 and 1670765, which are about preserving the *exact*
> byte representation of a multipart/signed message).
> We've also agreed that whatever we decide to do with the __str__ and
> __bytes__ magic methods, they will be implemented in terms of other
> parts of the API. So I'll ignore those for now.
> I think we want to decide on a general API structure that is implemented
> at all levels and objects where it makes sense, this being the API
> for creating and accessing the following information about any part of
> the model:
> * create object from bytes
> * create object from text
> * obtain the defect list resulting from creating the object
> * serialize object to bytes
> * serialize object to text
> * obtain original input data
> * discover the type of the original input data
> At the moment I see no reason to change the API for defects (a .defects
> attribute on the object holding a list of defects), so I'm going to
> ignore that for now as well.
> I spent a bunch of time trying to define an API for Headers that provided
> methods for all of the above. As I was writing the descriptions for
> the various methods, and especially trying to specify the "correct"
> behavior for both the raw-data-is-bytes and raw-data-is-text cases
> (especially for the methods that serialize the data), the whole thing
> began to give off a bad code smell.
> After setting it aside for a bit, I had what I think is a little epiphany:
> our need is to deal with messages (and parts of messages) that could be
> in either bytes form or text form. The things we need to do with them
> are similar regardless of their form, and so we have been talking about a
> "dual API": one method for bytes and a parallel method for text.
> What if we recognize that we have two different data types, bytes messages
> and text messages? Then the "dual API" becomes a more uniform, almost
> single, API, but with two possible underlying data types.
> In the context specifically of the proposed new Header object, I propose
> that we have a StringHeader and a BytesHeader, and an API that looks
> something like this:
> raw_header (None unless from_full_header was used)
> __init__(name, value)
If it was stated, I missed it: is from_full_header a way of producing
an object from a raw data value? Whereas __init__ would obviously be
used to produce one from string or bytes values. If so, then it would
be a requirement that this from_full_header API would never produce an
exception? Rather it would produce an object with or without defects?
Are there any other *Header APIs that would be required not to produce
exceptions? I don't yet perceive any.
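To make the question concrete, here is a rough sketch of a from_full_header that never raises and records problems as defects instead. Everything in it is an assumption for illustration: the parsing rule, the defect strings, and the choice to take text rather than bytes as the raw input.

```python
class StringHeader:
    """Hypothetical stand-in for the proposed StringHeader."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.raw_header = None   # set only by from_full_header
        self.defects = []        # problems found while parsing

    @classmethod
    def from_full_header(cls, raw):
        """Build a header from raw input; never raise on malformed data."""
        defects = []
        name, sep, value = raw.partition(":")
        if not sep:
            # No colon at all: keep the data, but flag it.
            defects.append("missing colon between name and value")
            name, value = "", raw
        header = cls(name.strip(), value.strip())
        header.raw_header = raw
        header.defects = defects
        return header
```

With this shape, from_full_header("no colon here") returns an object with one defect and the original input preserved in raw_header, rather than raising.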
The "charset" parameter... is that not mostly needed for data parts?
Headers are either ASCII, or contain self-describing charset info.
I guess I could see an intermediate decode from string to some charset,
before serialization, as a hint that when generating headers, that all
the characters in the header that are not ASCII are in the specified
charset... and that that charset is the one to be used in the
self-describing serialized ASCII stream? The full generality of the RFC
allows pieces of headers to be encoded using different charsets... with
this API, it would seem that a header could only be created containing one
charset... unless the serialization primitives were made available, so that
piecewise construction of a header value could be done with different
charsets, and then the from_full_header API used to create the complex
value. I don't see this as a severe limitation; I just want to
understand your intention, and document the limitation, or my
misunderstanding.
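For comparison, the existing email.header.Header class in the standard library already allows piecewise construction with different charsets. The sketch below uses that existing class, not the proposed StringHeader/BytesHeader; the sample strings and the function name build_mixed_header are made up for illustration.

```python
from email.header import Header, decode_header


def build_mixed_header() -> str:
    """Build one header value from two pieces in different charsets."""
    header = Header()
    header.append("Grüße", charset="iso-8859-1")
    header.append("Привет", charset="koi8-r")
    # encode() produces RFC 2047 encoded words, one per charset here.
    return header.encode()


wire = build_mixed_header()
# decode_header returns (bytes, charset) pairs recovered from the wire form.
parts = decode_header(wire)
```

Each encoded word carries its own charset label, which is what makes the mixed value self-describing on the wire.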
> BytesHeader would be exactly the same, with the exception of the signature
> for serialize and the fact that it has a 'decode' method rather than an
> 'encode' method. Serialize would be different only in the fact that
> it would have an additional keyword parameter, must_be_7bit=True.
I am not clear on why StringHeader's serialize would not need the
must_be_7bit parameter... or do I misunderstand that
StringHeader.serialize produces wire-format data?
> The magic of this approach is in those encode/decode methods.
> Encoding a StringHeader would yield a BytesHeader containing the same
> data, but encoded per RFC2047 using the specified charset. Decoding a
> BytesHeader would yield a StringHeader with the same data, but decoded to
> unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
> not the RFC2047 sense) using the specified charset (which would default to
> ASCII, meaning bare 8bit bytes in headers would throw an error). (What to
> do with RFC2047 charsets like unknown-8bit is an open question...probably
> throw an error).
Would the encoding to/from StringHeader/BytesHeader preserve the
from_full_header state and value?
> (Encoding or decoding a Message would cause the Message to recursively
> encode or decode its subparts. This means you are making a complete
> new copy of the Message in memory. If you don't want to do that you
> can walk the Message and convert it piece by piece (we could provide a
> generator that does this).)
Walking it piece by piece would allow the old pieces to be discarded, to
save total memory consumption, where that is appropriate.
Perhaps one generator that would be commonly used, would be to convert
headers only, and leave MIME data parts alone, accessing and converting
them only with the registered methods? This would mean that a "complete
copy" wouldn't generally be very big, if the data parts were excluded
from implicit conversion. Perhaps the "external storage protocol" might
also only be defined for MIME data parts, and walking the tree with this
generator would not need to reference the MIME data parts, nor bring
them in from "external storage".
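A sketch of what such a headers-only generator might look like, written against the current standard library email package rather than the proposed API. The name iter_decoded_headers is hypothetical; the point is only that the payloads are never touched.

```python
import email
from email.header import decode_header, make_header


def iter_decoded_headers(msg):
    """Yield (part, name, decoded_value) for every header in the tree.

    Only headers are read and decoded; get_payload() is never called,
    so the data parts are left alone.
    """
    for part in msg.walk():
        for name, value in part.items():
            yield part, name, str(make_header(decode_header(value)))


RAW = (
    "Subject: =?utf-8?q?caf=C3=A9?=\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "body one\n"
    "--XX--\n"
)
msg = email.message_from_string(RAW)
headers = [(name, text) for _part, name, text in iter_decoded_headers(msg)]
```

Because it is a generator, a caller can process and drop each header as it goes instead of holding a second full copy of the message.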
> raw_header would be the data passed in to the constructor if
> from_full_header is used, and None otherwise. If encode/decode call
> the regular constructor, then this attribute would also act as a flag
> as to whether or not the header was constructed from raw input data
> or via program.
This _implies_ that from_full_header always accepts raw data bytes...
even for the StringHeader. And that implies the need for an implicit
decode, and therefore, perhaps a charset parameter? No, not a charset
parameter, since the charsets are explicitly contained in the header values.
Decode for header values may not need a charset value at all!
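The existing standard library shows this for properly encoded headers: decode_header and make_header take no charset argument, because each RFC 2047 encoded word names its own. The helper name decode_header_value is hypothetical, and this does not cover bare 8bit bytes, which carry no charset label.

```python
from email.header import decode_header, make_header


def decode_header_value(value: str) -> str:
    """Decode RFC 2047 encoded words using the charsets they declare."""
    return str(make_header(decode_header(value)))


utf8_text = decode_header_value("=?utf-8?q?caf=C3=A9?=")
latin1_text = decode_header_value("=?iso-8859-1?q?cr=E8me?=")
plain_text = decode_header_value("hello")
```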
No comments for the rest.
Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking