[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Fri Oct 9 20:59:25 CEST 2009
On approximately 10/9/2009 5:05 AM, came the following characters from
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote:
>> 1) wire format. Either what came in, in the parser case, or what
>> would be generated.
>> 2) internal headers from the MIME part
>> 3) decoded BLOB. This means that quopri and base64 are decoded, no
>> more and no less. This is bytes. No headers, only payload. For
>> Content-Transfer-Encoding: binary, this is mostly a noop.
>> 4) text/* parts should also be obtainable as str()/unicode(), payload
>> only. This is where charset decoding is done.
>>
>> I think your talk in the next paragraph about hooks and other object
>> types being produced is a generalization of 4, not 3, and generally
>> no additional decoding needs to be done, just conversion to the right
>> object type (or file, or file-like object).
> I mostly agree with that. I've always called #4 the "decoded payload"
> and #3 I've usually called the "raw payload". Maybe we can bikeshed
> on better terms to help inform us about the API's method/attribute names.
It would be good though to have standardized terms for easier
communication. Maybe as they are chosen, they could be added to that
Wiki RDM set up?
My only problem with "raw" and "decoded" payload is that there are 3
payload formats, not 2, so a 3rd term is needed: one each for #1, #3,
and #4, above. #2 is somewhat orthogonal to the payload. To me, "raw"
conjures up #1, not #3.
If Content-Transfer-Encoding is 7bit, 8bit, or binary, then the payload
in #3 is the same as the payload in #1; it is just a terminology change.
Only for Content-Transfer-Encoding of quoted-printable or base64 must
work be done to convert from #1 to #3.
If Content-Type is text/*, then the transformation from #3 to #4 is more
than a cast, because charset decoding is required, but for many other
formats, it is mostly a cast.
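To make the three payload formats concrete, here is a minimal sketch
using the get_payload() flag that exists today. The sample message and
the variable names are my own illustration, and the charset step for #4
is done by hand here rather than by any existing API.

```python
from email import message_from_bytes

# A tiny single-part message; the base64 body is the UTF-8 text "café".
wire = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"Y2Fmw6k=\r\n"
)
msg = message_from_bytes(wire)

# #1: the payload as it appeared on the wire (still base64 encoded).
wire_payload = msg.get_payload()

# #3: the decoded BLOB; only the Content-Transfer-Encoding is undone.
blob = msg.get_payload(decode=True)

# #4: text/* only; charset decoding turns the bytes into str.
charset = msg.get_content_charset("ascii")
text = blob.decode(charset)
```

For a 7bit, 8bit, or binary part the second step would hand back the
same bytes that were on the wire.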
> Which brings up another point: right now Message objects have a single
> .get_payload() method that takes a flag to indicate whether it should
> be the decoded or raw payload. That's bong. These should be
> different interfaces.
Separate APIs would be clearer, but for compatibility, should
.get_payload() be retained, with the flag? Fortunately, there is only
one result value in any case, so it is just a matter of what the type of
that output value is, and how it should be handled.
Perhaps the flag parameter should be extended to allow retrieval of all
three payload formats instead of only two?
.get_payload could be converted to call the appropriate specific APIs,
should it be desired to invent separate APIs for each payload format.
>>> The problem is that if the bytes came off the wire, the parser
>>> currently can only attach the most basic MIME base class. It
>>> doesn't know that an image/png should create a MIMEImagePNG instance
>>> there. This is different from hacking the model directly because
>>> the application can instantiate the right class. So the parser
>>> either has to have a hookable way for an application to go from
>>> content-type to class, or the generic MIME base class needs to be
>>> hookable in its .decode() method.
>>
>> So either the email package can stop at 3, and 4 only for text/*
>> parts, or it could learn more types (registered types, with
>> well-defined corresponding objects could be potentially built-in to
>> the email package), and/or it could become hookable for application
>> types. Of course, for disposition to files, storing the BLOB in a
>> file of the right name is adequate... to avoid the file, I agree that
>> converting to a useful object type is handy. But maybe file-like
>> objects would suffice, for most of the types.
>
> My own preferences here is that email does support #4 with a
> registration system to handle returning concrete payload objects based
> on the Content-Type.
Sure, a registration system is fine. It could work for any type that
has a method that can be registered, that accepts a binary BLOB and
returns an appropriately typed, functioning object that can manipulate
that type. That would mean that the application would have to make all
the registration calls up front, instead of making the API calls when
the objects are retrieved. Basically, if the email package doesn't have
a registration system that the application can use, the application has
to invent its own, so this is work that could benefit all applications.
I suppose the default registration for text/* would be to convert from
whatever charset to Unicode, and the default registration for all other
Content-Types would be to pass back bytes(). The package might also
provide defaults for a few other common types for which specific Python
types are available, perhaps some specific image/* types that have MIME
types defined for them. People may still prefer to register, say, a PIL
type for images, so I agree the email package should only provide
default registrations. On the other hand, I'm not sure how the
registration system should work with threads, if different threads want
different registrations...
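A rough sketch of such a registry, under my own assumptions: the
ConverterRegistry class, its method names, and the "text/*" and "*/*"
wildcard keys are all hypothetical. Making the registry an object
instead of module-level state is one possible answer to the thread
question, since each thread or parser could hold its own instance.

```python
class ConverterRegistry:
    """Hypothetical map from Content-Type to a BLOB -> object callable."""

    def __init__(self):
        self._converters = {}

    def register(self, content_type, func):
        # func takes (blob, charset) and returns the application object.
        self._converters[content_type.lower()] = func

    def lookup(self, content_type):
        content_type = content_type.lower()
        maintype = content_type.split("/", 1)[0]
        # Most specific registration wins: exact, then maintype/*, then */*.
        for key in (content_type, maintype + "/*", "*/*"):
            if key in self._converters:
                return self._converters[key]
        raise LookupError("no converter for %r" % (content_type,))

    def convert(self, content_type, blob, charset="ascii"):
        return self.lookup(content_type)(blob, charset)


def default_registry():
    """Defaults described above: text/* to str, everything else to bytes."""
    reg = ConverterRegistry()
    reg.register("text/*", lambda blob, charset: blob.decode(charset))
    reg.register("*/*", lambda blob, charset: bytes(blob))
    return reg


reg = default_registry()
as_text = reg.convert("text/plain", b"hello", "ascii")
as_bytes = reg.convert("image/png", b"\x89PNG")

# An application override; the tuple stands in for a real image object.
reg.register("image/png", lambda blob, charset: ("png-object", len(blob)))
as_object = reg.convert("image/png", b"\x89PNG")
```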
Actually, although it is not common practice to have encodings other
than the RFC defined base64 and quoted-printable, a registration system
for converting from #1 to #3, with appropriate defaults for base64,
quoted-printable, binary, 7bit, 8bit, would be appropriate, and would
provide a framework for allowing easy extensions to the encodings.
Future mail RFCs may define some, but more likely, applications that
wish to use email transports, where both ends are application
controlled, might wish to define other encodings... the RFCs do allow
for x-* encodings that are user defined. If a registration system is
created for #3 to #4 conversions, the same mechanism could likely be used
for the registration system for #1 to #3 encodings, so there would be
added flexibility at very little cost.
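As a sketch of the #1 to #3 side, assuming a simple table keyed by the
Content-Transfer-Encoding value: the CTE_DECODERS table, register_cte(),
decode_cte(), and the "x-hex" encoding are names I made up for
illustration. The base64 and quopri calls are the standard library ones.

```python
import base64
import binascii
import quopri


def _identity(data):
    # 7bit, 8bit, and binary need no transformation.
    return data


# Hypothetical default table: encoding name -> bytes-to-bytes decoder.
CTE_DECODERS = {
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
    "7bit": _identity,
    "8bit": _identity,
    "binary": _identity,
}


def register_cte(name, decoder):
    """Add or replace a decoder, e.g. for an application-defined x-* encoding."""
    CTE_DECODERS[name.lower()] = decoder


def decode_cte(name, data):
    """Convert a wire-format payload (#1) into the decoded BLOB (#3)."""
    try:
        decoder = CTE_DECODERS[name.lower()]
    except KeyError:
        raise LookupError("unknown Content-Transfer-Encoding: %r" % (name,))
    return decoder(data)


# An application-controlled encoding where both ends agree on hex digits.
register_cte("x-hex", binascii.unhexlify)
```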
> I also think that the email package probably should not implement
> "store-payloads-on-disk" by default, although it may provide some
> example implementations for simple applications (much the same way
> there's wsgiref for simple applications).
Thinking about this, I agree that storing payloads on disk should not be
the default action. However, if an application wants to control its
memory consumption, the receipt of a large email could negatively impact
that desire. It might be appropriate to place individual MIME parts on
disk, as they are parsed, if the application indicates a threshold part
size and/or threshold aggregate size, beyond which parts should be
placed in an on-disk cache. Along with that, the temporary storage
location in which to place them would have to be configured.
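The per-part threshold could be sketched with the standard library's
tempfile.SpooledTemporaryFile, which keeps data in memory until it grows
past max_size and then moves it to a file in the given directory. The
store_part() helper and its parameter names are hypothetical, and an
aggregate threshold across parts is not covered here.

```python
import tempfile


def store_part(blob, threshold=64 * 1024, cache_dir=None):
    """Hold a decoded part in memory, spilling to disk past the threshold.

    threshold is the per-part size limit in bytes; cache_dir is the
    configured temporary storage location (None means the system default).
    """
    buf = tempfile.SpooledTemporaryFile(max_size=threshold, dir=cache_dir)
    buf.write(blob)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf


small_part = store_part(b"tiny", threshold=16)      # stays in memory
large_part = store_part(b"x" * 100, threshold=16)   # spilled to disk
```

The caller reads both objects the same way, so the application does not
need to know which parts ended up on disk.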
> Still, that's different than say, storing attachments in a file
> named by the Content-Disposition header's filename parameter. That
> latter is firmly in the domain of the application.
I again agree that this should not be the default action, but I assume
that an API should be provided such that an application could tell the
email package to place the content in a file named by the header's
filename parameter.
If such an API doesn't already exist, it seems it would be a helpful
extension, and if the part was already cached on disk because of the
above thresholds, the email package could possibly use rename instead of
file copy to achieve the goal.
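A minimal sketch of that rename-or-copy step, assuming the part is
already cached at some path on disk. The save_part() function and its
arguments are hypothetical. Stripping directory components from the
filename is my own addition, since the parameter comes from the sender.

```python
import os
import shutil


def save_part(cached_path, filename, dest_dir):
    """Move a disk-cached part to dest_dir under the header's filename."""
    # Keep only the final path component of the sender-supplied name.
    safe_name = os.path.basename(filename.replace("\\", "/"))
    if safe_name in ("", ".", ".."):
        raise ValueError("unusable filename: %r" % (filename,))
    target = os.path.join(dest_dir, safe_name)
    try:
        # Cheap rename when cache and destination share a filesystem.
        os.replace(cached_path, target)
    except OSError:
        # Otherwise fall back to copying the data and removing the original.
        shutil.move(cached_path, target)
    return target
```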
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking