[Email-SIG] API thoughts

Glenn Linderman v+python at g.nevcal.com
Tue Mar 1 22:58:50 CET 2011


On 3/1/2011 12:40 PM, R. David Murray wrote:
> This is a long email, for which my apologies.  I hope you all will
> manage to find some time to read it and provide feedback, as it speaks
> to fundamental design issues.

Indeed.  Good to discuss before designing with ready-mix.

> Everything else is an implementation detail :)

Agreed.

> We propose to create a new API to make all of this easier for
> the application programmer.

YES!!

> [*] There are current real-world use cases for this:  there are nntp
>      servers that use utf-8 for headers, and the http protocol uses
>      latin-1 (or sometimes, I think, utf-8)

All the tunables listed are relevant.  The HTTP protocol standard claims 
to use Latin-1 + RFC 2047 encoding for non-Latin-1 characters; in 
practice, the browser implementations apparently use nearly _any_ 
encoding for headers!!!  For <form> responses, when there is actually 
user-specified data involved, they use the encoding defined for the page 
containing the form, as the encoding of the MIME headers sent back.  The 
"standard headers" seem to be ASCII, and somewhat immune to choice of 
encoding, except perhaps for those few encodings that are not ASCII 
supersets. (I have no clue how such are handled, if they are.  Anyone 
want to write an EBCDIC page containing a <form> for testing?)

This is useful, as it reduces the amount of character escaping likely to 
be required: the designer of the page chooses a character set that can 
represent the page and is likely in the language of the intended 
recipient, who is likely to fill out the form using the same language.

It would be more useful if the browsers included a(n ASCII) header that 
specified the encoding of subsequent headers, but they do not.  Therefore, 
the server that receives the headers must somehow "know" the proper 
encoding.  For the situation where the CGI (or equivalent) script both 
generates the page containing the <form> and receives the form data, 
this is simple.  For the situation where the same web application 
designer creates the page containing the <form> and the CGI receiving 
the form data, and explicitly or implicitly declares the same encoding 
for both, this is functional, but there is the danger of someone 
changing the static pages to conform to a new standard encoding without 
realizing the consequences for the associated CGI scripts.  It is also 
rather hard to create "form filling" applications that send form 
data to a server without first fetching the form itself... such 
applications must also "know" the proper encoding, and they 
are much more likely to be written outside the original 
development environment, and much less likely to be included in any 
planning to change encodings inside the application's <form>s and CGIs.

To support reading byte-stream HTTP headers, therefore, it is critical 
that the email API accept an encoding from the application which "knows" 
the encoding; presently cgi.py has to pre-decode incoming headers 
because email does not have such a parameter.  On the other hand, maybe 
cgi.py shouldn't use email header parsing at all... since browsers don't 
use RFC 2047 encoding in practice, parsing headers without it 
is straightforward.
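
To make "the application knows the encoding" concrete, here is a rough 
sketch of header parsing that takes the encoding as a parameter.  The 
function name parse_headers and its signature are my own invention for 
illustration; this is not an existing or proposed email API, and it 
deliberately does no RFC 2047 decoding.

```python
def parse_headers(raw, encoding="latin-1"):
    """Split a raw header block (bytes) into (name, value) string pairs.

    Hypothetical helper: 'encoding' is supplied by the application,
    which is assumed to know what the sender used.
    """
    headers = []
    for line in raw.split(b"\r\n"):
        if not line:
            break  # a blank line ends the header block
        if line[:1] in (b" ", b"\t") and headers:
            # Continuation line: fold it into the previous header value.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip().decode(encoding))
            continue
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # skip lines that are not "name: value"
        # Header names are assumed to be ASCII; only values use 'encoding'.
        headers.append((name.decode("ascii").strip(),
                        value.strip().decode(encoding)))
    return headers
```

A CGI script that served its form as UTF-8 would call 
parse_headers(raw, "utf-8"); one that knows nothing better could fall 
back to the latin-1 default.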

Further, HTTP data streams can be extremely large, and thus 
time-consuming to obtain over the wire.  CGI applications cannot afford 
to keep large blocks of data in RAM during receipt, thus if email wishes 
to support CGI, it needs features for placing large blocks of data on 
disk instead of in RAM during the parsing phase; cgi.py presently has to 
preparse headers, to separate them from the data streams, which it then 
handles on its own, because of this issue.
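
As a rough illustration of the kind of feature I mean (not what cgi.py 
or email actually does), the standard library's 
tempfile.SpooledTemporaryFile keeps small payloads in RAM and rolls over 
to disk past a threshold.  The helper name spool_body and the threshold 
and chunk sizes are assumptions chosen for the example.

```python
import io
import tempfile


def spool_body(stream, max_in_memory=64 * 1024, chunk_size=8192):
    """Copy a body from 'stream' into a spooled temporary file.

    Data stays in memory until it exceeds max_in_memory bytes, after
    which SpooledTemporaryFile moves it to a real file on disk.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    while True:
        chunk = stream.read(chunk_size)  # never hold more than one chunk
        if not chunk:
            break
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


# Usage: a 100,000-byte body with a 1 KB threshold is spooled to disk.
body = spool_body(io.BytesIO(b"x" * 100_000), max_in_memory=1024)
```

The parser would hand the application a file-like object either way, so 
the caller need not care whether the data ended up in RAM or on disk.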

Hence, cgi.py does so much preparsing and private handling of HTTP 
data streams that the only real benefit it seems to gain from 
using email at all is the handling of the complex RFC 2047 decoding... 
which in practice isn't used in HTTP data streams!

In any case, if email wants to promulgate itself as the "one true way" 
to process HTTP data streams, as well as SMTP and NNTP data streams, 
then it needs to address the issues above.

There is, by the way, room for improvement in the cgi.py handler for 
HTTP data streams; presently all large MIME objects are written to disk 
(but small ones are kept as string or byte streams), but it isn't 
necessarily the right disk, and the data must then be again copied, byte 
by byte, to its final file system location.  I see that as abhorrent 
overhead.  There is presently no provision for hooks that ask the CGI 
application what to do with the data being received, while it is being 
received, nor for policies to assist with better heuristics, with the 
goal in mind that a properly and completely received MIME object could 
then be renamed to its final location rather than copied.
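
A minimal sketch of the rename-instead-of-copy idea, assuming the 
application can tell the receiver the final path up front (the hook 
that does not exist today).  The function name receive_to is 
hypothetical; the point is only that the temporary file is created in 
the destination directory, so the final step is a rename on the same 
file system rather than a second copy.

```python
import os
import tempfile


def receive_to(stream, final_path, chunk_size=8192):
    """Write 'stream' to a temp file beside final_path, then rename it."""
    dest_dir = os.path.dirname(os.path.abspath(final_path))
    # Same directory as the target, hence the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".upload-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                tmp.write(chunk)
        # Only a completely received object reaches its final name.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Incomplete upload: remove the partial file and re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A policy object could supply final_path (or just the directory) per 
MIME part; that part is left out here because its shape is exactly what 
would need designing.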

> I guess I'm proposing, then, that there be an API version definition,
> with two values as of Python3.3: email5 API, and email6 API.  We'll
> figure out how we name and interrogate these formally later.

Question: While it is pretty clear that enhanced behaviors are required 
to benefit new applications that use email, and while some new APIs may 
be incompatible with some existing APIs, might it be possible to design 
the new API, and then build a compatibility layer that looks like the 
old API on top?  That is, could there be policies for the new APIs that 
make them behave like the old APIs, to ease the implementation of such a 
layer?  I'm not sure I fully understand the use of _factory or factory 
parameters, but for APIs that have _factory and grow a factory, could 
not the choice of which parameter is passed select the variant behavior?

(OK, this question comes after not looking at the email API during all 
the GSOC and your implementation efforts since the last big round of 
discussion, but your proposals here make it sound more feasible 
with your current thinking than with your previous thinking.)

> The Header registry in this vision is accessed through the Message class.
> I have various thoughts about how this will work, but I'm going to leave
> those for later, since this email is already long enough.  I also have
> some additional thoughts about backward compatibility, but it is going
> to require some experimentation to see if they are realistic.

Consider me an interested observer; I'll enjoy reading, thinking, and 
commenting about these ideas too, but sadly am unlikely to implement an 
email client this year :(  But I have aspirations to do so, because none 
of the existing email clients exactly suit my preferences... (everyone 
should write an editor and an email client, no?  I've done the former 
several times... what I want, though, is emacs-python, instead of 
emacs-lisp).

Glenn