[Email-SIG] Thoughts on the general API, and the Header API.

Glenn Linderman v+python at g.nevcal.com
Mon Feb 1 20:06:34 CET 2010


Another thought occurred to me regarding this "Access API"... an IMAP 
implementation could defer obtaining data parts from the server until 
requested, under the covers of this same API.  Of course, for devices 
with limited resources, that would probably be the optimal approach, but 
for devices with lots of resources, an IMAP implementation might also 
want to offer other options.
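
A rough sketch of that deferral, not tied to any real IMAP client: the
LazyPart class and the fetch callable are hypothetical names, and
fake_fetch stands in for a server round trip.

```python
from typing import Callable, Optional


class LazyPart:
    """A message part whose payload is fetched only on first access."""

    def __init__(self, part_id: str, fetch: Callable[[str], bytes]):
        self.part_id = part_id
        self._fetch = fetch            # hypothetical: would wrap an IMAP FETCH
        self._payload: Optional[bytes] = None

    @property
    def payload(self) -> bytes:
        if self._payload is None:      # first access: go to the server
            self._payload = self._fetch(self.part_id)
        return self._payload


# Stand-in for a server round trip; records each call it receives.
calls = []


def fake_fetch(part_id: str) -> bytes:
    calls.append(part_id)
    return b"payload of part " + part_id.encode("ascii")


part = LazyPart("2", fake_fetch)
assert calls == []                     # nothing fetched at construction
data = part.payload                    # triggers exactly one fetch
data_again = part.payload              # served from the cached copy
```

A device with plenty of resources could instead prefetch every part at
construction time behind the same payload attribute.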


On approximately 1/28/2010 6:20 PM, came the following characters from 
the keyboard of Glenn Linderman:
> On approximately 1/25/2010 8:10 PM, came the following characters from 
> the keyboard of Glenn Linderman:
>>> That's true.  The Bytes and String versions of binary MIME parts,
>>> which are likely to be the large ones, will probably have a common
>>> representation for the payload, and could potentially point to the same
>>> object.  That breaking of the expectation that 'encode' and 'decode'
>>> return new objects (in analogy to how encode and decode of
>>> strings/bytes works) might not be a good thing, though.
>>
>> Well, one generator could provide the expectation that everything is 
>> new; another could provide different expectations.  The differences 
>> between them, and the tradeoffs would be documented, of course, were 
>> both provided.  I'm not convinced that treating headers and data 
>> exactly the same at all times is a good thing... a convenient option 
>> at times, perhaps, but I can see it as a serious inefficiency in many 
>> use cases involving large data.
>>
>> This deserves a bit more thought/analysis/discussion, perhaps.  More 
>> than I have time for tonight, but I may reply again, perhaps after 
>> others have responded, if they do. 
>
> I guess no one else is responding here at the moment.  Read the ideas 
> below, and then afterward, consider building the APIs you've suggested 
> on top of them.  And then, with the full knowledge that the messages 
> may be either in fast or slow storage, I think that you'll agree that 
> converting the whole tree in one swoop isn't always appropriate... all 
> headers probably could be converted at once.  Data, because of its 
> size, should probably be converted on demand.
>
>
> In earlier discussions about the registry, there was the idea of 
> having a registry for transport encoding handling, and a registry for 
> MIME encoding handling.  There were also vague comments about doing an 
> external storage protocol "somehow", but it was a vague concept to be 
> defined later, or at least I don't recall any definitions.
>
> Given a raw bytes representation of an incoming email, mail servers 
> need to choose how to handle it... this may need to be a dynamic 
> choice based on current server load, as well as the obvious static 
> server resources, as well as configured limits.
>
> Unfortunately, the SMTP protocol does not require predeclaration of 
> the size of the incoming DATA part, so servers cannot enforce size 
> limits until they are exceeded.  So as the data streams in, a dynamic 
> adjustment to the handling strategy might be appropriate.  Gateways 
> may choose to route messages, and stall the input until the output 
> channel is ready to receive it, and basically "pass through" the data, 
> with limited need to buffer messages on disk... unless the output 
> channel doesn't respond... then they might reject the message.  An 
> SMTP server should be willing to act as a store-and-forward server, 
> and also must do individual delivery of messages to each RCPT (or at 
> least one per destination domain), so must have a way of dealing with 
> large messages, probably via disk buffering.  The case of disk 
> buffering and retrying generally means that the whole message, not 
> just the large data parts, must be stored on disk, so the external 
> storage protocol should be able to deal with that case.
>
> The minimal external storage format capability is to store the 
> received bytestream to disk, associate it with the envelope 
> information, and be able to retrieve it in whole later.  This would 
> require having the whole thing in RAM at those two points in time, 
> however, and doesn't solve the real problem.  Incremental writing and 
> reading to the external storage would be much more useful.  Even more 
> useful would be "partially parsed" seek points.
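
One way to picture the incremental write/read idea, as a sketch only:
the standard library's tempfile.SpooledTemporaryFile keeps data in
memory until it passes max_size and then moves it to disk.  The function
names and the thresholds here are made up for illustration.

```python
import tempfile


def spool_incoming(chunks, max_in_memory=64 * 1024):
    """Write an incoming byte stream incrementally; spill to disk if large."""
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        spool.write(chunk)             # never holds more than one chunk extra
    spool.seek(0)                      # rewind so the caller can read it back
    return spool


def read_incrementally(spool, bufsize=8192):
    """Yield the stored message one buffer at a time."""
    while True:
        buf = spool.read(bufsize)
        if not buf:
            break
        yield buf


# Simulated DATA stream: small headers followed by a large body.
incoming = [b"Subject: test\r\n", b"\r\n", b"x" * 100_000]
spool = spool_incoming(incoming, max_in_memory=1024)
total = sum(len(buf) for buf in read_incrementally(spool))
spool.close()
```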
>
> An external storage system that provides "partially parsed" 
> information could include:
>
> 1) envelope information.  This section is useful to SMTP servers, but 
> not other email tools, so should be optional.  This could be a copy of 
> the received RCPT command texts, complete with CRLF endings.
>
> 2) header information.  This would be everything between DATA and the 
> first CRLF CRLF sequence.
>
> 3) data.  Pre-MIME this would simply be the rest of the message, but 
> post-MIME it would be usefully more complex.  If MIME headers can be 
> observed and parsed as the data passes through, then additional 
> metadata could be saved that could enhance performance of the later 
> processing steps.  Such additional metadata could include the 
> beginning of each MIME part, the end of the headers for that part, and 
> the end of the data for that part.
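
A minimal sketch of recording those seek points, under loud
assumptions: CRLF line endings, a single-level multipart (no nesting),
no empty parts, and the whole message in one bytes object for brevity.
index_message is a hypothetical helper, not part of the email package;
only BytesParser and get_boundary come from the standard library.

```python
from email.parser import BytesParser


def index_message(raw: bytes):
    """Return (header_end, parts) for a single-level multipart message.

    header_end is the offset just past the first CRLF CRLF.  Each entry
    in parts is (part_start, part_header_end, part_end).
    """
    header_end = raw.index(b"\r\n\r\n") + 4
    headers = BytesParser().parsebytes(raw[:header_end], headersonly=True)
    boundary = headers.get_boundary()
    if boundary is None:
        return header_end, []          # pre-MIME or non-multipart message
    delim = b"\r\n--" + boundary.encode("ascii")
    parts = []
    # Start at the CRLF that closes the headers, in case there is no preamble.
    pos = raw.find(delim, header_end - 2)
    while pos != -1:
        after = pos + len(delim)
        if raw.startswith(b"--", after):
            break                      # closing delimiter: no more parts
        start = raw.index(b"\r\n", after) + 2
        nxt = raw.find(delim, start)
        if nxt == -1:
            break                      # truncated message
        blank = raw.find(b"\r\n\r\n", start, nxt)
        part_header_end = blank + 4 if blank != -1 else nxt
        parts.append((start, part_header_end, nxt))
        pos = nxt
    return header_end, parts


raw = (
    b'Content-Type: multipart/mixed; boundary="XX"\r\n'
    b"\r\n"
    b"preamble\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--XX\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"\r\n"
    b"AAAA\r\n"
    b"--XX--\r\n"
)
header_end, parts = index_message(raw)
bodies = [raw[hdr_end:end] for (_start, hdr_end, end) in parts]
```

A real implementation would build the same table while the data streams
in, rather than after the fact.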
>
> The result of saving that information would mean that minimal data 
> (just headers) would need to be read in to create a tree representing the 
> email, the rest could be left in external storage until it is 
> accessed... and then obtained directly from there when needed, and 
> converted to the form required by the request... either the whole 
> part, or some piece in a buffer.
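
Roughly, a tree node for a data part would then only need to remember
where its bytes live.  A sketch, assuming the offsets were saved
earlier; StoredPart is a hypothetical name, and the demo computes the
offsets directly instead of reading saved metadata.

```python
import os
import tempfile


class StoredPart:
    """Payload of one part, left on disk until someone asks for it."""

    def __init__(self, path, start, end):
        self.path, self.start, self.end = path, start, end

    def read(self):
        """Return the whole raw payload."""
        with open(self.path, "rb") as f:
            f.seek(self.start)
            return f.read(self.end - self.start)

    def iter_chunks(self, bufsize=8192):
        """Yield the raw payload one buffer at a time."""
        with open(self.path, "rb") as f:
            f.seek(self.start)
            remaining = self.end - self.start
            while remaining > 0:
                buf = f.read(min(bufsize, remaining))
                if not buf:
                    break              # file shorter than the saved offsets
                remaining -= len(buf)
                yield buf


raw = b"Subject: demo\r\n\r\n" + b"0123456789" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
body_start = raw.index(b"\r\n\r\n") + 4
part = StoredPart(tmp.name, body_start, len(raw))
chunks = list(part.iter_chunks(4096))  # piece-in-a-buffer access
whole = part.read()                    # whole-part access
os.remove(tmp.name)
```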
>
> So there could be a variety of external storage systems... one that 
> stores in memory, one that stores on disk per the ideas above, and a 
> variety that retain some amount of cached information about the email, 
> even though they store it all on disk.  Sounds like this could be a 
> plug-in, or an attribute of a message object creation.
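
As a sketch of what such a plug-in might look like (every name here is
hypothetical, nothing like it exists in the email package): a small
storage interface with an in-memory and an on-disk implementation, and
a consumer that works with either.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class MessageStore(ABC):
    """Hypothetical storage plug-in interface."""

    @abstractmethod
    def append(self, data: bytes) -> None:
        """Add bytes to the end of the stored message."""

    @abstractmethod
    def read(self, offset: int, size: int) -> bytes:
        """Return up to size bytes starting at offset."""


class MemoryStore(MessageStore):
    def __init__(self):
        self._buf = bytearray()

    def append(self, data):
        self._buf += data

    def read(self, offset, size):
        return bytes(self._buf[offset:offset + size])


class DiskStore(MessageStore):
    def __init__(self, path):
        self._path = path
        open(path, "wb").close()       # create or truncate the backing file

    def append(self, data):
        with open(self._path, "ab") as f:
            f.write(data)

    def read(self, offset, size):
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(size)


def header_block(store: MessageStore, limit=65536) -> bytes:
    """Read only the top-level headers, whichever store holds the message."""
    head = store.read(0, limit)
    end = head.find(b"\r\n\r\n")
    return head if end == -1 else head[:end + 4]


with tempfile.TemporaryDirectory() as d:
    stores = [MemoryStore(), DiskStore(os.path.join(d, "msg.eml"))]
    for store in stores:
        store.append(b"Subject: hi\r\n\r\n")
        store.append(b"body text")
    results = [header_block(s) for s in stores]
```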
>
> But to me, it sounds like the foundation upon which the whole email 
> lib should be built, not something that is shoveled in later.
>
> A further note about access to data parts... clearly "data for the 
> whole MIME part" could be provided, but even for a single part that 
> could be large.  So access to smaller chunks might be desired.
>
> The data access/conversion functions, therefore, should support a 
> buffer-at-a-time access interface.  Base64 supports random access 
> easily, unless it contains characters outside the 64-character 
> alphabet, which are to be ignored and which could throw off the 
> size calculations.  So maybe 
> providing sequential buffer-at-a-time access with rewind is the best 
> that can be done -- quoted-printable doesn't support random access 
> very well, and neither would some sort of compression or encryption 
> technique -- they usually like to start from the beginning -- and 
> those are the sorts of things that I would consider likely to be 
> standardized in the future, to reduce the size of the payload, and to 
> increase the security of the payload.
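
A sketch of the sequential, buffer-at-a-time case for base64, using
only the standard base64 and re modules.  The function name is made up,
and the encoded chunks may be cut at arbitrary points: characters
outside the alphabet are dropped, and any incomplete 4-character group
is carried over to the next buffer.

```python
import base64
import re

# Anything that is not a base64 alphabet character or padding is ignored.
_NOT_B64 = re.compile(rb"[^A-Za-z0-9+/=]")


def decode_base64_stream(chunks):
    """Yield decoded bytes for each encoded chunk, in order."""
    carry = b""
    for chunk in chunks:
        data = carry + _NOT_B64.sub(b"", chunk)
        usable = len(data) - len(data) % 4   # whole 4-character groups only
        carry = data[usable:]
        if usable:
            yield base64.b64decode(data[:usable])
    if carry:
        raise ValueError("truncated base64 input")


payload = bytes(range(256)) * 10
encoded = base64.encodebytes(payload)        # includes a newline every 76 chars
# Cut the encoded text at offsets unrelated to lines or 4-character groups.
chunks = [encoded[i:i + 50] for i in range(0, len(encoded), 50)]
decoded = b"".join(decode_base64_stream(chunks))
```

"Rewind" here just means restarting the generator from the first
buffer.  With no ignored characters, decoded byte n would sit in the
4-character group starting at encoded offset 4 * (n // 3), which is the
random access case; the line breaks are what spoil that arithmetic.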
>

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


