[Email-SIG] Handling large emails: DiskMessage and DiskFeedParser

Menno Smits menno at freshfoo.com
Mon Nov 1 23:47:29 CET 2004


Barry Warsaw wrote:

>>An alternative solution I've been thinking of... what if we abstract 
>>message payloads to a "Payload" class? We could have MemoryPayload for 
>>in-memory storage (the default), TmpFilePayload for temporary disk 
>>storage etc etc. The read/write interface to the payload would always be 
>>the same and all Message methods would only ever access the payload via 
>>the API. Each Message instance would have exactly one MessagePayload 
>>instance internally. I realise this would be a big change and probably 
>>isn't suited for Python 2.4 but do you think this is useful?
> 
> 
> It might be the right way to do it, much like headers can be strings or
> instances of Header.  I don't think we can really do either for Python
> 2.4, but we can continue to pursue this for email 3.1 / Python 2.5.

I really think a separate payload storage class is the right way to do 
it too. The trick of course is to get the interface right. I've been 
thinking about how to do it this morning. Here's a first attempt:

class PayloadStorage:
    def __init__(self):
        '''Payload-specific initialisation'''

    def write(self, buf):
        '''Add new data to the end of the payload'''

    def readblocks(self, blocksize):
        '''Iterate over the payload data, returning fixed-size blocks
        '''

    def close(self):
        '''Close/clean up the Payload instance
        '''

Some things to consider...

I've gone with an iterator interface for reading out the payload data 
because a classic file-like read interface would get messy in a 
multi-threaded/forking situation. For example, if a subclass of 
PayloadStorage were to keep the payload in a disk file, each call to 
readblocks() could re-open the file for reading and yield the payload in 
blocks. This would be difficult to achieve with a standard file-like read.
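
To make the re-open-per-call idea concrete, here is a minimal sketch in 
present-day Python. The class name DiskPayloadStorage and the use of 
tempfile are assumptions for illustration only; nothing like this exists 
in the email package. Each call to readblocks() opens its own read handle, 
so concurrent readers never share a file position.

```python
import os
import tempfile


class DiskPayloadStorage:
    """Hypothetical disk-backed payload store (illustration only)."""

    def __init__(self):
        # Create the backing file; keep its path and a single write handle.
        fd, self._path = tempfile.mkstemp(prefix="payload-")
        self._writer = os.fdopen(fd, "wb")

    def write(self, buf):
        """Add new data (bytes) to the end of the payload."""
        self._writer.write(buf)

    def readblocks(self, blocksize):
        """Yield the payload in blocks of at most blocksize bytes."""
        # Push buffered writes to disk before reading them back.
        if not self._writer.closed:
            self._writer.flush()
        # A fresh handle per call: every iterator has its own position.
        with open(self._path, "rb") as reader:
            while True:
                block = reader.read(blocksize)
                if not block:
                    break
                yield block

    def close(self):
        """Close the write handle and delete the backing file."""
        self._writer.close()
        if os.path.exists(self._path):
            os.remove(self._path)
```

Two iterators obtained from separate readblocks() calls can be consumed 
independently, which is the property a single shared read() cursor would 
not give.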

This approach only allows sequential writing and reading of the payload 
data. I've checked through various parts of the email library and I 
can't find any obvious places where this would be a problem, although 
some refactoring will be required in places. Can anyone see a part of the 
library where random access to the payload data is required?

Note that in order for this approach to work, FeedParser/Parser will need 
to accept a PayloadStorage factory class as an option, for use when 
creating message instances.
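
As a rough illustration of the factory option, the toy parser below takes 
a payload_factory argument and calls it once per message. SketchParser, 
SketchMessage, MemoryPayloadStorage and the payload_factory parameter are 
all hypothetical names; this is not the real FeedParser API, and the 
header handling is deliberately simplistic.

```python
class MemoryPayloadStorage:
    """Hypothetical in-memory payload store (the default in this sketch)."""

    def __init__(self):
        self._chunks = []

    def write(self, buf):
        self._chunks.append(buf)

    def readblocks(self, blocksize):
        data = b"".join(self._chunks)
        for start in range(0, len(data), blocksize):
            yield data[start:start + blocksize]

    def close(self):
        self._chunks = []


class SketchMessage:
    """Toy message: a header list plus exactly one payload store."""

    def __init__(self, payload):
        self.headers = []
        self.payload = payload


class SketchParser:
    """Toy line-fed parser; stands in for FeedParser for illustration."""

    def __init__(self, payload_factory=MemoryPayloadStorage):
        # The factory is called once per message being built.
        self._message = SketchMessage(payload_factory())
        self._in_body = False

    def feed_line(self, line):
        if self._in_body:
            # Body data goes straight to storage, never into one big string.
            self._message.payload.write(line)
        elif line in (b"\n", b"\r\n"):
            # A blank line separates the headers from the body.
            self._in_body = True
        else:
            name, _, value = line.partition(b":")
            self._message.headers.append((name.strip(), value.strip()))

    def close(self):
        return self._message
```

Passing a different class (for example a disk-backed store with the same 
four methods) as payload_factory would change where body data lives 
without touching the parsing logic.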

As Barry suggested, I see this class working alongside the existing 
string-based payload code. A message's payload could be either a string 
or a PayloadStorage instance. This is needed for backwards compatibility.
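
One way code inside the library might cope with both kinds of payload is 
a small helper that yields blocks either way. The helper name 
iter_payload_blocks and the ListStorage stand-in are assumptions made up 
for this sketch, which is written in present-day Python.

```python
def iter_payload_blocks(payload, blocksize=8192):
    """Yield blocks from a string payload or a PayloadStorage-like object."""
    if isinstance(payload, (str, bytes)):
        # Backwards-compatible path: slice the in-memory string.
        for start in range(0, len(payload), blocksize):
            yield payload[start:start + blocksize]
    else:
        # New path: defer to the storage object's own iterator.
        yield from payload.readblocks(blocksize)


class ListStorage:
    """Hypothetical stand-in for a PayloadStorage instance."""

    def __init__(self, data):
        self._data = data

    def readblocks(self, blocksize):
        for start in range(0, len(self._data), blocksize):
            yield self._data[start:start + blocksize]
```

Callers such as a generator could then loop over iter_payload_blocks() 
without caring which representation a given message uses.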

Thoughts/feedback anyone?

Regards,
Menno


ps. Barry: unfortunately I don't have as much time up my sleeve to play 
with this as I had hoped. I've had to suddenly move to another city and 
am in the middle of the mess at the moment. I'm still keen to work on 
this however and will keep at it when I have time.

pps. Matthew (matt at mondoinfo.com): The company I work for 
definitely has a need for this sort of thing and I can think of at least 
one other external project where holding large messages in RAM is a 
problem. There's definitely a case for this :)

