[Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
Menno Smits
menno at freshfoo.com
Mon Nov 1 23:47:29 CET 2004
Barry Warsaw wrote:
>>An alternative solution I've been thinking of... what if we abstract
>>message payloads to a "Payload" class? We could have MemoryPayload for
>>in-memory storage (the default), TmpFilePayload for temporary disk
>>storage etc etc. The read/write interface to the payload would always be
>>the same and all Message methods would only ever access the payload via
>>the API. Each Message instance would have exactly one MessagePayload
>>instance internally. I realise this would be a big change and probably
>>isn't suited for Python 2.4 but do you think this is useful?
>
>
> It might be the right way to do it, much like headers can be strings or
> instances of Header. I don't think we can really do either for Python
> 2.4, but we can continue to pursue this for email 3.1 / Python 2.5.
I really think a separate payload storage class is the right way to do
it too. The trick of course is to get the interface right. I've been
thinking about how to do it this morning. Here's a first attempt:
class PayloadStorage:
    def __init__(self):
        '''Payload specific initialisation'''

    def write(self, buf):
        '''Add new data to the end of the payload'''

    def readblocks(self, blocksize):
        '''Iterate over the payload data, returning fixed-size blocks
        '''

    def close(self):
        '''Close/cleanup the Payload instance
        '''
Some things to consider...
I've gone with an iterator interface for reading out the payload data
because a classic file-like read interface would get messy in a
multi-threaded/forking situation. For example, if a subclass of
PayloadStorage were to keep the payload in a disk file, each call to
readblocks() could re-open the file for reading and yield the payload in
blocks. This would be difficult to achieve with a standard file-like read.
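To make the disk-file case concrete, here is a rough sketch of what such a subclass might look like. The class name TmpFilePayloadStorage and the use of tempfile.mkstemp() are my assumptions for illustration only, not a settled design. Each readblocks() call opens its own read handle, so several readers can iterate independently:

```python
import os
import tempfile


class TmpFilePayloadStorage:
    """Hypothetical disk-backed payload following the proposed interface."""

    def __init__(self):
        # Keep one append-only handle for writing; readers open their own.
        fd, self.path = tempfile.mkstemp()
        self._writer = os.fdopen(fd, "wb")

    def write(self, buf):
        """Add new data to the end of the payload."""
        self._writer.write(buf)

    def readblocks(self, blocksize):
        """Yield the payload in blocks of at most blocksize bytes."""
        # Flush so a freshly opened reader sees everything written so far.
        self._writer.flush()
        with open(self.path, "rb") as reader:
            while True:
                block = reader.read(blocksize)
                if not block:
                    break
                yield block

    def close(self):
        """Close the write handle and delete the temporary file."""
        self._writer.close()
        os.remove(self.path)
```

Note that the final block can be shorter than blocksize, so "fixed sized" really means "at most blocksize" in this sketch.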
This approach only allows sequential writing and reading of the payload
data. I've checked through various parts of the email library and I
can't find any obvious places where this would be a problem although
some refactoring will be required in parts. Can anyone see a part of the
library where random access to the payload data is required?
Note that in order for this approach to work FeedParser/Parser will need
to be able to take a PayloadStorage factory class option for use when
creating message instances.
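As a sketch of how the factory option might be threaded through, here is a toy parser. SketchParser, SketchMessage, MemoryPayloadStorage and the payload_factory argument are all hypothetical names used for illustration; this is not the real FeedParser API and it ignores MIME structure entirely:

```python
import io


class MemoryPayloadStorage:
    """Hypothetical in-memory payload with the proposed interface."""

    def __init__(self):
        self._buf = io.BytesIO()

    def write(self, buf):
        self._buf.write(buf)

    def readblocks(self, blocksize):
        data = self._buf.getvalue()
        for start in range(0, len(data), blocksize):
            yield data[start:start + blocksize]

    def close(self):
        self._buf.close()


class SketchMessage:
    """Stand-in for Message: a header list plus exactly one payload object."""

    def __init__(self, payload):
        self.headers = []
        self.payload = payload


class SketchParser:
    """Toy parser (headers, blank line, body) that takes a payload factory."""

    def __init__(self, payload_factory=MemoryPayloadStorage):
        # Any zero-argument callable returning a PayloadStorage-like object.
        self._payload_factory = payload_factory

    def parse(self, lines):
        msg = SketchMessage(self._payload_factory())
        in_body = False
        for line in lines:
            if in_body:
                # Body data goes straight to the storage, never into a string.
                msg.payload.write(line)
            elif line in (b"\n", b"\r\n"):
                in_body = True
            else:
                name, _, value = line.partition(b":")
                msg.headers.append((name.strip(), value.strip()))
        return msg
```

A caller who wants disk storage would pass a different class as payload_factory; the parser itself never needs to know which kind of storage it is writing to.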
As Barry suggested, I see this class working alongside the existing
string-based payload code. A message's payload could be either a string
or a PayloadStorage instance. This is needed for backwards compatibility.
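Code that consumes payloads would then have to cope with both forms. One way to hide the difference, sketched under the assumption that a storage object exposes readblocks() as proposed above (iter_payload_blocks and ListPayload are hypothetical names, the latter existing only to exercise the helper):

```python
def iter_payload_blocks(payload, blocksize=8192):
    """Yield a payload in blocks, whether it is a string or a storage object."""
    if isinstance(payload, (str, bytes)):
        # Existing string-based payloads: slice them into blocks.
        for start in range(0, len(payload), blocksize):
            yield payload[start:start + blocksize]
    else:
        # Assumed to follow the proposed PayloadStorage interface.
        yield from payload.readblocks(blocksize)


class ListPayload:
    """Minimal stand-in for a PayloadStorage instance."""

    def __init__(self, data):
        self._data = data

    def readblocks(self, blocksize):
        for start in range(0, len(self._data), blocksize):
            yield self._data[start:start + blocksize]
```

With a helper like this, callers such as a generator could write out either kind of payload block by block without checking its type themselves.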
Thoughts/feedback anyone?
Regards,
Menno
ps. Barry: unfortunately I don't have as much time up my sleeve to play
with this as I had hoped. I've had to suddenly move to another city and
am in the middle of the mess at the moment. I'm still keen to work on
this however and will keep at it when I have time.
pps. Matthew (matt at mondoinfo.com): The company I work for
definitely has a need for this sort of thing and I can think of at least
one other external project where holding large messages in RAM is a problem.
There's definitely a case for this :)