[Email-SIG] Handling large emails: DiskMessage and DiskFeedParser

Menno Smits menno at netbox.biz
Mon May 24 21:15:41 EDT 2004


Hi all,

FeedParser is great because it doesn't load the entire message into 
memory during parsing (yes, I realise there are other reasons for 
FeedParser existing too). However, once the message is parsed, the 
attachment bodies are still loaded entirely into memory when Message 
instances are created and populated. This is a big problem for 
real-world environments where large messages are possible: all available 
memory is consumed and the machine grinds to a halt. We see large 
(40MB+) emails all the time, and problems start to occur when several of 
these are being processed simultaneously.

To cope with this problem I've created two classes, DiskMessage and 
DiskFeedParser (see http://oss.netboxblue.com).

DiskMessage is a simple subclass of Message that stores message payloads 
in temporary files instead of RAM. Its API is compatible with the 
standard Message class, although to truly avoid loading the entire 
message into memory you need to use some extra methods. See the source 
for details.
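
To illustrate the general idea (this is a minimal sketch, not the actual 
DiskMessage source), a Message subclass can spool a payload to a temporary 
file and hand it back in chunks. The class name DiskBackedMessage and the 
methods set_payload_from_chunks() and iter_payload() are hypothetical names 
invented for this example, and the code assumes a current Python 3 rather 
than 2.3.

```python
import tempfile
from email.message import Message


class DiskBackedMessage(Message):
    """Hypothetical sketch: keep a leaf payload in a temp file, not in RAM."""

    def __init__(self):
        Message.__init__(self)
        self._payload_file = None  # set once a payload has been spooled

    def set_payload_from_chunks(self, chunks):
        # Hypothetical "extra method": write the payload piece by piece so
        # the whole body never has to exist as a single string.
        f = tempfile.TemporaryFile(mode="w+", newline="", encoding="utf-8")
        for chunk in chunks:
            f.write(chunk)
        self._payload_file = f

    def iter_payload(self, chunk_size=64 * 1024):
        # Hypothetical "extra method": read the payload back in bounded
        # pieces of at most chunk_size characters.
        self._payload_file.seek(0)
        while True:
            chunk = self._payload_file.read(chunk_size)
            if not chunk:
                break
            yield chunk

    def get_payload(self, i=None, decode=False):
        # Keeps the standard call working, but a spooled payload is read
        # fully into memory here, and decode is ignored in this sketch.
        if self._payload_file is None:
            return Message.get_payload(self, i, decode)
        self._payload_file.seek(0)
        return self._payload_file.read()


msg = DiskBackedMessage()
msg.set_payload_from_chunks(["first part, ", "second part"])
for piece in msg.iter_payload(chunk_size=8):
    pass  # process each piece without holding the whole payload
```

Calling get_payload() still works for code that expects the standard API, 
at the cost of reading the whole payload back into memory; only the chunked 
methods keep memory use bounded.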

DiskFeedParser is a hack of the current FeedParser that uses the extra 
methods of DiskMessage to avoid ever loading message payloads into 
memory. If anyone wants to try cleanly subclassing FeedParser for this 
purpose instead of just hacking it, I'd like to see the results.
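
For context, here is a small sketch of how the stock FeedParser is driven 
incrementally and how a custom message class can be plugged in through its 
_factory argument. It is not DiskFeedParser: TaggedMessage is a hypothetical 
placeholder for a disk-backed class, parse_in_chunks() is a helper invented 
for this example, and the code assumes a current Python 3. As far as I can 
tell, the stock parser still gathers each body in memory before handing it 
to the message object, so a factory alone does not remove the memory cost.

```python
import io
from email.feedparser import FeedParser
from email.message import Message


class TaggedMessage(Message):
    """Hypothetical placeholder for a disk-backed Message subclass."""


def parse_in_chunks(fileobj, factory=Message, chunk_size=8192):
    # Feed the parser bounded pieces of text instead of the whole message;
    # the parser creates every message object by calling `factory`.
    parser = FeedParser(_factory=factory)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()


raw = (
    "From: a@example.com\n"
    "Subject: test\n"
    "\n"
    "body line one\n"
    "body line two\n"
)
# A tiny chunk size forces several feed() calls, some ending mid-line.
msg = parse_in_chunks(io.StringIO(raw), factory=TaggedMessage, chunk_size=16)
```

The returned root object is an instance of the factory class, which is the 
hook a disk-backed message class would rely on.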

Some informal tests of memory usage after parsing a 25MB email (2 large 
attachments), Python 2.3.3:
                                   VSZ      RSS
Parser with Message:              31840    25088
DiskFeedParser with DiskMessage:  12372    6128

Note that these classes haven't been tested extensively but seem to 
work. Any feedback would be greatly appreciated.

Regards,
Menno

-- 
Menno Smits, Senior Development Engineer
NetBox       http://netbox.biz  |  Voice        +61 500 555 357
Oxcoda       http://oxcoda.com  |  Fax          +61 500 555 358


