[Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
Menno Smits
menno at netbox.biz
Mon May 24 21:15:41 EDT 2004
Hi all,
FeedParser is great because it doesn't load the entire message into
memory during parsing (yes, I realise there are other reasons for
FeedParser existing too). However, once the message is parsed, the
attachment bodies are still loaded entirely into memory when Message
instances are created and populated. This is a big problem for
real-world environments where large messages are possible: all available
memory is consumed and the machine grinds to a halt. We see large
(40MB+) emails all the time, and problems start to occur when several of
these are being processed simultaneously.
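For anyone who hasn't used it, here is a rough sketch of feeding the
standard FeedParser in chunks. It is written against the current
email.feedparser module rather than the 2004 one, and the helper name
parse_in_chunks and the chunk size are just assumptions for
illustration. Only one chunk of raw source is held at a time while
feeding, but the Message returned by close() still keeps every payload
in memory as an ordinary string.

```python
import io
from email.feedparser import FeedParser


def parse_in_chunks(fileobj, chunk_size=8192):
    """Feed a text-mode file object to FeedParser one chunk at a time.

    parse_in_chunks is a hypothetical helper name, not part of the
    email package.
    """
    parser = FeedParser()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # the parser buffers partial lines itself
    # close() returns the root Message; its payloads live fully in RAM.
    return parser.close()


if __name__ == "__main__":
    raw = "Subject: hi\n\nbody text\n"
    msg = parse_in_chunks(io.StringIO(raw), chunk_size=5)
    print(msg["Subject"], len(msg.get_payload()))
```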
To cope with this problem I've created two classes, DiskMessage and
DiskFeedParser (see http://oss.netboxblue.com).
DiskMessage is a simple subclass of Message that stores message payloads
in temporary files instead of RAM. Its API is compatible with the
standard Message class, although to truly avoid loading the entire
message into memory you need to use some extra methods. See the source
for details.
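To give a feel for the idea, here is a minimal sketch of a Message
subclass that spools its payload to a temporary file. This is not the
actual DiskMessage code: the class name SpooledPayloadMessage and the
methods set_payload_from_chunks and iter_payload are made up for
illustration, and it uses the current email and tempfile modules. It
only handles a single text payload, not multipart containers.

```python
import tempfile
from email.message import Message


class SpooledPayloadMessage(Message):
    """Hypothetical sketch: keeps a text payload in a temp file, not RAM."""

    _payload_file = None  # set once a payload has been spooled to disk

    def set_payload_from_chunks(self, chunks):
        """Extra method (an assumption): write the payload chunk by chunk."""
        f = tempfile.TemporaryFile(mode="w+", encoding="utf-8", newline="")
        for chunk in chunks:
            f.write(chunk)
        self._payload_file = f

    def iter_payload(self, chunk_size=8192):
        """Extra method (an assumption): read the payload back in chunks."""
        if self._payload_file is None:
            return
        self._payload_file.seek(0)
        while True:
            chunk = self._payload_file.read(chunk_size)
            if not chunk:
                break
            yield chunk

    def get_payload(self, i=None, decode=False):
        """Same signature as Message.get_payload; loads the whole body."""
        if self._payload_file is None:
            return super().get_payload(i, decode)
        # Load the body only for the duration of this call.
        self._payload = "".join(self.iter_payload())
        try:
            return super().get_payload(i, decode)
        finally:
            self._payload = None


if __name__ == "__main__":
    msg = SpooledPayloadMessage()
    msg["Subject"] = "big attachment"
    msg.set_payload_from_chunks(["abc", "def"])
    for piece in msg.iter_payload(chunk_size=4):
        print(piece)
```

Callers that stick to get_payload() see the usual behaviour but pay the
full memory cost for that call; callers that use iter_payload() never
hold more than one chunk at a time.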
DiskFeedParser is a hack of the current FeedParser that uses the extra
methods of DiskMessage to avoid ever loading message payloads into
memory. If anyone wants to try cleanly subclassing FeedParser for this
purpose instead of just hacking it, I'd like to see the results.
Some informal tests of memory usage after parsing a 25MB email (2 large
attachments), Python 2.3.3:
                                    VSZ     RSS
Parser with Message:              31840   25088
DiskFeedParser with DiskMessage:  12372    6128
Note that these classes haven't been tested extensively but seem to
work. Any feedback would be greatly appreciated.
Regards,
Menno
--
Menno Smits, Senior Development Engineer
NetBox http://netbox.biz | Voice +61 500 555 357
Oxcoda http://oxcoda.com | Fax +61 500 555 358