[Help]: mailbox classes

James J. Besemer jb at cascade-sys.com
Fri Sep 20 13:48:24 EDT 2002


Jeff Davis wrote:

>Ok, I'll assume you're using mbox format, but this is easily adapted for 
>maildir, etc.
>
>import mailbox
>mbox_file = '/path/to/mbox'
>
>mbox_fp = open(mbox_file)
>mbox = mailbox.UnixMailbox(mbox_fp)
>msg = mbox.next()
>
>There, now you have a message object (an rfc822.Message object, to be 
>exact). Now just import the rfc822 module and do whatever you need. You'll 
>probably want to use a loop to process all of the messages.
>
I've found these packages to be somewhat fragile.  In particular, 
real-world email includes a lot of spam, and a fair amount of spam does 
not conform 100% to the MIME encoding rules.  A frequent error is that a 
surprising number of MIME-encoded messages omit the closing boundary 
marker.  This often results in errors while parsing that message and 
the loss of all messages following that point.  IIRC the error shows up 
as an unexpected EOF after all the remaining messages have been 
mistakenly skipped while looking for that missing boundary.

I was looking to Python and the email classes to whip up a quick and 
dirty spam filter, so "filter out the spam first" is not a solution to 
my problem.  FWIW, Netscape does not have any problem with this same data.

I've had moderate success by reading all the messages into a string, 
splitting the string on "^From ", passing individual messages to the 
email parsing classes, and treating the unexpected EOF exception 
like a real EOF.  But with hundreds of messages daily, that's a rather 
clumsy and expensive operation.  It also defeats the effort the email 
classes make to avoid holding everything in memory.
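
For what it's worth, the workaround looks roughly like the sketch below. 
It is written for Python 3 and is only an illustration: the names 
split_mbox_text and parse_all are made up for this example, and catching 
email.errors.MessageError stands in for the unexpected EOF exception 
described above.  It also assumes that body lines starting with "From " 
have been escaped (">From "), as mbox writers normally do.

```python
import email
import email.errors
import re

# A "From " at the start of a line marks the start of each mbox message.
FROM_LINE = re.compile(r"^From ", re.MULTILINE)


def split_mbox_text(text):
    """Split the full text of an mbox file into one string per message."""
    starts = [m.start() for m in FROM_LINE.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def parse_all(text):
    """Parse each chunk on its own so one bad message cannot swallow the rest."""
    messages = []
    for chunk in split_mbox_text(text):
        # Drop the "From " envelope line; the headers start on the next line.
        _, _, raw = chunk.partition("\n")
        try:
            messages.append(email.message_from_string(raw))
        except email.errors.MessageError:
            # Treat a parse failure as the end of this one message only.
            continue
    return messages
```

Because every message is parsed from its own string, a missing closing 
boundary can at worst damage that one message.  The cost is the one 
already mentioned: the whole mailbox sits in memory at once.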

I'm curious if I'm missing something obvious here and if there's some 
easy way to get mailbox to work with large amounts of real world email 
with imperfect encodings.

If possible, perhaps mailbox should have some absolute sense of the end 
or beginning of a message ("^From "?) that overrides mime boundaries.   
That MUST be how Netscape does it.
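
To make the idea concrete, here is a rough sketch in Python 3 of a reader 
that treats "From " lines as the only message delimiter and hands each 
message to the parser separately.  This is not how the mailbox module is 
implemented; iter_raw_messages and iter_messages are hypothetical names 
for this example, and it assumes "From " lines inside message bodies have 
been escaped as ">From ".

```python
import email


def iter_raw_messages(lines):
    """Yield one message at a time as a string, splitting only on "From " lines.

    `lines` is any iterable of text lines, such as an open file, so only
    the current message is held in memory.
    """
    current = None
    for line in lines:
        if line.startswith("From "):
            if current is not None:
                yield "".join(current)
            # Start a new message; the envelope line itself is not kept.
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        yield "".join(current)


def iter_messages(path):
    """Parse an mbox file message by message, ignoring MIME boundaries
    when deciding where one message ends and the next begins."""
    with open(path, errors="replace") as fp:
        for raw in iter_raw_messages(fp):
            yield email.message_from_string(raw)
```

Since the MIME parser never sees more than one message, a missing 
boundary cannot make it run on into the messages that follow.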

--jb

-- 
James J. Besemer		503-280-0838 voice
2727 NE Skidmore St.		503-280-0375 fax
Portland, Oregon 97211-6557	mailto:jb at cascade-sys.com
				http://cascade-sys.com	







