[Help]: mailbox classes
James J. Besemer
jb at cascade-sys.com
Fri Sep 20 13:48:24 EDT 2002
Jeff Davis wrote:
>Ok, I'll assume you're using mbox format, but this is easily adapted for
>maildir, etc.
>
>import mailbox
>mbox_file = '/path/to/mbox'
>
>mbox_fp = open(mbox_file)
>mbox = mailbox.UnixMailbox(mbox_fp)
>msg = mbox.next()
>
>There, now you have a message object (an rfc822.Message object, to be
>exact). Now just import the rfc822 module and do whatever you need. You'll
>probably want to use a loop to process all of the messages.
>
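For anyone reading this on current Python, where mailbox.UnixMailbox and the rfc822 module no longer exist, the loop suggested above might look roughly like the sketch below. It assumes Python 3 and the standard mailbox.mbox class; the helper name list_subjects is made up for illustration.

```python
import mailbox

def list_subjects(path):
    """Return the Subject header of every message in the mbox file at `path`.

    Hypothetical helper, shown only to illustrate looping over a mailbox.
    """
    # create=False: raise an error instead of silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        # Iterating over the mailbox yields one message object per message.
        return [msg["Subject"] for msg in box]
    finally:
        box.close()
```

Calling list_subjects('/path/to/mbox') would return a list with one entry per message; a message without a Subject header contributes None.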
I've found these packages to be somewhat fragile. In particular, real-world
email includes a lot of spam, and a fair amount of spam does not conform
100% to the MIME encoding rules. A frequent error is that a surprising
number of MIME-encoded messages omit the closing boundary marker. This
often results in errors while parsing that message and the loss of all
messages following that point. IIRC the error surfaces as an unexpected EOF,
after all the remaining messages have been mistakenly skipped while looking
for the missing boundary.

I was looking to Python and the email classes to whip up a quick and
dirty spam filter, so "filter out the spam first" is not a solution to
my problem. FWIW, Netscape does not have any problem with this same data.

I've had moderate success by reading all the messages into a string,
splitting the string on "^From ", passing individual messages to the
email parsing classes, and treating the unexpected EOF exception
like a real EOF. But with hundreds of messages daily, that's a rather
clumsy and expensive operation. It also defeats the effort the email
classes make to avoid holding everything in memory.
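
A minimal sketch of the workaround just described, written for current Python 3 rather than the 2002-era modules. The names split_mbox_text and parse_messages are made up for illustration. As far as I know, the modern email parser usually records problems such as a missing closing boundary in msg.defects instead of raising, so the except clause here is mostly a safety net.

```python
import email
import re
from email.errors import MessageError

# In mbox format, a message starts at a line beginning with "From "
# (the envelope line), so that is what we split on.
FROM_LINE = re.compile(r"^From ", re.MULTILINE)

def split_mbox_text(text):
    """Split the full text of an mbox file into one string per message."""
    starts = [m.start() for m in FROM_LINE.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

def parse_messages(text):
    """Parse each chunk on its own so one bad message cannot swallow the rest."""
    parsed = []
    for chunk in split_mbox_text(text):
        # Drop the "From " envelope line; the real headers start on the next line.
        _, _, raw = chunk.partition("\n")
        try:
            parsed.append(email.message_from_string(raw))
        except MessageError:
            # Treat a parse failure as the end of this one message only.
            continue
    return parsed
```

This still reads the whole mailbox into memory, which is exactly the cost complained about above; it only limits the damage a malformed message can do to that one message.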

I'm curious if I'm missing something obvious here and if there's some
easy way to get mailbox to work with large amounts of real-world email
with imperfect encodings.

If possible, perhaps mailbox should have some absolute sense of the end
or beginning of a message ("^From "?) that overrides MIME boundaries.
That MUST be how Netscape does it.
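
To make the idea concrete, here is a rough sketch (current Python 3, hypothetical function names iter_raw_messages and iter_messages) of a reader that treats every line starting with "From " as a hard message boundary, regardless of MIME structure, and holds only one message in memory at a time. It is an illustration of the suggestion, not how mailbox or Netscape actually work.

```python
import email

def iter_raw_messages(lines):
    """Yield the text of one message at a time from an iterable of mbox lines."""
    current = None  # lines of the message being collected; None before the first "From "
    for line in lines:
        if line.startswith("From "):
            # A "From " line always ends the previous message,
            # whether or not its MIME boundaries were closed.
            if current is not None:
                yield "".join(current)
            current = []  # the envelope line itself is not part of the message
        elif current is not None:
            current.append(line)
    if current is not None:
        yield "".join(current)

def iter_messages(lines):
    """Parse each raw message independently of its neighbours."""
    for raw in iter_raw_messages(lines):
        yield email.message_from_string(raw)
```

An open file can be passed directly, for example "for msg in iter_messages(open('/path/to/mbox')): ...". The obvious weakness is a body line that itself begins with "From "; mbox writers conventionally escape such lines as ">From ", but a sketch this simple will split on any that slip through.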
--jb
--
James J. Besemer 503-280-0838 voice
2727 NE Skidmore St. 503-280-0375 fax
Portland, Oregon 97211-6557 mailto:jb at cascade-sys.com
http://cascade-sys.com