Parsing an mbox mail file
Sheila King
sheila at spamcop.net
Sat Jan 27 14:00:54 EST 2001
Oleg,
Thanks for your reply.
Someone suggested to me, in e-mail, taking a look at the code for the mailbox
module, to see how the module handles finding the next message. I see that it
is very similar to what you are doing, with the "seek" and "tell" and other
such file operations.
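
For anyone following along, here is a rough sketch of the "tell"/"seek" idea in current Python. It is not the mailbox module's actual code: the helper names (message_offsets, read_message) and the sample data are made up for illustration, and it simplifies things by treating any line that starts with "From " as the start of a message.

```python
import io

def message_offsets(f):
    """Return the byte offset of every line that starts with b'From '."""
    offsets = []
    while True:
        pos = f.tell()          # remember where this line begins
        line = f.readline()
        if not line:            # end of file
            break
        if line.startswith(b'From '):
            offsets.append(pos)
    return offsets

def read_message(f, offsets, index):
    """Seek back to a recorded offset and read one whole message."""
    start = offsets[index]
    f.seek(start)
    if index + 1 < len(offsets):
        return f.read(offsets[index + 1] - start)
    return f.read()             # last message runs to end of file

# Hypothetical sample data, only for this demonstration.
data = io.BytesIO(
    b"From alice@example.com Sat Jan 27 14:00:54 2001\n"
    b"Subject: first\n"
    b"\n"
    b"Body one.\n"
    b"\n"
    b"From bob@example.com Sat Jan 27 15:00:00 2001\n"
    b"Subject: second\n"
    b"\n"
    b"Body two.\n"
)

offsets = message_offsets(data)
second = read_message(data, offsets, 1)
print(second.decode('ascii'))
```

Because the offsets are recorded up front, a message's body can be read at any time, independent of how far the header scan has advanced.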
I will study this further. In the meantime, I've "solved" my problem by not
saving the messages in exactly UnixMailbox format. I've put a message
separator of '========\n' right before the header of each message. This means
I can no longer use the mailbox module, so I'm back to the rfc822 module.
Here is a simplified version of my current code:
-----------------------------------------------------
import rfc822
def readToMessageSeparator(infile):
    lines = []
    while (1):
        newline = infile.readline()
        if not newline:
            return lines
        if (newline != '========\n'):
            lines += [newline]
        else:
            return lines
###########################
### Main Program Begins ###
###########################
infile = open("spam4.txt", "r")
discard = readToMessageSeparator(infile)
## retrieve each message
while (1):
    header = rfc822.Message(infile)
    if not header:
        break
    headerString = ''.join(header.headers)
    body = ''.join(readToMessageSeparator(infile))
    currentmessage = headerString + '\n' + body
    print currentmessage
-----------------------------------------------------
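
The rfc822 module is gone from Python 3, so here is a sketch of the same separator approach using the email package instead. It assumes the same '========\n' separator line. The split_messages helper and the sample text standing in for spam4.txt are hypothetical, made up for this example.

```python
import io
from email import message_from_string

SEPARATOR = '========\n'  # same separator line as described above

def split_messages(infile):
    """Yield the raw text of each message that follows a separator line.

    Anything before the first separator is discarded.
    """
    chunk = []
    seen_separator = False
    for line in infile:
        if line == SEPARATOR:
            if seen_separator and chunk:
                yield ''.join(chunk)
            chunk = []
            seen_separator = True
        else:
            chunk.append(line)
    if seen_separator and chunk:
        yield ''.join(chunk)

# Hypothetical sample data standing in for spam4.txt.
sample = io.StringIO(
    "junk before the first message\n"
    "========\n"
    "From: a@example.com\n"
    "Subject: first\n"
    "\n"
    "Body one.\n"
    "========\n"
    "From: b@example.com\n"
    "Subject: second\n"
    "\n"
    "Body two.\n"
)

# Parse each raw chunk into a message object with headers and a body.
messages = [message_from_string(raw) for raw in split_messages(sample)]
for msg in messages:
    print(msg['Subject'], '|', msg.get_payload().strip())
```

Each parsed message keeps its headers and body together, so there is no need to read the body in a separate pass.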
--
Sheila King
http://www.thinkspot.net/sheila/
http://www.k12groups.org/
On Sat, 27 Jan 2001 15:58:21 +0300 (MSK), Oleg Broytmann <phd at phd.pp.ru> wrote
in comp.lang.python in article
<mailman.980600491.8027.python-list at python.org>:
:On Sat, 27 Jan 2001, Sheila King wrote:
:> If I use the mailbox module, and use mailboxInstance.next(), it will skip
:> right over the message body to the next message's header. The whole reason I'm
:> wanting to use the mailbox module, is so that I can easily get to the next
:> message in the file, and get its headers. So, I definitely want to use the
:> "next()" command. How can I read the message body in between calls to next?
:
:#! /usr/local/bin/python -O
:
:
:import sys, os
:infile = open(sys.argv[1], 'r')
:
:from mailbox import UnixMailbox
:mbox = UnixMailbox(infile)
:
:n = 1
:while 1:
:    pos = infile.tell()
:    from_ = infile.readline() # UnixMailbox ate the field From_ - but I want to preserve it
:    infile.seek(pos)
:
:    msg = mbox.next()
:    if msg is None: break
:
:    sys.stdout.write("%sProcessing message N%d" % (chr(13), n))
:    sys.stdout.flush()
:    n = n + 1
:
:    fp = msg.fp
:    fp.seek(0) # to the very beginning
:
:    outfile = open("_tmp", 'w')
:    outfile.write(from_)
:    outfile.write(fp.read()) # write the entire body at once
:    outfile.close()
:
:    os.system("%s _tmp >>error.log 2>&1" % sys.argv[2])
:
:infile.close()
:print
:os.remove("_tmp")
:
:Oleg.
:----
: Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru
: Programmers don't die, they just GOSUB without RETURN.
:
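
For readers on Python 3, where UnixMailbox no longer exists, the following is a sketch of roughly the same task using the mailbox.mbox class: step through the messages while keeping the From_ line, the headers and the body of each one. The read_mbox helper and the two-message sample are hypothetical, made up for this example, and the sample is written to a temporary file only so the sketch is self-contained.

```python
import mailbox
import os
import tempfile

# Hypothetical two-message mbox, used only for this demonstration.
SAMPLE = (
    "From alice@example.com Sat Jan 27 14:00:54 2001\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "Body one.\n"
    "\n"
    "From bob@example.com Sat Jan 27 15:00:00 2001\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "\n"
    "Body two.\n"
)

def read_mbox(path):
    """Return (from_line, subject, body) for every message in the mbox."""
    results = []
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            # get_from() returns the From_ line without the leading "From ".
            results.append((msg.get_from(), msg['Subject'], msg.get_payload()))
    finally:
        box.close()
    return results

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.mbox")
    with open(path, "w") as f:
        f.write(SAMPLE)
    parsed = read_mbox(path)

for from_line, subject, body in parsed:
    print(from_line, '|', subject, '|', body.strip())
```

Here the From_ line and the body are both available from the message object itself, so the tell/seek bookkeeping around each call is not needed.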