Parsing an mbox mail file

Sheila King sheila at spamcop.net
Sat Jan 27 14:00:54 EST 2001


Oleg,

Thanks for your reply.

Someone suggested to me, by e-mail, that I look at the source of the mailbox
module to see how it finds the next message. I see that it is very similar to
what you are doing, using "seek", "tell", and other such file operations.

I will study this further. In the meantime, I've "solved" my problem by not
saving the messages in exactly UnixMailbox format. I've put a message
separator of '========\n' right before the header of each message. This means
I can no longer use the mailbox module, so I'm back to the rfc822 module.

Here is a simplified version of my current code:

-----------------------------------------------------
import rfc822

def readToMessageSeparator(infile):
	# Collect lines up to (but not including) the next separator or EOF.
	lines = []
	while 1:
		newline = infile.readline()
		# readline() returns '' at end of file
		if not newline or newline == '========\n':
			return lines
		lines.append(newline)

###########################
### Main Program Begins ###
###########################

infile = open("spam4.txt", "r")
discard = readToMessageSeparator(infile)

## retrieve each message
while 1:
	header = rfc822.Message(infile)
	# an empty header set means there are no more messages
	if not header:
		break
	headerString = ''.join(header.headers)
	body = ''.join(readToMessageSeparator(infile))
	currentmessage = headerString + '\n' + body
	print currentmessage
-----------------------------------------------------
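
For anyone reading this on Python 3, where the rfc822 module no longer exists, the same separator idea might look roughly like the sketch below. It assumes the standard email package as a stand-in for rfc822; the names read_to_separator and iter_messages and the sample data are made up for illustration and are not part of my program.

```python
import io
from email import message_from_string

SEPARATOR = '========\n'

def read_to_separator(infile):
    """Collect lines up to (not including) the next separator or end of file."""
    lines = []
    for line in iter(infile.readline, ''):
        if line == SEPARATOR:
            break
        lines.append(line)
    return lines

def iter_messages(infile):
    """Yield one email.message.Message per separator-delimited chunk."""
    read_to_separator(infile)  # discard anything before the first separator
    while True:
        chunk = read_to_separator(infile)
        if not chunk:
            # an empty chunk is treated as the end of the input
            break
        yield message_from_string(''.join(chunk))

# Hypothetical sample data standing in for the contents of spam4.txt.
sample = io.StringIO(
    '========\n'
    'From: a@example.com\n'
    'Subject: first\n'
    '\n'
    'Body one.\n'
    '========\n'
    'From: b@example.com\n'
    'Subject: second\n'
    '\n'
    'Body two.\n'
)

messages = list(iter_messages(sample))
for msg in messages:
    print(msg['Subject'], repr(msg.get_payload()))
```

Parsing each chunk as a whole keeps the headers and the body together in one object, so there is no need to glue headerString and body back together by hand.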

--
Sheila King
http://www.thinkspot.net/sheila/
http://www.k12groups.org/

On Sat, 27 Jan 2001 15:58:21 +0300 (MSK), Oleg Broytmann <phd at phd.pp.ru> wrote
in comp.lang.python in article
<mailman.980600491.8027.python-list at python.org>:

:On Sat, 27 Jan 2001, Sheila King wrote:
:> If I use the mailbox module, and use mailboxInstance.next(), it will skip
:> right over the message body to the next message's header. The whole reason I'm
:> wanting to use the mailbox module, is so that I can easily get to the next
:> message in the file, and get its headers. So, I definitely want to use the
:> "next()" command. How can I read the message body in between calls to next?
:
:#! /usr/local/bin/python -O
:
:
:import sys, os
:infile = open(sys.argv[1], 'r')
:
:from mailbox import UnixMailbox
:mbox = UnixMailbox(infile)
:
:n = 1
:while 1:
:   pos = infile.tell()
:   from_ = infile.readline() # UnixMailbox ate the field From_ - but I want to preserve it
:   infile.seek(pos)
:
:   msg = mbox.next()
:   if msg is None: break
:
:   sys.stdout.write("%sProcessing message N%d" % (chr(13), n))
:   sys.stdout.flush()
:   n = n + 1
:
:   fp = msg.fp
:   fp.seek(0) # to the very beginning
:
:   outfile = open("_tmp", 'w')
:   outfile.write(from_)
:   outfile.write(fp.read()) # write the entire body at once
:   outfile.close()
:
:   os.system("%s _tmp >>error.log 2>&1" % sys.argv[2])
:
:infile.close()
:print
:os.remove("_tmp")
:
:Oleg.
:----
:     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
:           Programmers don't die, they just GOSUB without RETURN.
:
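
In Python 3 the UnixMailbox class is gone, and the mailbox.mbox class takes its place. With it, the headers, the body, and the "From " envelope line are all reachable from the message object, so the tell/seek step above may not be needed. The following is only a sketch under that assumption; the file name sample.mbox and the two messages are invented for the example.

```python
import mailbox
import os
import tempfile

# Hypothetical two-message mailbox in mbox format.
raw = (
    'From alice@example.com Sat Jan 27 14:00:54 2001\n'
    'From: alice@example.com\n'
    'Subject: first\n'
    '\n'
    'Body one.\n'
    '\n'
    'From bob@example.com Sat Jan 27 15:00:00 2001\n'
    'From: bob@example.com\n'
    'Subject: second\n'
    '\n'
    'Body two.\n'
)

results = []
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.mbox')
    with open(path, 'w') as f:
        f.write(raw)

    mbox = mailbox.mbox(path, create=False)
    for msg in mbox:
        # get_from() returns the envelope line without the leading "From "
        envelope = msg.get_from()
        subject = msg['Subject']
        body = msg.get_payload()
        results.append((envelope, subject, body))
    mbox.close()

for envelope, subject, body in results:
    print(envelope, '|', subject, '|', body.strip())
```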



