[Tutor] Parsing an mbox mail file

Sheila King sheila@thinkspot.net
Sat, 27 Jan 2001 10:30:15 -0800


On Sat, 27 Jan 2001 01:36:25 -0800 (PST), Danny Yoo
<dyoo@hkn.eecs.berkeley.edu>  wrote about Re: [Tutor] Parsing an mbox mail
file:

:On Fri, 26 Jan 2001, Sheila King wrote:
:
:> import mailbox
:> 
:> infile = open("spam2.txt", "r")
:> messages = mailbox.UnixMailbox(infile)
:> 
:> while (1):
:> 	currentmssg = messages.next()
:> 	if (currentmssg ==None):
:> 		break
:> 	print currentmssg
:> --------------------------------------------------

:If we look at what Messages can do, we find near the bottom of:
:
:    http://python.org/doc/current/lib/message-objects.html
:
:that these Message instances should contain an "fp" file pointer that lets
:us look at the message body. 

Yes.

: So we could adjust your code like this:
:
:###
:    currentmssg = messages.next()
:    if (currentmssg ==None):
:        break
:    print currentmssg.fp.read()  # let's look at the msg contents 
:###

Interesting. I tried this out, and it mostly prints the bodies without the
headers, but not exactly: every once in a while, it also prints one to three
lines of the header.
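
For anyone reading this on a current Python 3, where mailbox.UnixMailbox no
longer exists, here is a minimal sketch of the same "print only the bodies"
idea using mailbox.mbox. The sample messages, the addresses, and the helper
name read_bodies are made up for illustration, and the sketch assumes plain
single-part messages, so get_payload() returns a string.

```python
import mailbox
import os
import tempfile

# Two tiny made-up messages in mbox format (each starts with a "From " line).
SAMPLE_MBOX = (
    "From alice@example.com Sat Jan 27 10:00:00 2001\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From bob@example.com Sat Jan 27 11:00:00 2001\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "\n"
    "Body of the second message.\n"
)


def read_bodies(path):
    """Return a list of (subject, body) pairs, one per message in the mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        # Each item is a parsed message; headers and body are already separated.
        return [(msg["Subject"], msg.get_payload()) for msg in box]
    finally:
        box.close()


# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".mbox", delete=False) as tmp:
    tmp.write(SAMPLE_MBOX)
    path = tmp.name

try:
    bodies = read_bodies(path)
    for subject, body in bodies:
        print(subject)
        print(body)
finally:
    os.remove(path)
```

Because the parser splits headers from the body itself, there is no need to
read from the underlying file pointer by hand.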

I am going to play around with this a bit more later. For now, I've
"solved" my problem by not saving the messages in strict mbox format.
Instead, I precede each message with a message separator line. Since I know
what the separator line is, I can read up to that line and then discard it
with no ill effects.

My message separator is '========\n'.
Since I'm not saving in strict mbox format, I can't use the UnixMailbox
class from the mailbox module. So I'm back to the rfc822 module.

Here is a simplified version of my current code:

---------------------------------------------------------
import rfc822

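def readToMessageSeparator(infile):
	# Collect lines up to (but not including) the separator line or end of file.
	lines = []
	while 1:
		newline = infile.readline()
		# An empty string means end of file; the separator ends this message.
		if not newline or newline == '========\n':
			return lines
		lines.append(newline)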

###########################
### Main Program Begins ###
###########################

infile = open("spam4.txt", "r")
discard = readToMessageSeparator(infile)

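## retrieve each message
while 1:
	# rfc822.Message reads the header lines and the blank line that ends them
	header = rfc822.Message(infile)
	# a Message with no headers is false; that is what we get at end of file
	if not header:
		break
	headerString = ''.join(header.headers)
	body = ''.join(readToMessageSeparator(infile))
	currentmessage = headerString + '\n' + body
	print currentmessage
infile.close()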
---------------------------------------------------------
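
The rfc822 module was later removed from Python, so the code above will not
run on Python 3. As a rough sketch of the same separator-based approach there,
the version below splits the file on the '========\n' line and hands each
chunk to email.message_from_string. The function name split_messages and the
sample data are hypothetical. It assumes that every message is preceded by the
separator and that no body line is exactly the separator line.

```python
import email
import io

SEPARATOR = "========\n"  # same separator line as in the code above


def split_messages(infile, separator=SEPARATOR):
    """Yield the raw text of each message that follows a separator line."""
    chunk = None  # stays None until the first separator has been seen
    for line in infile:
        if line == separator:
            if chunk:
                yield "".join(chunk)
            chunk = []
        elif chunk is not None:
            chunk.append(line)
        # Lines before the first separator are discarded.
    if chunk:
        yield "".join(chunk)


# Made-up sample data: each message is preceded by the separator line.
SAMPLE = (
    "========\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "Body one.\n"
    "========\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "\n"
    "Body two.\n"
)

messages = [
    email.message_from_string(raw)
    for raw in split_messages(io.StringIO(SAMPLE))
]
for msg in messages:
    print(msg["Subject"], "->", msg.get_payload().strip())
```

With a real file, open("spam4.txt") would take the place of the io.StringIO
object. A generator is used here so that only one message is held in memory
at a time.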

--
Sheila King
http://www.thinkspot.net/sheila/
http://www.k12groups.org/