simple text file 'parsing' question

Sun Jun 20 16:12:53 EDT 1999

On Mon, Jun 21, 1999 at 12:56:07AM +0900, Matt Gushee wrote:
> KP <terocr at mysolution.com> writes:
> 
> > Here's my dilemma: a directory filled (200+) with small emails. My goal
> > is to strip all the headers and combine them into one file. I can read
> > all the files just fine and write them all to one file, but I cannot
> > discern how to strip the headers.
> 
> I have no expertise in this area, but I've been reading the "Internet
> Data Handling" section of the Library Reference (Ch. 12 of the 1.5.2
> edition), and it seems like there are several modules that might help
> you. In particular, check out 'rfc822.'
> 
> Hope this helps.
> 
> Matt Gushee
> Portland, Maine, USA
> mgushee at havenrock.com
> 

I wrote a small piece of code that does something very close to what you are
describing. It doesn't strip the headers as such; it parses each message with
rfc822 and pulls out the fields it needs. You'll find it attached to this
message. If for some reason it doesn't come through, let me know and I'll
resend it.
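
Separately from the attachment, here is a minimal sketch of the original task itself: drop the headers from every file in a directory and append the bodies to one output file. It is written for Python 3, where the rfc822 module no longer exists and the standard email package does the parsing instead. The function names body_of and combine_bodies are made up for this illustration, and it assumes each file holds a single small plain-text message.

```python
import email
import os
from email import policy


def body_of(raw_bytes):
    """Return the text body of one message, with all headers dropped."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # For a simple text/plain message this returns the message itself;
    # for a multipart message it picks the text/plain part, if any.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    return part.get_content()


def combine_bodies(src_dir, out_path):
    """Write the body of every file in src_dir to out_path; return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue
            # Skip the output file in case it lives inside src_dir.
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue
            with open(path, "rb") as fp:
                out.write(body_of(fp.read()))
            out.write("\n")  # blank line between messages
            count += 1
    return count
```

Calling combine_bodies("mail", "combined.txt") (both paths hypothetical) would process the files in name order. Messages with attachments or HTML-only bodies would need more care than this sketch takes.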

regards,
Jeff
-- 
|| visit gfd <http://quark.newimage.com/> 
|| psa member #293 <http://www.python.org/> 
|| New Image Systems & Services, Inc. <http://www.newimage.com/>
-------------- next part --------------
#!/usr/bin/env python

import os
import dircache

# colacanister and getdate are not standard-library modules; they are
# presumably local to this project.
import colacanister
import getdate

from rfc822 import Message

_COLAROOT="/home/jam/projects/cola/cola.archive"
_COLABASEHREF="http://www.cs.helsinki.fi/%7Emjrauhal/linux/cola.archive/"

if __name__ == "__main__":
	l = dircache.listdir(_COLAROOT)
	print len(l)
	for item in l:
		p = os.path.join(_COLAROOT, item)
		if os.path.isdir(p):
			articles = dircache.listdir(p)
			for a in articles:
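				# only files named cola.* or mjr.* are articles
				if a[:5] != "cola." and a[:4] != "mjr.":
					continue

				# Message() reads just the header block, so the file
				# can be closed right away
				fp = open(os.path.join(p, a), "r")
				m = Message(fp, seekable=0)
				fp.close()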

				if not m.has_key("subject"):
					print "** message does not have subject line. skipped."
					continue

				url = os.path.join(item, a)
				print "processing '%s'" % (url),
				if colacanister.get_cola_by_archiveurl(url) is None:
					c = colacanister.colacanister()
					c["cola_from"] = m["from"]

					if m.has_key("date"):
						c["cola_dateposted"] = getdate.getdate(m["date"])

					c["cola_subject"] = m["subject"]
					c["cola_archiveurl"] = url
					c.insert()
					print "added."
				else:
					print "already archived."
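
The header-reading step of the script above can be sketched in Python 3 roughly as follows. This is an illustration under assumptions, not a port: colacanister and getdate are left out because their interfaces are not shown here, the function name summarize is hypothetical, and the standard email package stands in for rfc822. Date parsing uses email.utils.parsedate_to_datetime as a substitute for getdate.getdate, which may behave differently.

```python
from email.parser import BytesParser
from email.utils import parsedate_to_datetime


def summarize(raw_bytes):
    """Return From, Subject and Date of a message, or None if it has no Subject."""
    # headersonly=True stops parsing at the end of the header block.
    msg = BytesParser().parsebytes(raw_bytes, headersonly=True)
    if msg["subject"] is None:
        return None  # mirrors the script: messages without a subject are skipped

    posted = None
    if msg["date"] is not None:
        try:
            posted = parsedate_to_datetime(msg["date"])
        except (TypeError, ValueError):
            posted = None  # unparseable date: leave it unset

    return {"from": msg["from"], "subject": msg["subject"], "date": posted}
```

The returned dict uses plain keys rather than the cola_* names, since how those are stored is up to colacanister.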