simple text file 'parsing' question
jam
jam at newimage.com
Sun Jun 20 16:12:53 EDT 1999
On Mon, Jun 21, 1999 at 12:56:07AM +0900, Matt Gushee wrote:
> KP <terocr at mysolution.com> writes:
>
> > Here's my dilemma: a directory filled with 200+ small emails. My goal
> > is to strip all the headers and combine them into one file. I can read
> > all the files just fine and write them all to one file, but I cannot
> > discern how to strip the headers.
>
> I have no expertise in this area, but I've been reading the "Internet
> Data Handling" section of the Library Reference (Ch. 12 of the 1.5.2
> edition), and it seems like there are several modules that might help
> you. In particular, check out 'rfc822.'
>
> Hope this helps.
>
> Matt Gushee
> Portland, Maine, USA
> mgushee at havenrock.com
>
I wrote a small piece of code that does something very close to what you
are describing. It doesn't literally strip the headers; instead it parses
each message with rfc822 and works with the parsed fields. You'll find it
attached to this message. If for some reason it doesn't come through, let
me know, and I'll resend it.
Regards,
Jeff
--
|| visit gfd <http://quark.newimage.com/>
|| psa member #293 <http://www.python.org/>
|| New Image Systems & Services, Inc. <http://www.newimage.com/>
-------------- next part --------------
#!/usr/bin/env python
import os
import dircache
import mimetools
# colacanister and getdate are not standard library modules
import colacanister
import getdate
from rfc822 import Message

_COLAROOT = "/home/jam/projects/cola/cola.archive"
_COLABASEHREF = "http://www.cs.helsinki.fi/%7Emjrauhal/linux/cola.archive/"

if __name__ == "__main__":
    l = dircache.listdir(_COLAROOT)
    print len(l)
    for item in l:
        p = os.path.join(_COLAROOT, item)
        if os.path.isdir(p):
            articles = dircache.listdir(p)
            for a in articles:
                # only files named cola.* or mjr.* are articles
                if a[:5] != "cola." and a[:4] != "mjr.":
                    continue
                # Message() reads the headers as soon as it is constructed,
                # so the file can be closed right away
                fp = open(os.path.join(p, a), "r")
                m = Message(fp, seekable=0)
                fp.close()
                if not m.has_key("subject"):
                    print "** message does not have subject line. skipped."
                    continue
                url = os.path.join(item, a)
                # trailing comma: the status is printed on the same line
                print "processing '%s'" % (url),
                if colacanister.get_cola_by_archiveurl(url) is None:
                    c = colacanister.colacanister()
                    c["cola_from"] = m["from"]
                    if m.has_key("date"):
                        c["cola_dateposted"] = getdate.getdate(m["date"])
                    c["cola_subject"] = m["subject"]
                    c["cola_archiveurl"] = url
                    c.insert()
                    print "added."
                else:
                    print "already archived."
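
A note for readers on current Python: the rfc822 module used above was
removed in Python 3, and the standard library's email package is its
replacement. What follows is a minimal sketch of the original request
(strip the headers, combine the bodies into one file), not part of the
attached script. It assumes each file in the directory holds a single
plain-text message. The names strip_headers and combine_bodies are
illustrative only.

```python
import email
import os
from email import policy


def strip_headers(raw_bytes):
    """Return only the body text of one raw message (headers dropped)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Assumption: we only want the text/plain part; anything else is ignored.
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        return ""
    return body.get_content()


def combine_bodies(src_dir, out_path):
    """Write the body of every message file in src_dir to out_path.

    Returns the number of files processed.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            # Skip subdirectories and the output file itself.
            if not os.path.isfile(path):
                continue
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue
            with open(path, "rb") as fp:
                out.write(strip_headers(fp.read()))
            out.write("\n")  # blank line between messages
            count += 1
    return count
```

Calling combine_bodies("mail", "combined.txt") would be one way to use
it; both paths here are placeholders.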