Efficient scanning of mbox files
Martin Franklin
mfranklin1 at gatwick.westerngeco.slb.com
Mon Nov 11 07:25:44 EST 2002
On Mon, 2002-11-11 at 12:06, Martin Franklin wrote:
> On Mon, 2002-11-11 at 11:42, Moore, Paul wrote:
> >
> > def add_group(self, id, file):
> >     print "Opening file", file, "for group", id
> >     fp = open(file, "rb")
> >     posns = []
> >     oldpos = 0
> >     n = 0
> >     while 1:
> >         line = fp.readline()
> >         if not line: break
> >         if FROM_RE.match(line):
> >             n += 1
> >             posns.append(oldpos)
> >         oldpos = fp.tell()
> >     fp.close()
> >     posns.append(oldpos)
> >     print "Group", id, "- articles(posns) =", n, len(posns)
> >     self.groups[id] = (file, n, posns)
> >
>
> Paul,
>
>
> I ran the above example on my Python folder (7000+ messages);
> it took 12 seconds to process. Then I changed the
>
>     if FROM_RE.match(line):
>
> to
>
>     if line.startswith("From "):
>
> and got a 2 second speed-up.
>
> Then I slurped the file into a cStringIO.StringIO object and got it down
> to 5 seconds.
>
>
Another thought: if you have Python 2.2 (or greater) you can
iterate through the file directly:

    for line in fp:
        if line.startswith("From "):
            posns.append(oldpos)

Again, this should shave a second or two off the result.
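
One caveat worth sketching: on a real file object, mixing iteration with tell() can be unreliable because of read-ahead buffering (in Python 3 text mode, tell() raises an error during iteration). A minimal sketch, assuming Python 3 and a file opened in binary mode, tracks the offset by summing line lengths instead. The function name scan_offsets and the sample data are made up for illustration, not part of the original code.

```python
import io

def scan_offsets(fp):
    """Return the byte offset of every "From " line, plus the end offset."""
    posns = []
    pos = 0  # offset of the start of the line about to be examined
    for line in fp:
        if line.startswith(b"From "):
            posns.append(pos)
        pos += len(line)  # advance by the raw byte length; no tell() needed
    posns.append(pos)  # sentinel: end of the last message
    return posns

# Hypothetical two-message mbox held in memory for the example.
sample = (b"From a@example.com Mon Nov 11 2002\n"
          b"Subject: one\n\nbody\n"
          b"From b@example.com Mon Nov 11 2002\n"
          b"Subject: two\n\nbody two\n")
offsets = scan_offsets(io.BytesIO(sample))
```

The result has the same shape as posns above: one entry per message start and a final entry for the end of the data, so the message count is len(offsets) - 1.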
This is my fastest:

import time
import cStringIO

groups = {}

def add_group(id, file, fp):
    print "Opening file", file, "for group", id
    posns = []
    oldpos = 0  # offset of the start of the line being examined
    for line in fp:
        if line.startswith("From "):
            posns.append(oldpos)
        oldpos = fp.tell()
    posns.append(oldpos)  # end of the last message
    n = len(posns) - 1
    print "Group", id, "- articles(posns) =", n, len(posns)
    groups[id] = (file, n, posns)

# Slurp the whole mailbox into memory before scanning it.
cfile = cStringIO.StringIO(open("Mail/Python").read())
cfile.seek(0)
add_group(1, "/home/bpse/Mail/Python", cfile)
cfile.close()
print time.clock()
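
For completeness, here is a rough sketch of how the offset table might be used afterwards to pull out a single article. It assumes Python 3 and binary data; read_article and the tiny in-memory mailbox are hypothetical names and data for illustration only.

```python
import io

def read_article(fp, posns, i):
    """Return the raw bytes of article i (0-based) using the offset table."""
    start, end = posns[i], posns[i + 1]
    fp.seek(start)
    return fp.read(end - start)

# Hypothetical two-message mailbox.
mbox = (b"From a@example.com Mon Nov 11 07:25:44 2002\n"
        b"Subject: first\n\nhello\n"
        b"From b@example.com Mon Nov 11 07:30:00 2002\n"
        b"Subject: second\n\nworld\n")

# Offsets in the same shape add_group builds: one per "From " line,
# plus the end of the data.
posns = [0, mbox.index(b"From b@"), len(mbox)]

fp = io.BytesIO(mbox)
first_msg = read_article(fp, posns, 0)
second_msg = read_article(fp, posns, 1)
```

Because each article is bounded by posns[i] and posns[i + 1], the extra end-of-file entry means the last article needs no special case.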