Efficient scanning of mbox files

Paul Moore gustav at morpheus.demon.co.uk
Mon Nov 11 15:46:20 EST 2002


Martin Franklin <mfranklin1 at gatwick.westerngeco.slb.com> writes:

> I ran the above example on my Python folder (7000+ messages...)
> it took 12 seconds to process.  Then I changed the 
> if FROM_RE.match(line):
>
> to
>
> if line.startswith("From "):

Trouble is, I can't do this: the mbox files I've got *don't* reliably
quote body lines starting with "From " with an initial ">", so a
plain startswith test would split messages in the wrong places :-(
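
For illustration, here is a minimal sketch of the kind of stricter test that unquoted body lines call for. The FROM_RE from earlier in the thread is not shown in this message, so the pattern below (named STRICT_FROM_RE, a made-up name) is an assumption rather than the original. It assumes the usual "From sender weekday month day time year" separator layout and modern Python with bytes patterns, and real mbox files may vary.

```python
import re

# Hypothetical pattern, NOT the FROM_RE from the thread (which is not shown).
# Assumes separators look like:
#   From gustav@example.com Mon Nov 11 15:46:20 2002
STRICT_FROM_RE = re.compile(
    rb"^From \S+ +"                                   # "From " plus envelope sender
    rb"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) +"             # weekday
    rb"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +"
    rb"\d{1,2} +\d{2}:\d{2}(?::\d{2})? +"             # day and time
    rb"(?:[A-Z]{3,4} +|[+-]\d{4} +)?"                 # optional timezone
    rb"\d{4}\s*$"                                     # year, then CRLF or LF
)


def is_separator(line):
    """Return True if a raw line (bytes) looks like an mbox separator."""
    return STRICT_FROM_RE.match(line) is not None
```

A body line such as "From what I can tell..." fails this test, while it would pass a bare startswith("From ") check. The price is the regex cost on every line that begins with "From ".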

> Then I slurped the file into a cStringIO.StringIO object and got it down
> to 5 seconds.....

I'll have a look at slurping, though. I was worrying because I have a
mix of CRLF and LF line endings (some files have one, some the
other). I wasn't sure what effect that would have - but thinking about
it, as long as I read files in binary mode, seek offsets should be the
same as byte positions in the in-memory string, so things should be
OK.
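
A rough sketch of that reasoning, assuming modern Python (io.BytesIO standing in for cStringIO) and made-up helper names message_offsets and offsets_match_file. The default separator test is only a placeholder; a stricter one can be passed in.

```python
import io


def message_offsets(path, is_separator=lambda line: line.startswith(b"From ")):
    """Slurp a file in binary mode and return (data, offsets of separator lines)."""
    with open(path, "rb") as f:          # binary mode: no newline translation
        data = f.read()
    buf = io.BytesIO(data)
    offsets = []
    while True:
        pos = buf.tell()                 # byte position of the line about to be read
        line = buf.readline()            # splits on b"\n" only, so "\r" stays in the line
        if not line:
            break
        if is_separator(line):
            offsets.append(pos)
    return data, offsets


def offsets_match_file(path, data, offsets):
    """Check that each in-memory offset is also a valid seek offset in the file."""
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            if f.read(5) != data[off:off + 5]:
                return False
    return True
```

Because nothing translates line endings in binary mode, a CRLF file simply has one extra byte per line both on disk and in memory, so the two sets of positions stay in step.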

Thanks for the suggestions,
Paul

-- 
This signature intentionally left blank


