Finding messages in huge mboxes
Donn Cave
donn at u.washington.edu
Mon Feb 2 16:43:30 EST 2004
In article <401eb54c$0$315$e4fe514c at news.xs4all.nl>,
Bastiaan Welmers <haasje at welmers.net> wrote:
...
> I need to find messages in huge mbox files (50MB or more).
...
> Especially because I often need messages at the end
> of the MBOX file.
> So I tried the following (scanning messages backwards
> on found "From " lines with readline())
readline() is not your friend here. I suggest that
you read large blocks of data, like 8192 bytes for
example, and search them iteratively. Like,
    next = block.find('\nFrom ', prev + 1)
This will give you the location of each message in
the current block, so you can split the block up
into a list of messages. (There will be an extra
chunk of data at the beginning of each block, before
the first "From " - recycle that onto the end of the
next block.)
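
To make the splitting step concrete, here is a minimal sketch in
Python 3, working on bytes. The name split_block and its return shape
are my own assumptions for illustration, not part of any library. It
returns the chunk that precedes the first "From " line, plus the list
of messages found in the block.

```python
def split_block(block, sep=b"\nFrom "):
    """Split one block into (leading_chunk, messages).

    Hypothetical helper: leading_chunk is whatever precedes the first
    separator, and each message starts at its own "From " line.
    """
    starts = []
    pos = block.find(sep)
    while pos != -1:
        starts.append(pos + 1)            # skip the "\n", point at "From "
        pos = block.find(sep, pos + 1)
    if not starts:
        return block, []                  # no message boundary in this block
    leading = block[:starts[0]]           # belongs to an earlier message
    ends = starts[1:] + [len(block)]
    messages = [block[s:e] for s, e in zip(starts, ends)]
    return leading, messages
```

When scanning backwards, the leading chunk is the part to append to
the end of the next block read, which also takes care of a "From "
line that happens to straddle two blocks.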
Since file object buffering is at best useless in this
application, I would use posix.open, posix.lseek and
posix.read (the same calls are available portably as
os.open, os.lseek and os.read). Taking this approach, I
find that reading the last 10 messages in a 100 MB folder
takes 0.05 sec.
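
For illustration, the whole approach might look roughly like the
sketch below. It is written for Python 3 and is not the code behind
the timing above. The names last_messages and _read_at, and the
blocksize parameter, are assumptions of mine. It uses os.open,
os.lseek and os.read, and treats a "From " at the very start of the
file as a message boundary as well.

```python
import os

SEP = b"\nFrom "

def _read_at(fd, offset, size):
    """Read up to `size` bytes at `offset`, looping over short reads."""
    os.lseek(fd, offset, os.SEEK_SET)
    chunks = []
    while size > 0:
        chunk = os.read(fd, size)
        if not chunk:
            break
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)

def last_messages(path, count, blocksize=8192):
    """Return up to `count` messages from the end of an mbox, as bytes."""
    # O_BINARY only exists on Windows; elsewhere this adds nothing.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        pos = os.lseek(fd, 0, os.SEEK_END)   # start from the end of the file
        found = []                           # complete messages, in file order
        pending = b""                        # data before the first "From " so far
        while pos > 0 and len(found) < count:
            start = max(0, pos - blocksize)
            # Recycle the leftover chunk onto the end of this earlier block.
            data = _read_at(fd, start, pos - start) + pending
            pos = start
            cuts = []
            i = data.find(SEP)
            while i != -1:
                cuts.append(i + 1)           # message begins after the "\n"
                i = data.find(SEP, i + 1)
            if pos == 0 and data.startswith(b"From "):
                cuts.insert(0, 0)            # first message has no "\n" before it
            if cuts:
                ends = cuts[1:] + [len(data)]
                found[:0] = [data[s:e] for s, e in zip(cuts, ends)]
                pending = data[:cuts[0]]
            else:
                pending = data
        return found[-count:]
    finally:
        os.close(fd)
```

Calling last_messages("some.mbox", 10) would return the final ten
messages. The sketch rescans the recycled chunk on every pass, which
is simple but wasteful when a single message is very large.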
Donn Cave, donn at u.washington.edu