Efficient scanning of mbox files

Moore, Paul Paul.Moore at atosorigin.com
Mon Nov 11 06:42:58 EST 2002


I have an application which needs to scan through a number of large
mbox files, preprocessing them to allow later efficient access to
individual messages by index and/or message-id (I'm writing a news
server which serves up mbox archives as newsgroups).

I'm happy to do a preprocessing pass over each file, but I want this to
be reasonably efficient (it's only startup time, but I don't want it
to be a *lot* of startup time), and I don't want to store masses of
data in memory (some of these files are *big*). The memory constraint
makes storing messages in memory a non-starter. I am currently
scanning the mbox by hand, reading line by line, and storing the file
offset and length of each message (detecting "From" delimiters). I can
then use seek() and read() to grab messages on demand. By reading each
message into an email.Message instance during the preprocessing step,
I can also get the message-id for an id->index lookup dictionary, and
collect message headers for "overview" type information without
needing to go back to the message.
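
As a rough illustration of the lookup scheme described above, here is a
minimal sketch written for current Python 3 rather than the Python of
this post. The names read_message and build_index are hypothetical, and
posns is assumed to be the list of n+1 start/end offsets produced by the
scan. Only the headers are parsed, so message bodies are never kept.

```python
from email.parser import BytesParser


def read_message(fp, posns, index):
    """Return the raw bytes of message number `index` (0-based).

    `fp` must be opened in binary mode; `posns` holds n+1 offsets.
    """
    start, end = posns[index], posns[index + 1]
    fp.seek(start)
    return fp.read(end - start)


def build_index(fp, posns):
    """Build a message-id -> index dict and a per-message overview list."""
    parser = BytesParser()
    ids = {}
    overview = []
    for i in range(len(posns) - 1):
        raw = read_message(fp, posns, i)
        # headersonly=True stops parsing at the end of the header block.
        msg = parser.parsebytes(raw, headersonly=True)
        msg_id = msg["Message-ID"]
        if msg_id is not None:
            ids[msg_id.strip()] = i
        # Keep just the headers needed for overview listings.
        overview.append((msg["Subject"], msg["From"]))
    return ids, overview
```

Once the index is built, read_message(fp, posns, ids[some_id]) fetches
a single article without touching the rest of the file.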

This works OK, but I wonder if it's the most effective way. I
considered using the "mailbox" module, but couldn't find a way to get
file offsets from it. I also have to write my own "while 1: readline()"
loop, because "for line in file" reads ahead into a buffer, which makes
the results of the tell() function invalid...
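
One way to keep the for-loop form while still calling readline() for
every line is the two-argument form of iter(), which calls the function
repeatedly until it returns the sentinel. This is only a sketch in
current Python 3; scan_offsets is a hypothetical name, and the FROM_RE
pattern shown here is an assumption about what the delimiter test looks
like.

```python
import re

# Assumed delimiter test: a line beginning with "From " (with the space).
FROM_RE = re.compile(rb"From ")


def scan_offsets(fp):
    """Return the start offset of each message, plus the end-of-file offset.

    `fp` must be opened in binary mode.
    """
    posns = []
    pos = fp.tell()  # offset at which the next line starts
    # iter(callable, sentinel) calls fp.readline() until it returns b"".
    for line in iter(fp.readline, b""):
        if FROM_RE.match(line):
            posns.append(pos)
        pos = fp.tell()
    posns.append(pos)  # end of the last message
    return posns
```

Because every line comes from an explicit readline() call, tell() is
consulted at the same points as in a hand-written while loop.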

If anyone has any suggestions for improvement, I'd love to hear them.

I've considered slurping the whole file into RAM, then splitting it
up and analyzing it there, but I've not tried it yet. My instinct
is that this would complicate the code, and could have large memory
overheads (building lots of big strings) for little speed improvement.
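
For comparison, a sketch of the in-memory variant in current Python 3
might look like the following. The name scan_in_memory is hypothetical
and the "From " pattern is an assumption. Using finditer() records only
offsets, so no per-message strings are built, although the whole file
still has to fit in memory as one bytes object.

```python
import re

# Assumed delimiter: "From " at the start of any line in the buffer.
FROM_LINE = re.compile(rb"^From ", re.MULTILINE)


def scan_in_memory(data):
    """Return message start offsets plus the total length of `data`."""
    posns = [m.start() for m in FROM_LINE.finditer(data)]
    posns.append(len(data))  # end of the last message
    return posns
```

A caller would pass in the result of open(path, "rb").read(). Whether
this beats the line-by-line scan would need to be measured on real
files.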

Thanks for any help,
Paul.

PS Here's the code. It stores a tuple of filename, number of
   articles, and a list of file positions which are the start/end
   of articles (so there are n+1 entries in the list). This version
   doesn't use the email module to get message-id or header values,
   and so is simpler and more efficient than what I'll end up with...

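    def add_group(self, id, file):
        # FROM_RE is defined elsewhere in the module (not shown here);
        # presumably a compiled regex matching the "From " delimiter.
        print "Opening file", file, "for group", id
        fp = open(file, "rb")
        posns = []   # start offset of each article, then end of file
        oldpos = 0   # offset at which the current line starts
        n = 0        # number of articles found so far
        while 1:
            # Call readline() directly so that tell() stays accurate.
            line = fp.readline()
            if not line: break
            if FROM_RE.match(line):
                n += 1
                posns.append(oldpos)
            oldpos = fp.tell()
        fp.close()
        posns.append(oldpos)   # end of the last article
        print "Group", id, "- articles(posns) =", n, len(posns)
        self.groups[id] = (file, n, posns)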



