ZODB for inverted index?

vd12005 at yahoo.fr
Mon Oct 23 19:39:09 CEST 2006


While trying to write an inverted index (see:
http://en.wikipedia.org/wiki/Inverted_index), I ran out of memory with
a classic dict (I have thousands of documents and millions of terms;
stemming and other filtering are not considered yet, as I wanted to
understand how to handle GBs of text first). I found ZODB and tried to
use it a bit, but I think I must be misunderstanding how to use it, even
after reading

I would like to use it once to build my inverted index and save it to
disk via a FileStorage, and then reuse that index later from the same
FileStorage. But it looks like I am unable to reread/reload it into
memory, or I am missing how to do it...

First, each time I run the code below, it looks like everything is added
again. Is there a way to rewrite/replace it instead? And how am I
supposed to use it after the initial creation? I thought that opening
the same FileStorage would reload my object inside dbroot, but it
doesn't. I am also interested in the cache mechanisms: are they
transparent?

Or maybe you know a good tutorial for understanding ZODB?

Thanks for any help, regards.

Here is some sample code:

import sys
from BTrees.OOBTree import OOBTree
from BTrees.OIBTree import OIBTree
from persistent import Persistent
from ZODB import FileStorage, DB

class IDF2(Persistent):
    def __init__(self):
        self.docs = OIBTree()
        self.idfs = OOBTree()
    def add(self, term, fromDoc):
        self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
        if term not in self.idfs:
            self.idfs[term] = OIBTree()
        self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1
    def N(self, term):
        "total number of occurrences of 'term'"
        return sum(self.idfs[term].values())
    def n(self, term):
        "number of documents containing 'term'"
        return len(self.idfs[term])
    def ndocs(self):
        "number of documents"
        return len(self.docs)
    def __getitem__(self, key):
        return self.idfs[key]
    def iterdocs(self):
        for doc in self.docs.iterkeys():
            yield doc
    def iterterms(self):
        for term in self.idfs.iterkeys():
            yield term

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
if 'idfs' not in dbroot:
    dbroot['idfs'] = IDF2()
idfs = dbroot['idfs']

import transaction
for i, line in enumerate(open(sys.argv[1])):
    # each line is treated as a document, identified by its line number
    for word in line.split():
        idfs.add(word, i)
# Commit the change so it is actually written to the FileStorage
transaction.commit()

I was expecting:

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
print('idfs' in dbroot)

=> to return True

