ZODB for inverted index?

robert no-spam at no-spam-no-spam.com
Wed Oct 25 20:27:54 EDT 2006


vd12005 at yahoo.fr wrote:
> Hello,
> 
> While playing around writing an inverted index (see:
> http://en.wikipedia.org/wiki/Inverted_index), I ran out of memory with
> a classic dict (I have thousands of documents and millions of terms;
> stemming and other filtering are not considered yet, I wanted to
> understand how to handle GBs of text first). I found ZODB and tried to
> use it a bit, but I think I must be misunderstanding how to use it,
> even after reading
> http://www.zope.org/Wikis/ZODB/guide/node3.html...
> 
> I would like to use it once to build my inverted index and save it to
> disk via a FileStorage,
> 
> and then reuse that previously created inverted index from the
> previously created FileStorage, but it looks like I am unable to
> reread/reload it into memory, or I am missing how to do it...
> 
> Firstly, each time I run the code below, it looks like everything is
> added another time; is there a way to rewrite/replace it instead? And
> how am I supposed to use it after the initial creation? I thought that
> using the same FileStorage would reload my object into dbroot, but it
> doesn't. I am also interested in the cache mechanisms: are they
> transparent?
> 
> Or maybe you know a good tutorial for understanding ZODB?
> 
> thx for any help, regards.
> 
> Here is some sample code:
> 
> import sys
> from ZODB import FileStorage, DB
> from BTrees.OOBTree import OOBTree
> from BTrees.OIBTree import OIBTree
> from persistent import Persistent
> 
> class IDF2:
>     def __init__(self):
>         self.docs = OIBTree()
>         self.idfs = OOBTree()
>     def add(self, term, fromDoc):
>         self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
>         if not self.idfs.has_key(term):
>             self.idfs[term] = OIBTree()
>         self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1
>     def N(self, term):
>         "total number of occurrences of 'term'"
>         return sum(self.idfs[term].values())
>     def n(self, term):
>         "number of documents containing 'term'"
>         return len(self.idfs[term])
>     def ndocs(self):
>         "number of documents"
>         return len(self.docs)
>     def __getitem__(self, key):
>         return self.idfs[key]
>     def iterdocs(self):
>         for doc in self.docs.iterkeys():
>             yield doc
>     def iterterms(self):
>         for term in self.idfs.iterkeys():
>             yield term
> 
> storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
> db = DB(storage)
> conn = db.open()
> dbroot = conn.root()
> if not dbroot.has_key('idfs'):
>     dbroot['idfs'] = IDF2()
> idfs = dbroot['idfs']
> 
> import transaction
> for i, line in enumerate(open(sys.argv[1])):
>     # considering doc is linenumber...
>     for word in line.split():
>         idfs.add(word, i)
> # Commit the change
> transaction.commit()
> 
> ---
> I was expecting:
> 
> storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
> db = DB(storage)
> conn = db.open()
> dbroot = conn.root()
> print dbroot.has_key('idfs')
> 
> => to return True
> 


you have to use Persistent as a base class:

class IDF2(Persistent):
    ...

The BTrees inside IDF2 are persistent objects in their own right, so changes made through them are tracked automatically. But if you ever mutate a plain (non-persistent) attribute such as a dict or list in place, you have to tell ZODB about it yourself, either by reassigning the attribute (idfs.idfs = idfs.idfs) or by setting idfs._p_changed = 1.
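
With that fix in place, a second run against the same FileStorage should find the object again. A minimal read-back sketch (untested; it assumes the same .fs file your build script wrote):

import sys
from ZODB import FileStorage, DB

# reopen the FileStorage written by the build run
storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()

print dbroot.has_key('idfs')   # True, now that IDF2 subclasses Persistent
idfs = dbroot['idfs']
print idfs.ndocs()             # the counts survive the restart
db.close()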



But I doubt that ZODB's memory management is intelligent enough (even with some extra control) to really improve your task in terms of memory usage; you can still end up in a swapping blackout.
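
If you do stay with ZODB, the usual trick for a bulk load is to commit in batches and shrink the pickle cache between batches so the working set stays bounded. A rough sketch (the batch size is a guess, tune it; you can also pass cache_size when creating the DB):

import transaction

for i, line in enumerate(open(sys.argv[1])):
    for word in line.split():
        idfs.add(word, i)
    if i % 10000 == 0:
        transaction.commit()    # flush pending changes to the FileStorage
        conn.cacheMinimize()    # evict unmodified objects from the cache
transaction.commit()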


Other ideas:

* Often the best method to balance memory and disk in extreme index applications is to use the filesystem directly ((escaped) filenames/subdirs) for your index: you just append your pointers to the files. The OS cache is already a good, careful memory/disk balancer, and you can add some extra cache logic in your application. This works best with filesystems that deal well with small files (but maybe many of your words have long index lists anyway...). To reduce the number of files/inodes, bulk many items into one pickle/shelve/anydbm file by using sub hash keys; example: 1 million words => 10000 files x ~100 sub-entries x 10000 refs. A rough sketch follows after this list.

* a fast relational/dictionary database (MySQL); the second sketch after this list shows what the table could look like.

* Advanced memory-mapped file techniques / C-OODBMS (ObjectStore/PSE); a 64-bit OS if > 3 GB.
  (That's the technique telecoms often use to run their tables fast, but this is maybe too advanced...)
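
For the filesystem idea, a crude sketch; hex-encoding the term is just one cheap way to get a filesystem-safe name, and the directory name is made up:

import os

INDEX_DIR = "index"
if not os.path.isdir(INDEX_DIR):
    os.makedirs(INDEX_DIR)

def add(term, doc_id):
    # one file per term; appending a pointer is cheap, and the OS page
    # cache does the memory/disk balancing
    fname = os.path.join(INDEX_DIR, term.encode("hex"))
    f = open(fname, "a")
    f.write("%d\n" % doc_id)
    f.close()

def postings(term):
    # read the pointer list for 'term' back as ints
    fname = os.path.join(INDEX_DIR, term.encode("hex"))
    return [int(line) for line in open(fname)]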
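
And for the relational route, a minimal sketch of a postings table; I use the stdlib sqlite3 module here as a stand-in for MySQL, and the table/column names are made up:

import sqlite3

conn = sqlite3.connect("index.db")
conn.execute("""CREATE TABLE IF NOT EXISTS postings
                (term TEXT, doc INTEGER, freq INTEGER,
                 PRIMARY KEY (term, doc))""")

def add(term, doc_id):
    # bump the (term, doc) counter, inserting the row if it is new
    conn.execute("""INSERT OR REPLACE INTO postings VALUES (?, ?,
                    COALESCE((SELECT freq FROM postings
                              WHERE term = ? AND doc = ?), 0) + 1)""",
                 (term, doc_id, term, doc_id))

def n(term):
    # number of documents containing 'term'
    return conn.execute("SELECT COUNT(*) FROM postings WHERE term = ?",
                        (term,)).fetchone()[0]

conn.commit()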


-robert


