Databases and python
Bryan Olson
fakeaddress at nowhere.org
Tue Feb 21 00:58:29 EST 2006
Dan Stromberg wrote:
> I've been putting a little bit of time into a file indexing engine
[...]
To solve the O.P.'s first problem, the facility we need is an
efficient externally-stored multimap. A multimap is like a map,
except that each key is associated with a collection of values,
not just a single value. Obviously we could simply encode
multiple values into a single string -- and that's what the
O.P. did -- but updating large strings is inefficient.
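To make the difference concrete, here is a minimal in-memory sketch. It is not the external store discussed in this post, only an illustration of the two shapes of data. The names add_packed, add_multi, packed and multimap are made up for the example, and the newline delimiter is an assumption about how the values might be packed into one string.

```python
from collections import defaultdict

# Approach 1: pack every value for a key into one delimited string.
# Each update reads, splits, rebuilds, and rewrites the whole string,
# so the cost grows with the number of values already stored.
# (Assumes values never contain the newline delimiter.)
packed = {}

def add_packed(key, value):
    old = packed.get(key, '')
    values = old.split('\n') if old else []
    if value not in values:
        values.append(value)
    packed[key] = '\n'.join(values)

# Approach 2: a multimap, where each key owns a collection of values
# and adding one value does not rewrite the others.
multimap = defaultdict(set)

def add_multi(key, value):
    multimap[key].add(value)

for fname, word in [('a.txt', 'the'), ('b.txt', 'the'), ('a.txt', 'drum')]:
    add_packed(word, fname)
    add_multi(word, fname)
```

An externally-stored multimap gives the second shape on disk, which is what the rest of this post is after.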
Fortunately, the standard Python distribution now includes an
efficient multimap facility, though the standard library doc
does not yet say so. In the current version, the bsddb module
is built on bsddb3, whose interface (reachable as bsddb.db)
exposes far more features of the Berkeley DB library than the
documented bsddb interface does.
http://pybsddb.sourceforge.net/bsddb3.html
Sleepycat Software's Berkeley DB library supports an option
of mapping keys to multiple values:
http://sleepycat.com/docs/ref/am_conf/dup.html
Below is a simple example.
--Bryan
import bsddb


def add_words_from_file(index, fname, word_iterator):
    """ Pass the open-for-write bsddb database (B-Tree or hash,
        opened with DB_DUP set), a filename, and a list (or any
        iterable) of the words in the file.
    """
    s = set()
    for word in word_iterator:
        # Store each (word, fname) pair only once per file.
        if word not in s:
            s.add(word)
            index.put(word, fname)
    index.sync()
    print


def lookup(index, word):
    """ Pass the index (as built with add_words_from_file) and a
        word to look up. Returns list of files containing the word.
    """
    l = []
    cursor = index.cursor()
    # set() positions the cursor on the first entry for the key and
    # returns None if the key is absent; next_dup() then steps
    # through the remaining entries stored under the same key.
    item = cursor.set(word)
    while item is not None:
        l.append(item[1])
        item = cursor.next_dup()
    cursor.close()
    return l


def test():
    env = bsddb.db.DBEnv()
    env.open('.', bsddb.db.DB_CREATE | bsddb.db.DB_INIT_MPOOL)
    db = bsddb.db.DB(env)
    # DB_DUP must be set before open() to allow duplicate keys.
    db.set_flags(bsddb.db.DB_DUP)
    db.open(
        'junktest.bdb',
        None,
        bsddb.db.DB_HASH,
        bsddb.db.DB_CREATE | bsddb.db.DB_TRUNCATE)

    data = [
        ('bryfile.txt', 'nor heed the rumble of a distant drum'),
        ('junkfile.txt', 'this is the beast, the beast so sly'),
        ('word file.txt', 'is this the way it always is here in Baltimore')
    ]
    for (fname, text) in data:
        words = text.split()
        add_words_from_file(db, fname, words)

    for word in ['is', 'the', 'heed', 'this', 'way']:
        print '"%s" is in files: %s' % (word, lookup(db, word))


test()
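
For readers without Berkeley DB at hand, the lookup pattern above (position on the first entry for a key, then step through its duplicates) can be modelled in memory. This is only a rough sketch of the traversal, not of how Berkeley DB stores anything: the class SortedDupIndex is hypothetical, it keeps everything in one sorted Python list, and it returns values in sorted order, which a real DB_DUP database need not do.

```python
import bisect


class SortedDupIndex(object):
    """Toy stand-in for a store that allows duplicate keys."""

    def __init__(self):
        self.pairs = []  # sorted list of (key, value) tuples

    def put(self, key, value):
        # Keep the list sorted so all pairs for a key are adjacent.
        bisect.insort(self.pairs, (key, value))

    def lookup(self, key):
        # (key,) sorts before every (key, value) pair, so this lands
        # on the first pair for the key, much like cursor.set().
        i = bisect.bisect_left(self.pairs, (key,))
        found = []
        # Walk forward while the key still matches, like next_dup().
        while i < len(self.pairs) and self.pairs[i][0] == key:
            found.append(self.pairs[i][1])
            i += 1
        return found
```

A lookup of a missing key simply returns an empty list, which matches what lookup() above returns when cursor.set() finds nothing.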