Databases and python
Bryan Olson
fakeaddress at nowhere.org
Thu Feb 16 08:45:28 EST 2006
Dan Stromberg wrote:
> I've been putting a little bit of time into a file indexing engine
[...]
> So far, I've been taking the approach of using a single-table database
> like gdbm or dbhash [...] and making each entry keyed by
> a word, and under the word in the database is a null terminated list of
> filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
>
[...]
> the program just isn't performing like I'd like it to.
>
> And I think it's because despite the caching and minimal representation
> conversion, it's still just too slow converting linear lists to arrays
> back and forth and back and forth.
I think I follow. The straightforward method of building the
list of files associated with a word is order n**2 time. On each
new entry, you look up the entire string, append one file-id to
it, and write the new version back.
> So this leads me to wonder - is there a python database interface that
> would allow me to define a -lot- of tables? Like, each word becomes a
> table, and then the fields in that table are just the filenames that
> contained that word. That way adding filenames to a word shouldn't bog
> down much at all.
Well, you could use simple files instead of fancy database tables.
Below is a demo of an alternate technique that uses bsddb B-Trees,
and puts both the word and the file-id in the key. I don't know
how efficient it is for real data, but at least the time won't grow
as Theta(n**2).
--Bryan
import bsddb
import urllib
def add_words_from_file(index, fname, word_iterator):
""" Pass the open-for-write bsddb B-Tree, a filename, and a list
(or any interable) of the words in the file.
"""
fname = urllib.quote_plus(fname)
s = set()
for word in word_iterator:
if word not in s:
s.update(word)
key = '%s %s' % (urllib.quote_plus(word), fname)
index[key] = ''
index.sync()
def lookup(index, word):
""" Pass the B-Tree (as built with add_words_from_file) and a
word to look up. Returns list of files containing the word.
"""
word = urllib.quote_plus(word)
fname_list = []
try:
current = index.set_location(word)
while True:
(key, _) = current
(w, fname) = key.split()
if w != word:
break
fname_list.append(urllib.unquote_plus(fname))
current = index.next()
except KeyError:
pass
return fname_list
def test():
index = bsddb.btopen('junktest.bdb', 'n')
data =[
('bryfile.txt', 'nor heed the rumble of a distant drum'),
('junkfile.txt', 'this is the beast, the beast so sly'),
('word file.txt', 'is this the way it always is here in Baltimore')
]
for (fname, text) in data:
words = text.split()
add_words_from_file(index, fname, words)
for word in ['is', 'the', 'heed', 'this', 'way']:
print '"%s" is in files: %s' % (word, lookup(index, word))
test()
More information about the Python-list
mailing list