'database' search

Sun Apr 23 23:29:52 EDT 2000

[Bjorn Pettersen, wants to do word completion]
> ...
> The smallest datastructure I've found so far, is a 'prefix dictionary',
> so if e.g. "aardwark" was in the database, the 'prefix dictionary' would
> contain:
>
>  pfdict = {
>   'a':        ['aardwark', ...],
>   'aa':       ['aardwark', ...],
>   'aar':      ['aardwark'],
>   'aard':     ['aardwark'],
>   'aardw':    ['aardwark'],
>   'aardwa':   ['aardwark'],
>   'aardwar':  ['aardwark'],
>   'aardwark': ['aardwark']
>  }
>
> The maximum size for this would be n * avg len of item, with the real
> size being much(?) smaller with sufficient prefix sharing...

Before going nuts with variations on prefix trees, try the attached.  This
uses a sorted list of all words, and does a binary search to find the ones
starting with a given prefix -- dirt simple.  The advantage to that is that
it works <wink>.

BTW, the amount of compression you can get from a tree (or trie) approach
depends on the language and the set of words, but when coding in Python
you're almost certainly going to consume more storage with all the
supporting hair (ancillary dicts and lists) than you'll save by sharing a
few initial bytes in the words.  Trees are used for speed reasons instead,
in intense searching applications (which yours is not).

> I have a relatively stable database of between 1500 and 5000 items

So even if they average 10 letters/item, this won't consume more than 50Kb.

> new insertions are probably going to be less than 100 during the
> lifetime of the database)

In 1.5.2 and 1.6, appending a new word to a sorted list and then sorting
again takes time linear in the # of words -- it's recognized as a special
case (of another special case <wink>), and doesn't do a full sort.  So the
method below is more efficient than you might think.

when-your-database-grows-100x-larger-ask-again-ly y'rs  - tim

from bisect import bisect

class Completer:
    def __init__(self, wordlist):
        """wordlist -> a Completer for the list of words."""
        wordlist.sort()
        self.words = wordlist

    def complete(self, prefix):
        """prefix -> alphabetical list of all words starting w/ prefix."""
        words = self.words
        # Find i, j s.t. all in words[i:j] start w/ prefix.
        i = bisect(words, prefix)
        # Subtle:  if prefix is in the list, i points *after* it.
        # Fiddle i to point before it.  The bisect module should
        # really have more functions to control what happens here.
        while i > 0 and words[i-1] == prefix:
            i = i-1
        j = i
        n = len(words)
        # note that .startswith() is a 1.6 feature
        while j < n and words[j].startswith(prefix):
            j = j+1
        return words[i:j]

    def addword(self, word):
        """Add a word to the list of words."""
        self.words.append(word)
        self.words.sort()