Storing pairs of (int, int) in a database : which db to choose ?

Skip Montanaro skip at pobox.com
Tue Dec 23 12:00:12 EST 2003


    andrei> So my problem now is what should I use to store my data and
    andrei> still be able to retrieve all messages for a given word very
    andrei> quickly.

There's the rub.  PySQLlite consumes a lot of disk space presumably to
support efficient access to the data (hashing keys, etc).  If you want
compact storage you aren't likely to do much better than

    wordid1:msgid1
    wordid2:msgid2
    ...

Unfortunately, searching that will probably be a bit slow.

If the word ids and msg ids have fixed maximum sizes you might try zero- or
space- padding them so your records are of fixed length.  You can then store
the file sorted by wordid and do a binary search in the file looking records
of interest.

Another possibility, assuming your database keys are integers, is to try the
bsddb recno format database.  I've never used it before (not as much call
for a database file with integer keys), but it seems to be *relatively*
space-efficient.  I believe it doesn't actually need to store the keys,
since they are implicit.

I built a recno db file containing the pickle of the squares of the numbers
in range(1, 100000):

    >>> for n in range(1,100000):
    ...   db1[n] = pickle.dumps(n*n)
    ... 

The total string length of the pickled squares was 1.3MB.  The file size was
2.1MB.

Skip





More information about the Python-list mailing list