Storing pairs of (int, int) in a database: which db to choose?

Stormbringer andreif at mail.dntis.ro
Wed Dec 24 05:07:10 EST 2003


Skip Montanaro <skip at pobox.com> wrote in message news:<mailman.81.1072198845.684.python-list at python.org>...
> andrei> So my problem now is what should I use to store my data and
> andrei> still be able to retrieve all messages for a given word very
> andrei> quickly.
> 
> There's the rub.  PySQLite consumes a lot of disk space, presumably to
> support efficient access to the data (hashing keys, etc).  If you want
> compact storage you aren't likely to do much better than
> 
>     wordid1:msgid1
>     wordid2:msgid2
>     ...
> 
> Unfortunately, searching that will probably be a bit slow.

Well - I want to use this storage to speed up my searches, so searching
really does need to be fast.

I wasn't after the most compact storage possible, but 88 bytes to store
a couple of integers is just way too much.

Another reason I wanted to use a database like SQLite is that it
offers some safety guarantees. If something goes wrong during an update
(like running out of disk space on the end-user's machine), the data
stays safe, i.e. the latest transaction is simply not committed and
that's it.
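
For illustration, that guarantee looks roughly like this with the
sqlite3 module (pysqlite's modern descendant); the table name and
schema here are made up for the example:

    import sqlite3

    conn = sqlite3.connect("index.db")
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS postings"
                     " (word_id INTEGER, msg_id INTEGER)")

    pairs = [(1, 100), (1, 101), (2, 100)]  # example (word, msg) data
    try:
        with conn:  # one transaction: commits only on success
            conn.executemany("INSERT INTO postings VALUES (?, ?)", pairs)
    except sqlite3.Error:
        # On failure (disk full, etc.) the transaction is rolled back
        # and the file keeps its previous consistent state.
        pass
    conn.close()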

I wrote a similar version of what I wanted a couple of years ago, not
using a database, just C/C++ code (I think a similar method was
suggested in this thread): I was processing messages in batches, so for
every 10000 messages or so I would build in memory a list of words and
the messages they occur in, and for each word I would write all the
messages containing it as consecutive entries in a file. Then I would
update the global word index (kept in another file), which for each
word kept a linked list of zones in the postings file where the msgIds
containing that word were stored.
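
In Python the scheme would look roughly like this (my own sketch, not
the original C/C++ code; the on-disk layout, the in-memory index, and
all names are assumptions for illustration):

    import struct
    from collections import defaultdict

    POSTINGS = "postings.bin"  # consecutive msg_ids, one zone per word per batch
    INDEX = {}                 # word -> offset of newest zone (in the original
                               # this index lived in a second file)
    NIL = 0xFFFFFFFF           # sentinel: no previous zone

    def flush_batch(messages):
        """messages: iterable of (msg_id, list_of_words)."""
        occurrences = defaultdict(list)
        for msg_id, words in messages:
            for word in words:
                occurrences[word].append(msg_id)
        with open(POSTINGS, "ab") as f:
            for word, msg_ids in occurrences.items():
                offset = f.tell()
                prev = INDEX.get(word, NIL)
                # Zone = link to previous zone + count, then the msg_ids.
                f.write(struct.pack("<II", prev, len(msg_ids)))
                f.write(struct.pack("<%dI" % len(msg_ids), *msg_ids))
                INDEX[word] = offset  # new head of this word's zone list

    def lookup(word):
        """Walk the linked list of zones, yielding every msg_id for word."""
        offset = INDEX.get(word, NIL)
        with open(POSTINGS, "rb") as f:
            while offset != NIL:
                f.seek(offset)
                prev, count = struct.unpack("<II", f.read(8))
                for msg_id in struct.unpack("<%dI" % count, f.read(4 * count)):
                    yield msg_id
                offset = prev

A lookup is then one seek per batch the word appeared in, with the
msgIds read sequentially inside each zone.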

It worked pretty well, but that was for a server I controlled, i.e. at
all times I was sure I had enough space and I was the one operating it.
For an application deployed to end-users I need more safety. That is
why I am playing with SQL.

> If the word ids and msg ids have fixed maximum sizes you might try zero-
> or space-padding them so your records are of fixed length.  You can then
> store the file sorted by wordid and do a binary search in the file looking
> for records of interest.

That would still be too slow for searching, and search is the main
reason for this structure.
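
For reference, a sketch of the fixed-width idea Skip describes (the
8-byte record layout is just an assumption, but it is at least an
order of magnitude smaller than the 88 bytes per pair I measured):

    import struct

    REC = struct.Struct("<II")  # fixed 8-byte record: (word_id, msg_id)

    def find_first(path, word_id):
        """Binary search a file of records sorted by word_id; return the
        offset of the first record for word_id, or -1 if absent."""
        with open(path, "rb") as f:
            f.seek(0, 2)
            lo, hi = 0, f.tell() // REC.size
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid * REC.size)
                w, _ = REC.unpack(f.read(REC.size))
                if w < word_id:
                    lo = mid + 1
                else:
                    hi = mid
            f.seek(lo * REC.size)
            data = f.read(REC.size)
            if len(data) == REC.size and REC.unpack(data)[0] == word_id:
                return lo * REC.size
            return -1

Each lookup costs O(log n) seeks, after which the remaining records
for the same wordid can be read sequentially.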
 
> Another possibility, assuming your database keys are integers, is to try the
> bsddb recno format database.  I've never used it before (not as much call
> for a database file with integer keys), but it seems to be *relatively*
> space-efficient.  I believe it doesn't actually need to store the keys,
> since they are implicit.
> 
> I built a recno db file containing the pickle of the squares of the numbers
> in range(1, 100000):
> 
>     >>> for n in range(1,100000):
>     ...   db1[n] = pickle.dumps(n*n)
>     ... 
> 
> The total string length of the pickled squares was 1.3MB.  The file size was
> 2.1MB.

Interesting.
Fortunately I found lupy, which suits me, but if I had not I might
have been tempted to experiment with that.
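
For the curious, such an experiment might have looked like this (a
sketch against the Python 2 bsddb module that shipped with the
interpreter back then; it was removed in Python 3, where the
third-party bsddb3 package exposes a similar interface):

    import pickle
    import bsddb

    # recno databases use implicit integer keys, so only the values
    # are actually stored on disk.
    db = bsddb.rnopen("squares.db")
    for n in range(1, 100000):
        db[n] = pickle.dumps(n * n)
    db.close()

    db = bsddb.rnopen("squares.db")
    print pickle.loads(db[12])  # -> 144
    db.close()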

Andrei