Storing pairs of (int, int) in a database : which db to choose ?
Stormbringer
andreif at mail.dntis.ro
Wed Dec 24 05:07:10 EST 2003
Skip Montanaro <skip at pobox.com> wrote in message news:<mailman.81.1072198845.684.python-list at python.org>...
> andrei> So my problem now is what should I use to store my data and
> andrei> still be able to retrieve all messages for a given word very
> andrei> quickly.
>
> There's the rub. PySQLite consumes a lot of disk space presumably to
> support efficient access to the data (hashing keys, etc). If you want
> compact storage you aren't likely to do much better than
>
> wordid1:msgid1
> wordid2:msgid2
> ...
>
> Unfortunately, searching that will probably be a bit slow.
Well - I want to use this storage to speed up my searches, so searching
really needs to be fast.
I don't need the most compact storage, but 88 bytes to store a
couple of integers is just way too much.
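For comparison, a minimal sketch (with made-up sample data) of how small such records can be: packing each (wordid, msgid) pair as two 32-bit integers with the struct module takes 8 bytes per record instead of ~88.

```python
import struct

# Each (wordid, msgid) pair packed as two little-endian 32-bit ints.
RECORD = struct.Struct("<ii")

pairs = [(1, 100), (1, 101), (2, 100)]  # hypothetical sample data

packed = b"".join(RECORD.pack(w, m) for w, m in pairs)
assert len(packed) == 8 * len(pairs)  # 8 bytes per record

# Unpacking the records back:
unpacked = [RECORD.unpack_from(packed, i * RECORD.size)
            for i in range(len(packed) // RECORD.size)]
assert unpacked == pairs
```

Of course this raw format gives none of the indexing or safety that a database provides; it only bounds the storage cost from below.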
Another reason I wanted to use a database like sqlite is that it
offers some safety guarantees. If something goes wrong during an update
(like no disk space left on the end-user's disk), the data still
stays safe, i.e. the latest transaction is not committed and that's it.
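That safety property is easy to demonstrate with Python's sqlite3 module; here is a small sketch (the table name and the simulated failure are made up for illustration):

```python
import sqlite3

# In-memory database for illustration; a real app would use a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrences (wordid INTEGER, msgid INTEGER)")
conn.execute("INSERT INTO occurrences VALUES (1, 100)")
conn.commit()  # this record is now durable

try:
    conn.execute("INSERT INTO occurrences VALUES (2, 200)")
    # Simulate a failure (e.g. disk full) before the commit happens:
    raise IOError("simulated disk-full error")
except IOError:
    conn.rollback()  # the uncommitted insert is discarded

rows = conn.execute("SELECT wordid, msgid FROM occurrences").fetchall()
assert rows == [(1, 100)]  # the failed transaction left no trace
```

The committed data survives; the interrupted transaction simply never happened.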
I wrote a similar version of what I wanted a couple of years ago
without a database, just C/C++ code, something like this (I think a
similar method was suggested in this thread): I processed
messages in batches, so for every 10000 messages or so I would build
an in-memory list of words and the messages they occur in, and for
each word I would write all messages containing it as consecutive
entries in a file. Then I would update the global word index (kept in
another file), which for each word kept a linked list of zones in the
other file where the msgIds containing that word were stored.
It worked pretty well, but this was for a server I controlled, i.e. at
all times I was sure I had enough space and I was monitoring it. For
an application to be deployed to end-users I need more safety. That is
why I am playing with sql.
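The batch scheme above can be sketched in a few lines of Python (all names here are made up; an in-memory list stands in for the postings file, and a dict of offsets stands in for the linked-list index):

```python
from collections import defaultdict

def build_batch_index(messages):
    """messages: iterable of (msgid, text) pairs.
    Returns (postings, index) where postings is a flat list of msgIds
    and index maps word -> (offset, count) into postings."""
    occurrences = defaultdict(list)
    for msgid, text in messages:
        for word in set(text.split()):
            occurrences[word].append(msgid)

    postings = []   # stands in for the postings file
    index = {}      # stands in for the global word index file
    for word, msgids in occurrences.items():
        index[word] = (len(postings), len(msgids))
        postings.extend(msgids)  # consecutive entries for this word
    return postings, index

def lookup(word, postings, index):
    """All msgIds containing `word`, read from one contiguous zone."""
    offset, count = index.get(word, (0, 0))
    return postings[offset:offset + count]

batch = [(1, "python database index"), (2, "python storage")]
postings, index = build_batch_index(batch)
assert lookup("python", postings, index) == [1, 2]
```

The appeal of the layout is that a lookup touches exactly one contiguous region per word; the fragility is that a crash while rewriting either file can corrupt both, which is exactly the gap a transactional database closes.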
> If the word ids and msg ids have fixed maximum sizes you might try zero- or
> space- padding them so your records are of fixed length. You can then store
> the file sorted by wordid and do a binary search in the file looking records
> of interest.
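For reference, the fixed-length-record idea sketched above looks roughly like this in Python (a bytes object stands in for the file; with a real file, seek() and read() would replace the slicing):

```python
import struct

# 8-byte records of (wordid, msgid), kept sorted by wordid.
REC = struct.Struct("<ii")

records = sorted([(3, 300), (1, 100), (2, 200), (2, 201)])
data = b"".join(REC.pack(*r) for r in records)

def find_first(data, wordid):
    """Binary search: index of the first record with this wordid."""
    lo, hi = 0, len(data) // REC.size
    while lo < hi:
        mid = (lo + hi) // 2
        if REC.unpack_from(data, mid * REC.size)[0] < wordid:
            lo = mid + 1
        else:
            hi = mid
    return lo

i = find_first(data, 2)
assert REC.unpack_from(data, i * REC.size) == (2, 200)
```

Each lookup costs O(log n) record reads, which is why it is slower than a hashed or indexed store, though far faster than a linear scan.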
That would still be too slow for a search, and search is the main
reason for this structure.
> Another possibility, assuming your database keys are integers, is to try the
> bsddb recno format database. I've never used it before (not as much call
> for a database file with integer keys), but it seems to be *relatively*
> space-efficient. I believe it doesn't actually need to store the keys,
> since they are implicit.
>
> I built a recno db file containing the pickle of the squares of the numbers
> in range(1, 100000):
>
> >>> for n in range(1,100000):
> ... db1[n] = pickle.dumps(n*n)
> ...
>
> The total string length of the pickled squares was 1.3MB. The file size was
> 2.1MB.
Interesting.
Fortunately I found lupy, which suits me, but if I hadn't I might
have been tempted to experiment with that.
Andrei