Questions about bsddb

Nick Vatamaniuc vatamane at gmail.com
Wed May 9 10:24:08 EDT 2007


On May 9, 8:23 am, sinoo... at yahoo.com wrote:
> Hello,
>
> I need to build a large database that has roughly 500,000 keys, and a
> variable amount of data for each key. The data for each key could
> range from 100 bytes to megabytes. The data under each will grow with
> time as the database is being built.  Are there some flags I should be
> setting when opening the database to handle large amounts of data per
> key? Is hash or binary tree recommended for this type of job? I'll be
> building the database from scratch, so lots of lookups and appending
> of data. Testing is showing bt to be faster, so I'm leaning towards
> that. The estimated build time is around 10~12 hours on my machine, so
> I want to make sure that something won't get messed up in the 10th
> hour.
>
> TIA,
> JM

JM,

How will you access your data?
If you often access the keys sequentially, then bt is better.

In general, the rule is:

1) for small data sets, either one works
2) for larger data sets, use bt. Also, bt is good for sequential key
access.
3) for really huge data sets, where the metadata of the btree
cannot even fit in the cache, hash will be better. The reasoning
is that once the metadata is larger than the cache, every lookup costs
at least one I/O operation either way, but with a btree there might be
multiple I/Os just to find the key, because the tree has multiple
levels and not all of them are in memory.
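
To put rough numbers on rule 3, here is a back-of-the-envelope sketch.
It is not bsddb code: the function names, the fanout of 100 keys per
page, and the cache sizes are all assumptions picked for illustration,
and it assumes the cache fills from the root downward. It only counts
how many btree levels fall outside the cache for a given key count.

```python
def level_page_counts(num_keys, fanout):
    """Pages per btree level, root first, for an idealized packed tree.

    `fanout` (entries per page) is a hypothetical figure, not a bsddb default.
    """
    pages = -(-num_keys // fanout)      # ceiling division: leaf pages
    counts = [pages]
    while pages > 1:
        pages = -(-pages // fanout)     # pages needed to index the level below
        counts.append(pages)
    counts.reverse()
    return counts


def uncached_reads_per_lookup(num_keys, fanout, cache_pages):
    """Levels that do not fit in the cache, assuming it fills top-down."""
    counts = level_page_counts(num_keys, fanout)
    used = 0
    cached_levels = 0
    for pages in counts:
        if used + pages > cache_pages:
            break                       # this level and all below hit the disk
        used += pages
        cached_levels += 1
    return len(counts) - cached_levels


if __name__ == "__main__":
    for cache in (10, 51, 5051):
        reads = uncached_reads_per_lookup(500000, 100, cache)
        print("cache of %d pages: %d disk read(s) per lookup" % (cache, reads))
```

Under these assumptions, a lookup that misses two or more levels is the
case where hash starts to look better, since it needs roughly one page
read per lookup.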

Also consider this:
I had a somewhat similar problem and ended up using MySQL as a
backend. In my application, the data was actually composed of a number
of fields, and I wanted to select based on some of those fields as well
(i.e. select based on part of the value, not just the keys), and thus
needed to have indices for those fields. The result was that my disk
I/O was saturated (i.e. the application was running as fast as the hard
drive would let it), so it was good enough for me.
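
A minimal sketch of the "indices on fields" idea follows. It uses
Python's built-in sqlite3 module as a stand-in for MySQL so that it
runs without a server. The table name, column names, and sample rows
are all hypothetical and do not come from the application described
above.

```python
import sqlite3

# In-memory database for illustration; a real build would use a file or a server.
conn = sqlite3.connect(":memory:")

# Each record has a key plus separate fields, instead of one opaque value.
conn.execute(
    "CREATE TABLE records ("
    " rec_key TEXT PRIMARY KEY,"
    " owner TEXT,"
    " size INTEGER,"
    " payload BLOB)"
)

# Secondary index so selecting by a field does not scan the whole table.
conn.execute("CREATE INDEX idx_records_owner ON records (owner)")

rows = [
    ("k1", "alice", 100, b"aaa"),
    ("k2", "bob", 2048, b"bbb"),
    ("k3", "alice", 512, b"ccc"),
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Select on part of the value (the owner field), not on the key.
cursor = conn.execute(
    "SELECT rec_key FROM records WHERE owner = ? ORDER BY rec_key", ("alice",)
)
keys_for_alice = [row[0] for row in cursor]
print(keys_for_alice)
```

With a plain key/value store such as bsddb, the same query would mean
maintaining a second database that maps each field value back to keys.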

Hope this helps,
-Nick Vatamaniuc



