Dictionary or Database—Please advise
mrkafk at gmail.com
Fri Feb 26 19:20:19 CET 2010
> Shelve looks like an interesting option, but what might pose an issue
> is that I'm reading the data from a disk instead of memory. I didn't
> mention this in my original post, but I was hoping that by using a
> database it would be more memory efficient in storing data in RAM so I
> wouldn't have to read from (or swap to/from) disk. Would using the
> shelve package make reading/writing data from disk faster since it is
> in a binary format?
Read the docs:
"class shelve.BsdDbShelf(dict[, protocol=None[, writeback=False]])¶
A subclass of Shelf which exposes first(), next(), previous(),
last() and set_location() which are available in the bsddb module but
not in other database modules. The dict object passed to the constructor
must support those methods. This is generally accomplished by calling
one of bsddb.hashopen(), bsddb.btopen() or bsddb.rnopen(). The optional
protocol and writeback parameters have the same interpretation as for
the Shelf class."
Apparently using shelve internally gives you option of using bsddb,
which is good news: bsddb is B-tree DB, which is highly efficient for
finding keys. I would recommend bsddb.btopen(), as it creates B-tree DB
(perhaps other implementations, like anydb or hash db are good as well,
but I personally didn't test them out).
I can't say for Berkeley DB implementation, but in general B-tree
algorithm has O(log2 n) complexity for finding keys, which roughly means
that if you need to find particular key in a db of 1 million keys,
you'll probably need ~20 disk accesses (or even less if some keys looked
at in the process of search happen to be in the same disk sectors). So
yes, it's highly efficient.
Having said that, remember that disk is many orders of magnitude slower
than RAM, so it's no free lunch.. Nothing will beat memory-based data
structure when it comes to speed (well new flash or hybrid disks perhaps
could significantly improve in comparison to current mainstream
mechanical-platter disks? there are some hyper-fast storage hardware
companies out there, although they tend to charge arm and leg for their
stuff for now).
Caveat: Berkeley DB is dual-licensed -- if you're using it for
commercial work, it might be that you'd need to buy a license for it.
Although I have had no experience with this really, if someone here did
perhaps they will shed some light on it?
More information about the Python-list