[sapug] Large dictionaries
Chris Foote
chris at inetd.com.au
Thu May 11 15:47:18 CEST 2006
On Thu, 11 May 2006, Daryl Tester wrote:
> Chris Foote wrote:
>
>> I have the need to store a large (10M) number of keys in a hash table,
>> based on a tuple of (long_integer, integer). The standard python
>> dictionary works well for small numbers of keys, but starts to perform
>> badly for me inserting roughly 5M keys:
>
> Is this a startup operation, or are you expecting to insert that
> many records that often?
At startup, and on receiving a signal every few hours.
> I was going to suggest using cPickle to
> load a precomputed dictionary, but a quick test shows the performance
> is probably worse.
You'd run out of RAM pretty quickly parsing it as well :-)
> I haven't looked at the modern dictionary code, but seem to recall
> that 1.5's code would expand the internal hash table and perform
> rebalancing operations when the collision rate got too high (the
> rebalancing being the heavy cost). I thought there might be
> parameter to {}.__init__() to predefine the hash table size,
> but alas this seems to have put the kibosh on that:
>
> <http://mail.python.org/pipermail/python-bugs-list/2002-June/012091.html>
Ah, that shed's some light as to why there isn't an 'nelems' argument
available :-(
> I find writing new types for Python to be pretty straightforward;
> you could always find the (pre-) hash library of your dreams and
> bolt that to a new type (or class).
Yes, that sounds like the way to go, but I can't believe that someone
hasn't written some already.
>> p.s. Disk-based DBs are out of the question because most key lookups
>> will result in a miss, and lookup time is critical for this application.
>
> A python interface to cdb perhaps? <http://pilcrow.madison.wi.us/#pycdb>
>
> Given enough memory you could probably cache the entire file in RAM.
I actually tried the dbdbm module with the 'None' file argument, but it
stored the data as a temporary file on disk. Very, very slow, presumably
due to flushing for every k+v pair, and Sleepycat DB 3 isn't so lightweight
anymore.
I imagine cdb might be faster, so I'll have to try it on a RAM disk.
--
Chris Foote <chris at inetd.com.au>
Inetd Pty Ltd T/A HostExpress
Web: http://www.hostexpress.com.au
Blog: http://www.hostexpress.com.au/drupal/chris
Phone: (08) 8410 4566
More information about the sapug
mailing list