[sapug] Large dictionaries

Chris Foote chris at inetd.com.au
Thu May 11 15:47:18 CEST 2006


On Thu, 11 May 2006, Daryl Tester wrote:

> Chris Foote wrote:
>
>> I have the need to store a large (10M) number of keys in a hash table,
>> based on a tuple of (long_integer, integer).  The standard python
>> dictionary works well for small numbers of keys, but starts to perform
>> badly for me inserting roughly 5M keys:
>
> Is this a startup operation, or are you expecting to insert that
> many records that often?

At startup, and on receiving a signal every few hours.

> I was going to suggest using cPickle to
> load a precomputed dictionary, but a quick test shows the performance
> is probably worse.

You'd run out of RAM pretty quickly parsing it as well :-)

> I haven't looked at the modern dictionary code, but seem to recall
> that 1.5's code would expand the internal hash table  and perform
> rebalancing operations when the collision rate got too high (the
> rebalancing being the heavy cost).  I thought there might be
> parameter to {}.__init__() to predefine the hash table size,
> but alas this seems to have put the kibosh on that:
>
> <http://mail.python.org/pipermail/python-bugs-list/2002-June/012091.html>

Ah, that shed's some light as to why there isn't an 'nelems' argument
available :-(

> I find writing new types for Python to be pretty straightforward;
> you could always find the (pre-) hash library of your dreams and
> bolt that to a new type (or class).

Yes, that sounds like the way to go, but I can't believe that someone
hasn't written some already.

>> p.s. Disk-based DBs are out of the question because most key lookups
>> will result in a miss, and lookup time is critical for this application.
>
> A python interface to cdb perhaps?  <http://pilcrow.madison.wi.us/#pycdb>
>
> Given enough memory you could probably cache the entire file in RAM.

I actually tried the dbdbm module with the 'None' file argument, but it
stored the data as a temporary file on disk.  Very, very slow, presumably
due to flushing for every k+v pair, and Sleepycat DB 3 isn't so lightweight
anymore.

I imagine cdb might be faster, so I'll have to try it on a RAM disk.

-- 
Chris Foote <chris at inetd.com.au>
Inetd Pty Ltd T/A HostExpress
Web:   http://www.hostexpress.com.au
Blog:  http://www.hostexpress.com.au/drupal/chris
Phone: (08) 8410 4566


More information about the sapug mailing list