[sapug] Large dictionaries

Michael Cohen michael.cohen at netspeed.com.au
Thu May 11 10:15:57 CEST 2006


Hi Chris,
  You probably need to balance the hash table. Normally a dict works by hashing
  the key into an array of a certain size. The array is indexed by the hash
  value and a linked list of values is attached to each slot in the array.

  If you load the table too much, there will be many hash collisions, which
  will cause the linked lists to get very long. This will slow down access time
  in an exponential fasion.

  Apart from not using a tuple for a dict key (im not sure how efficient the
  algorithm of getting a hash value from a python tuple struct is). I would
  recommend to split the single dict into a dict of dicts. This will have the
  effect to spreading the load on the hash table:

 If your key is:

 key = (first, second)

#instead of something like this:
 hash = {}
 hash[key] = value

#use (of the top of my head not necessarily correct)
 hash = {}

#insertion:
 try:
   tmp = hash[first]
 except KeyError:
   tmp = {}
   hash[first] = tmp

 tmp[second] = value
 
#retrieval:
  value = hash[first][second]
 
This might be faster because the number of elements stored in each hash table
is smaller. Let us all know how you go.

Michael.
 
On Thu, May 11, 2006 at 04:26:34PM +0930, Chris Foote wrote:
> Hi all.
> 
> I have the need to store a large (10M) number of keys in a hash table,
> based on a tuple of (long_integer, integer).  The standard python
> dictionary works well for small numbers of keys, but starts to perform
> badly for me inserting roughly 5M keys:
> 
> # keys   dictionary  metakit   (both using psyco)
> ------   ----------  -------
> 1M            8.8s     22.2s
> 2M           24.0s     43.7s
> 5M          115.3s    105.4s
> 
> Does anyone know of a fast hash module which is more optimal for
> large datasets ?
> 
> p.s. Disk-based DBs are out of the question because most key lookups
> will result in a miss, and lookup time is critical for this application.
> 
> Cheers,
> 
> -- 
> Chris Foote <chris at inetd.com.au>
> Inetd Pty Ltd T/A HostExpress
> Web:   http://www.hostexpress.com.au
> Blog:  http://www.hostexpress.com.au/drupal/chris
> Phone: (08) 8410 4566
> _______________________________________________
> sapug mailing list
> sapug at python.org
> http://mail.python.org/mailman/listinfo/sapug
> 


More information about the sapug mailing list