splitting a large dictionary into smaller ones

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Mon Mar 23 03:31:54 EDT 2009


On Sun, 22 Mar 2009 23:10:21 -0400, Terry Reedy wrote:

> Searching for a key in, say, 10 dicts will be slower than searching for
> it in just one.  The only reason I would do this would be if the dict
> had to be split, say over several machines.  But then, you could query
> them in parallel.

That surely depends on the number of collisions? With millions of keys, 
the number of collisions could in principle be high (although in 
practice it almost certainly won't be). As I understand it, searching a 
dict is O(1) on average, but if there are collisions it can degrade to 
O(m), where m is the number of keys whose hashes collide. In principle, 
m could be large.
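
To see what that worst case looks like, here's a quick sketch (untested 
and purely illustrative; BadKey is just a made-up class) using a key 
type whose hash always collides:

class BadKey(object):
    """Made-up key type whose hash always collides."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42      # every instance lands in the same bucket
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

good = dict.fromkeys(range(1000))   # int keys: effectively no collisions
bad = dict.fromkeys(BadKey(i) for i in range(1000))   # all 1000 keys collide

# Membership testing in `bad` has to fall back on __eq__ comparisons
# against the colliding keys, which is where the O(m) cost comes from.
assert BadKey(999) in bad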

Hypothetically speaking, if you could predict where collisions occur and 
split your data into multiple smaller dicts with fewer collisions, you 
could delegate to the right smaller dict and reduce one O(m) lookup to 
two O(1) lookups (pick the sub-dict, then search it), which of course is 
still O(1).
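
Something like this untested sketch shows the shape of the idea (the 
partition by first character is just a made-up stand-in for whatever 
predictor would actually separate the colliding keys):

from collections import defaultdict

class SplitDict(object):
    """Route each key to a smaller dict via a caller-supplied partition."""
    def __init__(self, partition):
        self.partition = partition         # callable: key -> sub-dict label
        self.buckets = defaultdict(dict)   # label -> small ordinary dict

    def __setitem__(self, key, value):
        self.buckets[self.partition(key)][key] = value

    def __getitem__(self, key):
        # Two O(1) steps: find the small dict, then search within it.
        return self.buckets[self.partition(key)][key]

d = SplitDict(lambda key: key[0])   # e.g. partition string keys by first letter
d["spam"] = 1
d["eggs"] = 2
assert d["spam"] == 1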

Of course, this all assumes that collisions in Python's hash function 
are both predictable and frequent in the OP's data. If either of those 
doesn't hold, this suggestion is unlikely to be of much practical use. 
Could be an interesting experiment though :)
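
If anyone feels like trying it, something along these lines (rough and 
untested, with int keys standing in for real data) would give a 
first-cut timing comparison of one big dict versus ten smaller ones:

import timeit

setup = """
big = dict.fromkeys(range(10**6))
small = [dict.fromkeys(range(i, 10**6, 10)) for i in range(10)]
"""

# With ordinary int keys there are no real collisions, so this mostly
# measures the cost of the extra indirection rather than any savings.
print(timeit.timeit("999999 in big", setup=setup))
print(timeit.timeit("999999 in small[999999 % 10]", setup=setup))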



-- 
Steven


