[Python-ideas] dict.hash - optimized per module

Antoine Pitrou solipsis at pitrou.net
Sun Oct 17 12:52:40 CEST 2010


On Sun, 17 Oct 2010 20:41:34 +1100
Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, 17 Oct 2010 06:27:37 pm Jan Koprowski wrote:
> 
> >   After watching I made graph, using presented at conference library
> > dictinfo, for __builtin__.__dict__.
> >   When I saw few collisions I think "Why this module doesn't have
> > their own hashing function implementation which allow to avoid
> > collision in this set of names?". 
> 
> Python 2.6 has 143 builtin names, and zero collisions:

It depends what you call collisions. Collisions during bucket lookup, or
during hash value comparison (that is, after you selected a bucket)?

For the former, here is the calculation assuming an overallocation
factor of 4 (which, IIRC, is the one used in the dict implementation):

>>> import builtins
>>> d = builtins.__dict__
>>> m = len(d) * 4
>>> for name in d:
...   h = hash(name) % m
...   if h in hashes: print("Collision for", name)
...   hashes.setdefault(h, []).append(name)
... 
Collision for True
Collision for FutureWarning
Collision for license
Collision for KeyboardInterrupt
Collision for UserWarning
Collision for RuntimeError
Collision for MemoryError
Collision for Ellipsis
Collision for UnicodeError
Collision for Exception
Collision for tuple
Collision for delattr
Collision for setattr
Collision for ArithmeticError
Collision for property
Collision for KeyError
Collision for PendingDeprecationWarning
Collision for map
Collision for AssertionError
>>> len(d)
130
>>> len(hashes)
110


> > My second think was "Why each 
> > Python module doesn't have their own internal hashing function which
> > doesn't produce collisions in scope of his names".

The real answer here is that Python needs hash values to be
globally valid. Both for semantics (module dicts are regular dicts and
should be usable as such), and for efficiency (having an unique hash
function means the precalculated hash value can be stored for critical
types such as str).

Regards

Antoine.





More information about the Python-ideas mailing list