[Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

Mon Jan 2 04:28:13 CET 2012

On Sun, Jan 1, 2012 at 10:00 PM, PJ Eby <pje at telecommunity.com> wrote:
> On Sun, Jan 1, 2012 at 7:37 PM, Jim Jewett <jimjjewett at gmail.com> wrote:

>> Well, there is nothing wrong with switching to a different hash function
>> after N
>> collisions, rather than "in the first place".  The perturbation
>> effectively does by
>> shoving the high-order bits through the part of the hash that survives the
>> mask.

> Since these are true hash collisions, they will all have the same high order
> bits.  So, the usefulness of the perturbation is limited mainly to the
> common case where true collisions are rare.

That is only because the perturb is based solely on the hash.
Switching to an entirely new hash after the 5th collision (for a given
lookup) would resolve that (after the 5th collision); the question is
whether or not the cost is worthwhile.

>> > (Well, technically, you could use trees or some other O log n data
>> > structure as a fallback once you have too many collisions, for some
>> > value
>> > of "too many".  Seems a bit wasteful for the purpose, though.)
>>
>> Your WSGI specification < http://www.python.org/dev/peps/pep-0333/ >
>> requires
>> using a real dictionary for compatibility; storing some of the values
>> outside the
>> values array would violate that.

> When I said "use some other data structure", I was referring to the internal
> implementation of the dict type, not to user code.  The only user-visible
> difference (even at C API level) would be the order of keys() et al.

Given the wording requiring a real dictionary, I would have assumed
that it was OK (if perhaps not sensible) to do pointer arithmetic and
access the keys/values/hashes directly.  (Though if the breakage was
between python versions, I would feel guilty about griping too
loudly.)

-jJ