# [ python-Bugs-942952 ] Weakness in tuple hash

Sun May 16 08:25:40 EDT 2004

```Bugs item #942952, was opened at 2004-04-27 14:41
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=942952&group_id=5470

Category: Python Interpreter Core
Group: None
Status: Open
Resolution: None
Priority: 6
Submitted By: Steve Tregidgo (smst)
Assigned to: Tim Peters (tim_one)
Summary: Weakness in tuple hash

Initial Comment:
I've encountered some performance problems when
constructing dictionaries with keys of a particular
form, and have pinned the problem down to the hashing
function.  I've reproduced the effect in Python 1.5.2,
Python 2.2 and Python 2.3.3.  I came across this when
building such a dictionary, an operation that has
always been fast, but in this case I killed the
interpreter after a few minutes of CPU thrashing.  The
performance problem was caused because every key in the
dictionary had the same hash value.

The problem is as follows: for hashable values X and Y,
hash( (X, (X, Y)) ) == hash(Y).  This is always true
(except in the corner case where hash((X, Y)) is
internally calculated to be -1 (the error value) and so
-2 is forced as the return value).  With data in this
form where X varies more than Y (Y is constant, or
chosen from relatively few values compared to X)
chances of collision become high as X is effectively
ignored.

The hash algorithm for tuples starts with a seed value,
then generates a new value for each item in the tuple
by multiplying the iteration's starting value by a
constant (keeping the lowest 32 bits) and XORing with
the hash of the item.  The final value is then XORed
with the tuple's length.  In Python (ignoring the
low-level integer details), the algorithm is roughly:

# assume 'my_mul' would multiply two numbers and
# return the low 32 bits
value = seed
for item in tpl:
    value = my_mul(const, value) ^ hash(item)
value = value ^ len(tpl)

The tuple (X, Y) therefore has hash value:

my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash(Y) ^ 2

...and the tuple (X, (X, Y)) has hash value:

my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash((X, Y)) ^ 2

The outer multiplication is repeated, and is XORed with
itself (cancelling both of them). The XORed 2s cancel
also, leaving just hash(Y).  Note that this
cancellation property also means that the same hash
value is shared by (X, (X, (X, (X, Y)))), and (X, (X,
(X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z,
Y)) and so on.
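
To see the cancellation concretely, here is a small pure-Python
model of the algorithm (the seed and multiplier below are
illustrative stand-ins; the identity holds for any choice of
constants, and leaf items are assumed to hash to themselves):

MASK = 0xFFFFFFFF
SEED = 0x345678
MULT = 1000003

def old_tuple_hash(tpl):
    value = SEED
    for item in tpl:
        # recurse into nested tuples; other items hash to themselves
        h = old_tuple_hash(item) if isinstance(item, tuple) else item & MASK
        value = ((MULT * value) & MASK) ^ h
    return value ^ len(tpl)

Every X then collapses onto hash(Y), at any nesting depth:

assert all(old_tuple_hash((x, (x, 7))) == 7 for x in range(100))
assert old_tuple_hash((3, (3, (3, (3, 7))))) == 7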

I realise that choosing a hash function is a difficult
task, so it may be that the behaviour seen here is a
tradeoff against other desirable properties -- I don't
have the expertise to tell.  My naive suggestion would
be that an extra multiplication is necessary, which
presumably has a performance impact (multiplication
being more expensive than XOR) but would reduce the
possibility of cancellation. On the other hand, perhaps
this particular case is rare enough that it's not worth
the effort.

For my own application I'm fortunate in that I can
probably rearrange the data structure to avoid this case.

----------------------------------------------------------------------

Comment By: Yitz Gale (ygale)
Date: 2004-05-16 15:25

Message:
Logged In: YES
user_id=1033539

Hmm, you are correct. This appears to be
an off-by-one problem: the original seed always gets
multiplied by the constant (which is silly), and the last
item in the tuple does not get multiplied (which causes
the bug).

The correct solution is to change:

value = my_mul(const, value) ^ hash(item)

in Steve's pseudo-code to:

value = my_mul(const, value ^ hash(item))

Of course, you still get a lot more robustness
for almost no cost if you vary "const" across
the tuple via a table.
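
As a sanity check, the two update rules can be compared in a
small pure-Python model (illustrative constants; leaf items are
assumed to hash to themselves):

MASK, SEED, MULT = 0xFFFFFFFF, 0x345678, 1000003

def old_hash(tpl):
    # old rule: the last item's hash never goes through the multiply
    value = SEED
    for item in tpl:
        h = old_hash(item) if isinstance(item, tuple) else item & MASK
        value = ((MULT * value) & MASK) ^ h
    return value ^ len(tpl)

def fixed_hash(tpl):
    # proposed rule: every item's hash goes through the multiply
    value = SEED
    for item in tpl:
        h = fixed_hash(item) if isinstance(item, tuple) else item & MASK
        value = (MULT * (value ^ h)) & MASK
    return value ^ len(tpl)

With the old rule every (x, (x, 7)) hashes to 7; with the
proposed rule the x values no longer cancel out.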

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2004-05-03 16:40

Message:
Logged In: YES
user_id=764593

Note that in this case, the problem was because of nested
tuples.  X was the first element of both tuples, and both
tuples had the same length -- so the X's would still have been
multiplied by the same constant.  (Not the same constant as
Y and hash((X, Y)), but the same constant as each other.)  A
non-linear function might work, but would do bad things to
other data.
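
Sketching that in a pure-Python model (made-up table of
constants) shows the point: with per-position constants but the
old update rule, X still sits at position 0 of a length-2 tuple
in both places, so the same multiplied term appears twice and
cancels:

MASK, SEED = 0xFFFFFFFF, 0x345678
MULTS = (1000003, 1000033, 1000037)  # illustrative table

def table_hash(tpl):
    value = SEED
    for i, item in enumerate(tpl):
        # cycle through the table, bumping the constant on wraparound
        mult = MULTS[i % len(MULTS)] + i // len(MULTS)
        h = table_hash(item) if isinstance(item, tuple) else item & MASK
        value = ((mult * value) & MASK) ^ h
    return value ^ len(tpl)

table_hash((x, (x, 7))) is still 7 for every x.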

----------------------------------------------------------------------

Comment By: Yitz Gale (ygale)
Date: 2004-05-02 05:50

Message:
Logged In: YES
user_id=1033539

I suggest leaving the function as is, but replacing the
constant with a value that varies as we step through the
tuple. Take the values from a fixed table. When we reach the
end of the table, start again from the beginning and mung
the values slightly (e.g., increment them).

True, there will always be the possibilities of collisions,
but this will make it very unlikely except in very weird
cases. Using a table of constants is a standard technique
for hashes.

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2004-04-28 20:45

Message:
Logged In: YES
user_id=764593

Wouldn't that just shift the pathological case from (x, (x, y))
to (x, (-x, y))?  My point was that any hash function will act
badly on *some* pattern of input, and if a pattern must be
penalized, this might be a good pattern to choose.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-04-28 19:36

Message:
Logged In: YES
user_id=80475

The OP was not referring to "some collisions"; his app
collapsed all entries to a single hash value.  Changing XOR
to + would partially eliminate the self cancelling property
of this hash function.

Also, I am concerned about the tuple hash using the same
multiplier as the hash for other objects. In sets.py, a
naive combination of the component hash values caused many
distinct sets to collapse to a handful of possibilities --
while tuples do not have an identical issue, it does
highlight the risks involved.
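
The XOR-to-addition point can be checked in a small pure-Python
model (illustrative constants; leaf items are assumed to hash to
themselves):

MASK, SEED, MULT = 0xFFFFFFFF, 0x345678, 1000003

def add_hash(tpl):
    # same structure as the current hash, but combining with + instead of ^
    value = SEED
    for item in tpl:
        h = add_hash(item) if isinstance(item, tuple) else item & MASK
        value = (((MULT * value) & MASK) + h) & MASK
    return value ^ len(tpl)

Since x + x does not cancel the way x ^ x does, the repeated
multiplied term survives, and (x, (x, 7)) no longer collapses to
a single hash value.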

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2004-04-28 17:50

Message:
Logged In: YES
user_id=764593

Any hash will have some collisions.  If there is going to be
predictably bad data, this is probably a good place to have it.

The obvious alternatives are a more complicated hash (slows
everything down), a different hash for embedded tuples (bad,
since hash can't be cached then) or ignoring some elements
when determining the hash (bad in the normal case of different
data).

I would also expect your workaround of data rearrangement to
be sensible almost any time (X, (X, Y)) is really a common
case.  (The intuitive meaning for me is "X - then map X to Y",
which could be done as (X, Y) or at least (X, (None, Y)), or
perhaps d[X]=(X,Y).)

----------------------------------------------------------------------

Comment By: Steve Tregidgo (smst)
Date: 2004-04-27 14:45

Message:
Logged In: YES
user_id=42335

I'll repeat the tuple hashing algorithm in a fixed-width
font, this time with the loop body indented:

# assume 'my_mul' would multiply two numbers and
# return the low 32 bits
value = seed
for item in tpl:
    value = my_mul(const, value) ^ hash(item)
value = value ^ len(tpl)

----------------------------------------------------------------------


```