[ python-Bugs-942952 ] Weakness in tuple hash

SourceForge.net noreply at sourceforge.net
Sat May 1 22:50:20 EDT 2004


Bugs item #942952, was opened at 2004-04-27 14:41
Message generated for change (Comment added) made by ygale
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=942952&group_id=5470

Category: Python Interpreter Core
Group: None
Status: Open
Resolution: None
Priority: 6
Submitted By: Steve Tregidgo (smst)
Assigned to: Tim Peters (tim_one)
Summary: Weakness in tuple hash

Initial Comment:
I've encountered some performance problems when
constructing dictionaries with keys of a particular
form, and have pinned the problem down to the hashing
function.  I've reproduced the effect in Python 1.5.2,
Python 2.2 and Python 2.3.3.  I came across this when
loading a marshalled dictionary with about 40000
entries.  Loading other dictionaries of this size has
always been fast, but in this case I killed the
interpreter after a few minutes of CPU thrashing.  The
performance problem was caused because every key in the
dictionary had the same hash value.

The problem is as follows: for hashable values X and Y,
hash( (X, (X, Y)) ) == hash(Y).  This is always true
(except in the corner case where hash((X, Y)) is
internally calculated to be -1 (the error value) and so
-2 is forced as the return value).  With data in this
form where X varies more than Y (Y is constant, or
chosen from relatively few values compared to X)
chances of collision become high as X is effectively
ignored.

The hash algorithm for tuples starts with a seed value,
then generates a new value for each item in the tuple
by multiplying the iteration's starting value by a
constant (keeping the lowest 32 bits) and XORing with
the hash of the item.  The final value is then XORed
with the tuple's length.  In Python (ignoring the
careful business with -1):

    # assume 'my_mul' would multiply two numbers and
return the low 32 bits
    value = seed
    for item in tpl:
        value = my_mul(const, value) ^ hash(item)
    value = value ^ len(tpl)

The tuple (X, Y) therefore has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash(Y) ^ 2
    
...and the tuple (X, (X, Y)) has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash((X, Y)) ^ 2

The outer multiplication is repeated, and is XORed with
itself (cancelling both of them). The XORed 2s cancel
also, leaving just hash(Y).  Note that this
cancellation property also means that the same hash
value is shared by (X, (X, (X, (X, Y)))), and (X, (X,
(X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z,
Y)) and so on.

I realise that choosing a hash function is a difficult
task, so it may be that the behaviour seen here is a
tradeoff against other desireable properties -- I don't
have the expertise to tell.  My naive suggestion would
be that an extra multiplication is necessary, which
presumably has a performance impact (multiplication
being more expensive than XOR) but would reduce the
possibility of cancellation. On the other hand, perhaps
this particular case is rare enough that it's not worth
the effort.

For my own application I'm fortunate in that I can
probably rearrange the data structure to avoid this case.



----------------------------------------------------------------------

Comment By: Yitz Gale (ygale)
Date: 2004-05-02 05:50

Message:
Logged In: YES 
user_id=1033539

I suggest leaving the function as is, but replacing the
constant with a value that varies as we step through the
tuple. Take the values from a fixed table. When we reach the
end of the table, start again from the beginning and mung
the values slightly (e.g., increment them).

True, there will always be the possibilities of collisions,
but this will make it very unlikely except in very weird
cases. Using a table of constants is a standard technique
for hashes.

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2004-04-28 20:45

Message:
Logged In: YES 
user_id=764593

Wouldn't that just shift the pathological case from (x, (x, y)) 
to (x, (-x, y))?  My point was that any hash function will act 
badly on *some* pattern of input, and if a pattern must be 
penalized, this might be a good pattern to choose.


----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-04-28 19:36

Message:
Logged In: YES 
user_id=80475

The OP was not referring to "some collisions"; his app
collapsed all entries to a single hash value.  Changing XOR
to + would partially eliminate the self cancelling property
of this hash function. 

Also, I am concerned about the tuple hash using the same
multiplier as the hash for other objects. In sets.py, a
naive combination of the component hash values caused many
distinct sets to collapse to a handful of possibilities --
while tuples do not have an identical issue, it does
highlight the risks involved.



----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2004-04-28 17:50

Message:
Logged In: YES 
user_id=764593

Any hash will have some collisions.  If there is going to be 
predictably bad data, this is probably a good place to have it.

The obvious alternatives are a more complicated hash (slows 
everything down), a different hash for embedded tuples (bad, 
since hash can't be cached then) or ignoring some elements 
when determining the hash (bad in the normal case of different 
data).

I would also expect your workaround of data rearrangement to 
be sensible almost any time (X, (X, Y)) is really a common 
case.  (The intuitive meaning for me is "X - then map X to Y", 
which could be done as (X, Y) or at least (X, (None, Y)), or 
perhaps d[X]=(X,Y).)



----------------------------------------------------------------------

Comment By: Steve Tregidgo (smst)
Date: 2004-04-27 14:45

Message:
Logged In: YES 
user_id=42335

I'll repeat the tuple hashing algorithm in a fixed-width
font (with the first line following "for..." being the loop
body):

 # assume 'my_mul' would multiply two numbers and
return the low 32 bits
value = seed
for item in tpl:
    value = my_mul(const, value) ^ hash(item)
value = value ^ len(tpl)


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=942952&group_id=5470



More information about the Python-bugs-list mailing list