# finding items that occur more than once in a list

Sat Mar 22 11:58:15 CET 2008

```On Mar 22, 8:31 am, Bryan Olson <fakeaddr... at nowhere.org> wrote:
> castiro... at gmail.com wrote:
> > John Machin wrote:
> >> On Mar 22, 1:11 am, castiro... at gmail.com wrote:
> >>> A collision sequence is not so rare.
> >>>>>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ]
> >>> [1, 1, 1, 1, 1, 1, 1, 1]
> >> Bryan did qualify his remarks: "If we exclude the case where an
> >> adversary is choosing the keys, ..."
>
> > Some adversary.  What, you mean, my boss or my customers?
>
> We mean that the party supplying the keys deliberately chose
> them to make the hash table inefficient. In this thread the goal
> is efficiency; a party working toward an opposing goal is an

There are situations where this can happen I guess

> If you find real-world data sets that tend to induce bad-case
> behavior in Python's hash, please do tell. It would be reason
> enough to adjust the hash function. The hashes in popular
> software such as Python are already quite well vetted.

As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function.  I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.

def __hash__(self):
return 1 # that's a bad hash function!

from time import time

def test(n):
t0 = time()
# Create a hash table of size n
s = set(BadHash() for x in xrange(n))
t1 = time()
# Add an object to it
t2 = time()
# Find that object
obj in s
t3 = time()
print "%s\t%.3e\t%.3e\t%.3e" % (n, (t1-t0)/n**2, (t2-t1)/n, (t3-
t2)/n)

for k in range(8, 17):
# I'm hoping that adding an element to a dict of size (1 << k) + 1
# will not trigger a resize of the hasmap.
test((1 << k) + 1)

--------------------------------

257	3.917e-08	6.958e-08	7.051e-08
513	3.641e-08	8.598e-08	7.018e-08
1025	3.522e-08	7.118e-08	6.722e-08
2049	3.486e-08	6.935e-08	6.982e-08
4097	3.480e-08	7.053e-08	6.931e-08
8193	3.477e-08	6.897e-08	6.981e-08
16385	3.441e-08	6.963e-08	7.075e-08
32769	3.720e-08	7.672e-08	7.885e-08
65537	3.680e-08	7.159e-08	7.510e-08

So theory and practice agree!  In this case The hash table behaves

Lastly, if one deals with a totally ordered set of object but they are
not hashable (there isn't a good hash function), then Ninereeds' idea
of sorting first is still useful.

--
Arnaud

```