finding items that occur more than once in a list

Arnaud Delobelle arnodel at googlemail.com
Sat Mar 22 11:58:15 CET 2008


On Mar 22, 8:31 am, Bryan Olson <fakeaddr... at nowhere.org> wrote:
> castiro... at gmail.com wrote:
> > John Machin wrote:
> >> On Mar 22, 1:11 am, castiro... at gmail.com wrote:
> >>> A collision sequence is not so rare.
> >>>>>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ]
> >>> [1, 1, 1, 1, 1, 1, 1, 1]
> >> Bryan did qualify his remarks: "If we exclude the case where an
> >> adversary is choosing the keys, ..."
>
> > Some adversary.  What, you mean, my boss or my customers?
>
> We mean that the party supplying the keys deliberately chose
> them to make the hash table inefficient. In this thread the goal
> is efficiency; a party working toward an opposing goal is an
> adversary.

There are situations where this can happen I guess

> If you find real-world data sets that tend to induce bad-case
> behavior in Python's hash, please do tell. It would be reason
> enough to adjust the hash function. The hashes in popular
> software such as Python are already quite well vetted.

As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function.  I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.

-------- badhash.py ----------

class BadHash(object):
    def __hash__(self):
        return 1 # that's a bad hash function!

from time import time

def test(n):
    t0 = time()
    # Create a hash table of size n
    s = set(BadHash() for x in xrange(n))
    t1 = time()
    # Add an object to it
    obj = BadHash()
    s.add(obj)
    t2 = time()
    # Find that object
    obj in s
    t3 = time()
    print "%s\t%.3e\t%.3e\t%.3e" % (n, (t1-t0)/n**2, (t2-t1)/n, (t3-
t2)/n)

print "n\tcreate/n^2\tadd/n\t\tfind/n"
for k in range(8, 17):
    # I'm hoping that adding an element to a dict of size (1 << k) + 1
    # will not trigger a resize of the hasmap.
    test((1 << k) + 1)

--------------------------------

marigold:junk arno$ python badhash.py
n	create/n^2	add/n		find/n
257	3.917e-08	6.958e-08	7.051e-08
513	3.641e-08	8.598e-08	7.018e-08
1025	3.522e-08	7.118e-08	6.722e-08
2049	3.486e-08	6.935e-08	6.982e-08
4097	3.480e-08	7.053e-08	6.931e-08
8193	3.477e-08	6.897e-08	6.981e-08
16385	3.441e-08	6.963e-08	7.075e-08
32769	3.720e-08	7.672e-08	7.885e-08
65537	3.680e-08	7.159e-08	7.510e-08

So theory and practice agree!  In this case The hash table behaves
just like a linked list.

Lastly, if one deals with a totally ordered set of object but they are
not hashable (there isn't a good hash function), then Ninereeds' idea
of sorting first is still useful.

--
Arnaud




More information about the Python-list mailing list