insert unique data in a list

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Mon Dec 14 16:53:38 EST 2009


On Mon, 14 Dec 2009 17:13:24 +0000, mattia wrote:

> Il Sun, 13 Dec 2009 21:17:28 -0800, knifenomad ha scritto:
> 
>> On 12월14일, 오후12시42분, Steven D'Aprano
>> <ste... at REMOVE.THIS.cybersource.com.au> wrote:
>>> On Sun, 13 Dec 2009 17:19:17 -0800, knifenomad wrote:
>>> > this makes the set type hashable.
>>>
>>> > class Set(set):
>>> >     __hash__ = lambda self: id(self)
>>>
>>> That's a *seriously* broken hash function.
>>>
>>> >>> key = "voila"
>>> >>> d = { Set(key): 1 }
>>> >>> d
>>>
>>> {Set(['i', 'a', 'l', 'o', 'v']): 1}>>> d[ Set(key) ]
>>>
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> KeyError: Set(['i', 'a', 'l', 'o', 'v'])
>>>
>>> --
>>> Steven
>> 
>> of course it is broken as long as it uses it's instance id. i added
>> this to notify that unhashable can become hashable implementing
>> __hash__ inside the class. which probably set to None by default.
> 
> Ok, nice example, but I believe that using id() as the hash function can
> lead to unexpected collisions.

No, you have that backwards. Using id() as the hash function means that 
equal keys will hash unequal -- rather than unexpected collisions, it 
leads to unexpected failure-to-collide-when-it-should-collide.

And it isn't a "nice example", it is a terrible example. 

Firstly, the example fails to behave correctly. It simply doesn't work as 
advertised.

Secondly, and much worse, it encourages people to do something dangerous 
without thinking about the consequences. If it is so easy to hash mutable 
objects, why don't built-in lists and sets don't have a __hash__ method? 
Do you think that the Python developers merely forgot?

No, there is good reason why mutable items shouldn't be used as keys in 
dicts or in sets, and this example simply papers over the reasons why and 
gives the impression that using mutable objects as keys is easy and safe 
when it is neither.

Using mutable objects as keys or set elements leads to surprising, 
unexpected behaviour and hard-to-find bugs. Consider the following set 
with lists as elements:

L = [1, 2]
s = Set()  # set that allows mutable elements
s.add(L)
s.add([1, 2, 3])

So far so good. But what should happen now?

L.append(3)

The set now has two equal elements, which breaks the set invariant that 
it has no duplicates.

Putting the problem of duplicates aside, there is another problem:

L = [1, 2]
s = Set([L])
L.append(3)

There are two possibilities: either the hash function of L changes when 
the object changes, or it doesn't. Suppose it changes. Then the hash of L 
after the append will be different from the hash of L before the append, 
and so membership testing (L in s) will fail.

Okay, suppose we somehow arrange matters so that the hash of the object 
doesn't change as the object mutates. This will solve the problem above: 
we can mutate L as often as we like, and L in s will continue to work 
correctly. But now the hash of an object doesn't just depend on it's 
value, but on its history. That means that two objects which are 
identical can hash differently, and we've already seen this is a problem.

There is one final approach which could work: we give the object a 
constant hash function, so that all objects of that type hash 
identically. This means that the performance of searches and lookups in 
sets and dicts will fall to that of lists. There is no point in paying 
all the extra overhead of a dict to get behaviour as slow, or slower, 
than a list.

In other words, no matter what you do, using mutable objects as keys or 
set elements leads to serious problems that need to be dealt with. It 
simply isn't true that all you need to do to make mutable objects usable 
in dicts or sets is to add a hash function. That part is trivial.



-- 
Steven



More information about the Python-list mailing list