Sets in Python

Wed Sep 19 22:17:09 EDT 2007

On Wed, 19 Sep 2007 20:58:03 +0000, Karthik Gurusamy wrote:

> While it's easy to explain the behavior, I think the decision to dis-
> allow mutable items as keys is a bit arbitrary. There is no need for
> dict to recompute hash 

What???

Of course it does. How else can it look up the key? Because it (somehow) 
just recognizes that it has seen the key before? How? By magic?

> (first of all, a user doesn't even need to know
> if underneath 'hashing' is used -- the service is just a mapping between
> one item to another item).

The user doesn't need to know the mechanism, but the dict does. Dicts are 
implemented as hash tables. I suppose they could be implemented as 
something else (what? linear lists? some sort of tree?) but the same 
considerations must be made: the dict must be able to find keys it has 
seen before. How is the dict supposed to recognise the key if the key has 
changed?

> Since we know hashing is used, all that is needed is, a well-defined way
> to construct a hash out of a mutable. "Given a sequence, how to get a
> hash" is the problem. 

Nonsense. That's not the problem. The problem is how to get exactly the 
same hash when the sequence has changed.

In other words, if you have two mutable objects M1 and M2, then you 
expect:

hash(M1) == hash(M2) if and only if M1 and M2 are equal
hash(M1) != hash(M2) if M1 and M2 are unequal

but also:

if M1 mutates to become equal to M2, hash(M1) must remain the same while 
still being different from hash(M2).

That means that hash() now is a non-deterministic function. hash([1,2,3]) 
will vary according to how the list [1,2,3] was constructed!

Obviously something has to give, because not all of these things are 
mutually compatible.

> If later the given sequence is different, that's
> not the dict's problem.

Data structures don't have problems. Programmers do. And language 
designers with sense build languages that minimize the programmers 
problems, not maximize them.

> So if the list changes, it will result in a different hash and we will
> get a hash-miss. I doubt this is in anyway less intuitive than dis-
> allowing mutable items as keys.

The choices for the language designer are:

(1) Invent some sort of magical non-deterministic hash function which 
always does the Right Thing.

(2) Allow the hash of mutable objects to change, which means you can use 
mutable objects as keys in dicts but if you change them, you can no 
longer find them in the dict. They'll still be there, using up memory, 
but you can't get to them.

(3) Simply disallow mutable objects as keys.

Alternative 1 is impossible, as we've seen, because the requirements for 
the Right Thing are not mutually compatible.

Alternative (2) leads to hard-to-find, hard-to-diagnose bugs where you 
store objects in a dict only for them to mysteriously disappear later. 
Worse, it could lead to bugs like the following hypothetical:

>>> M = [1, 2, 3]
>>> D = {M: 'parrot'} # pretend this works
>>> D
{[1, 2, 3]: 'parrot'}
>>> M.append(4)
>>> D
{[1, 2, 3, 4]: 'parrot'}
>>> D[[1, 2, 3, 4]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: [1, 2, 3, 4]

Try explaining that one to programmers: they can SEE the key in the dict 
when they print it, but they can't get it or delete it because the hash 
has changed.

Alternative 3 is easy to deal with: simply don't use mutable objects as 
keys. That's what Python does. Sure, the programmer sometimes needs to 
work around the lack (convert the list into a tuple, or a string, or 
pickle it, whatever...) which on rare occasions is hard to do, but at 
least there are no mysterious, hard to track down bugs.

-- 
Steven.