Mutable objects which define __hash__ (was Re: Why are tuples immutable?)

Bengt Richter bokr at oz.net
Thu Dec 30 01:51:38 CET 2004


On Wed, 29 Dec 2004 22:47:55 +1000, Nick Coghlan <ncoghlan at iinet.net.au> wrote:

><Sections cut where I don't feel I have anything to add to what Bengt already said>
>
>Bengt Richter wrote:
>> A second time a key may be hashed is when it is used as a lookup key. This can be a reference to
>> the identical key object first used, or it can be a new object. A new object has to be hashed in order
>> to have a hash value to use in finding candidate keys to compare to. If _this_ object
>> is used as a key again, it must be hashed again -- unless it is an object that caches its own
>> hash and invalidates it when mutated so it can rehash itself if its __hash__ method is subsequently
>> called.
>> 
>> Thus re-use of a different lookup key than the original provides an opportunity for a performance
>> gain in the lookup if the key object caches it hash value for return by __hash__.
>> That's what I meant by "why not?" in response to your assertion that the constant hash restriction
>> doesn't (ok, I took that as a "can't" ;-) provide any performance gain.
>
>I'm not sure I'm following here. . . for the case of an object caching its own 
>hash, that seems to be a performance gain that is independent of the dictionary 
>implementation. From the dictionary's point of view, it just calls hash() and 
>works with the result, exactly as it does now.
I'm sorry. I left out the idea that the dict itself could do the equivalent by keeping
track of mutable keys (actually any keys) being re-used, and could internally cache a
hash value that it could look up as a hint by id, to use to find a bucket for value comparisons etc.
Whether that would be optimum or pessimum performance would depend on the keys and usage patterns of course.

>
>I agree hash caching gives good optimisation - my point was that hash caching 
>can be just as effective in the presence of mutable keys, so long as there is 
>some way to invalidate the cached hash values when a mutation occurs.
>
>> Given that you assume you know when "rehashing" is going to be necessary -- meaning you
>> know when keys mutate.
>
>Well, yes. Antoon's original complaint was that mutable objects could not be 
>used as keys in dictionary, even if the application guaranteed key stability for 
>the lifetime of the relevant dictionary entry.
>
>I gave dict.rehash as a method equivalent to list.sort when working with 
>mutables keys or list entries, respectively.
>
>>>Or resist the temptation to guess, and have rehash() raise an exception :)
>> 
>> Easy out ;-)
>
>I certainly thought so }:>
>
>>>The identity keyed dict was based on Antoon's posted use case: he wanted to 
>>>classify mutable objects, and then check for membership in those groups.
>> 
>> I thought he just wanted to use mutables as if they were immutable, so that
>> equal mutable keys would look up the same value in a dict.
>
>That was in the original post, yes. Later, he gave an example that ran something 
>like:
>
>   for node in graph:
>     if node in some_dict:
>       # Do NOT mutate this node
>       # Possibly do other things
>       continue
>     # Do something that mutates the (non-key) node
>
>>>You can use lists for this, but the lookup is slow. Conventional dictionaries 
>>>and sets give a fast lookup, but require that the items in use be hashable.
>>>
>> 
>> A constant hash works. But since it clogs a single bucket and gives no clue
>> of value mismatch, we'd have to time it to see how it compares with list.index.
>> Did the Timbot do both?
>
>Write them both? I don't know.
>
>However, list.index is a simple linear search of the array of objects (I was 
>reading that code recently for unrelated reasons).
>
>The dict hash collision resolution was almost certainly written by Tim - he 
>definitely wrote the comments at the top of the file trying to explain how the 
>heck it works. I have no real clue as to the performance of the lookup algorithm 
>in the degenerate case of a large number of distinct keys having the same hash 
>value, but reading the code suggests it would perform worse than the simple 
>integer increment that list.index uses (as the dict lookup requires multiple 
>if-tests and so forth on each iteration).
So my initial mutable-key dict subclass hack might actually be better than
the wrapper version hashing to zero. Hm.

>
>>>An identity dictionary (or set) solves the 'object classification' problem 
>>>perfectly.
>> 
>> I missed the definition of those terms I'm afraid ;-)
>
>It was in some other part of this rambling thread. It's easier to repeat the 
>idea than it is to go find where that was :)
Thanks either way, but this is more convenient for me ;-)

>
>An 'identity dictionary' is just a dict that looks up keys based on identity 
>rather than value. It ignores the keys' actual hash and comparison operations, 
>and forces use of the default hash for hashing, and identity comparison for 
>equality testing. It maps from key _objects_ to values, rather than from key 
>values to other values.
>
>An 'identity set' is simply the same idea applied to a set instead of a 
>dictionary. Conceptually, it stores sets of arbitrary objects, rather than sets 
>of values.
Essentially syntactic sugar to avoid writing id(obj) ? (and to get a little performance
improvement if they're written in C). I can't believe this thread came from the
lack of such sugar ;-)

>
>The 'object classification' problem is simply the situation where you have an 
>existing collection of objects (e.g. graphs of a node), and you want to classify 
>them according to some criteria. Dictionaries or sets are a nice way to do that, 
>but the standard types are a pain if your original objects are mutable (this 
>inconvenience being the basis of Antoon's original complaint).
The pain being writing id(obj) ?? ;-)

Or, for that matter, (if you are the designer) giving the objects an
obj.my_classification attribute (or indeed, property, if dynamic) as part
of their initialization/design?

>
>For a classification problem, though, we're likely only interested in what 
>categories each of the original objects falls into - so identity-based storage 
>will usually be entirely adequate (and, in some cases, actually better then 
>value-based storage, since we can avoid irrelevant comparisons of potentially 
>complex objects). And the performance should romp it in when compared to using 
>lists for the categorisation.
So it boils down to statically classified objects as in

   obj = MyOrSomeonesObjClass(some_args)
   ...
   my_obj_classifications[id(obj)] = my_initial_classifier_function(obj)

after all this? ;-)

For that matter, if you have a my_initial_classifier_function, why wouldn't you just
memoize that function (with suitable mutation detection and cache update)? You could
control the caching to freeze the set membership at some point if desirable, for the
the functionality of KeyError or a dict.get default.

Or subclass your graph node so you can do something readable like
    if node.is_leaf: ...
instead of
    if my_obj_classification[id(node)] == 'leaf': ...

>
>Hence why I suggested Antoon should consider pursuing collections.identity_dict 
>and collections.identity_set if identity-based lookup would actually address his 
>requirements. Providing these two data types seemed like a nice way to do an end 
>run around the bulk of the 'potentially variable hash' key problem.
I googled for those ;-) I guess pursuing meant implementing ;-)

Regards,
Bengt Richter



More information about the Python-list mailing list