Best practice for caching hash
Cameron Simpson
cs at cskk.id.au
Tue Mar 15 22:58:53 EDT 2022
On 16Mar2022 10:57, Chris Angelico <rosuav at gmail.com> wrote:
>> Is it sensible to compute the hash only from the immutable parts?
>> Bearing in mind that usually you need an equality function as well and
>> it may have the same stability issues.
>
>My understanding - and I'm sure Marco will correct me if I'm wrong
>here - is that this behaves like a tuple: if it contains nothing but
>hashable objects, it is itself hashable, but if it contains anything
>unhashable, the entire tuple isn't hashable.
A significant difference is that tuples have no keys, unlike a dict.
A hash does not have to hash all the internal state, ony the relevant
state, and not even all of that. The objective of the hash is twofold to
my mind:
- that "equal" objects (in the `==` sense) have the same hash, so that
they hash to the same backet in dicts and can therefore be found
- that hash values are widely distributed, to maximise the evenness of
the object distribution in buckets
For dicts to work, the former has to hold.
The latter has more flexibility. A dict has keys. If the dicts are quite
varied (sets of tags for example), it may be effective to just hash the
keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
likewise with database rows) this will go badly because the hashes will
all (or maybe mostly) collide.
>As such, any valid hash value will be stable, and "asking for a hash
>will raise TypeError" is also stable.
I would seek to avoid a TypeError for a frozendict, but as you can see
above I have not thought of a way to do that which would also have
desireable hash characteristics in almost all circumstances. (I think we
can accept that almost anything will have pathological cases, but the
bad cases in my hash-the-keys notion are not, to my mind, rare.)
>> >The problem is the first time I get an error with details, for example:
>> >TypeError: unhashable type: 'list'
>> >The subsequent times I simply raise a generic error:
>> >TypeError
>>
>> You could, you know, cache the original exception. That does keep links
>> to the traceback and therefore all the associates stack frames, so that
>> isn't cheap (it is peerfectly fast - just the reference, t just prevents
>> those things from being reclaimed).
>
>I don't like that idea myself, for that exact reason - it'll keep
>arbitrary numbers of objects alive.
I don't like it either, for that exact reason. That reason is the same
reason which has Python 3 exception variables get unset as you leave an
`except` clause. I'm sure it irks everyone, but the memory penalty of
not doing so is high.
>But caching the stringified form
>would be more reasonable here, and have similar effect.
Mmm, yes.
Cheers,
Cameron Simpson <cs at cskk.id.au>
More information about the Python-list
mailing list