hash(unicode(string)) == hash(string) sometimes (was Re: Why KeyError ???)

Wed Mar 6 18:10:29 EST 2002

Paul Rubin <phr-n2002a at nightsong.com> wrote in message news:<7xvgcabcti.fsf at ruckus.brouhaha.com>...
> "Raymond Hettinger" <python at rcn.com> writes:
> > > KeyError: šđčćž
> > > >>> a == b
> > > 1
> > 
> > Hmm.  I don't get the same identity check results as you do:
> > >>> a = '\xe7\xd0\x9f\x86\xa7'
> > >>> b = unicode(a,'cp1250')
> > >>> a is b
> > 0
> 
> Try == rather than 'is'.  The docs are a little bit imprecise about
> what's supposed to happen here, but 2.2.7 "Mapping types" says about
> numeric keys:
> 
>     A dictionary's keys are almost arbitrary values. The only types of
>     values not acceptable as keys are values containing lists or
>     dictionaries or other mutable types that are compared by value rather
>     than by object identity. Numeric types used for keys obey the normal
>     rules for numeric comparison: if two numbers compare equal (e.g. 1 and
>     1.0) then they can be used interchangeably to index the same
>     dictionary entry.
> 
> That makes it surprising if two unicode strings that compare as ==
> don't index the same dictionary item.

We don't have two unicode strings here. a is NOT a unicode string. It
is a plain 8-bit string. The comparison coerces the 8-bit string to a
unicode string (using the default encoding), so they compare equal.
However dictionary keying uses a hash function as well as comparison.
As currently implemented, hash(a) != hash(b) sometimes. The hashes are
equal when the Unicode characters are merely zero-extended 8-bit
characters, as will happen when the default encoding is ascii or
Latin-1. However the hashes will not be the same when the encoding is
more complicated, as in the OP's cp1250 example.
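
For anyone who wants to poke at this, here is a rough sketch in modern
Python 3 syntax. That is an assumption on my part: Python 3 has no
unicode() and no implicit coercion, so it cannot reproduce the OP's
session. It only shows the two ingredients: Latin-1 decoding is plain
zero-extension while cp1250 is not, and a dictionary lookup misses when
two keys compare equal but hash differently. The Key class and its
hash_value argument are made up for illustration.

```python
# Python 3 sketch; bytes play the role of the old 8-bit string.
raw = b'\xe7\xd0\x9f\x86\xa7'   # the bytes from the example quoted above

latin1_points = [ord(ch) for ch in raw.decode('latin-1')]
cp1250_points = [ord(ch) for ch in raw.decode('cp1250')]

# Latin-1 maps every byte to the code point with the same number ...
print(latin1_points == list(raw))   # True
# ... cp1250 does not (0xd0 becomes U+0110, for example).
print(cp1250_points == list(raw))   # False


class Key:
    """Hypothetical key whose hash is set independently of its value."""

    def __init__(self, value, hash_value):
        self.value = value
        self.hash_value = hash_value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

    def __hash__(self):
        return self.hash_value


a = Key('x', 1)
b = Key('x', 2)          # equal to a, but hashes differently
table = {a: 'found'}

print(a == b)                  # True
print(b in table)              # False: the hash is checked before ==
print(Key('x', 1) in table)    # True: same hash, then == succeeds
```

The last three lines are the KeyError situation in miniature: equality
alone is not enough for a dictionary hit.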

If the "fix" for this would involve making hash(string) always do
hash(unicode(string)) then I sure hope that (borrowing timbot
phraseology) somebody optimises the snot out of it.
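
To make the cost concrete, here is a hedged Python 3 sketch of that
"fix". EncodedText is a hypothetical wrapper, not anything in the
interpreter; it stands in for an 8-bit string that knows its encoding
and hashes as its decoded text.

```python
class EncodedText:
    """Hypothetical stand-in for an 8-bit string with a known encoding."""

    def __init__(self, raw, encoding):
        self.raw = raw
        self.encoding = encoding

    def __eq__(self, other):
        # Compare as decoded text, whichever side the wrapper is on.
        if isinstance(other, EncodedText):
            other = other.raw.decode(other.encoding)
        return self.raw.decode(self.encoding) == other

    def __hash__(self):
        # Decodes on every call: this is the part that would need
        # optimising if it were done for every string hash.
        return hash(self.raw.decode(self.encoding))


a = EncodedText(b'\xe7\xd0\x9f\x86\xa7', 'cp1250')
b = b'\xe7\xd0\x9f\x86\xa7'.decode('cp1250')
lookup = {b: 'found'}

print(a == b)                # True
print(hash(a) == hash(b))    # True
print(lookup[a])             # found
```

With the hash routed through the decoded text, equal keys index the
same entry, at the price of a decode per hash unless the result is
cached.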

BTW, I'm not so sure of the utility of hash(1) == hash(1.0) --- why on
earth would anyone want to use floats as keys in a dictionary, anyway?
Everything one reads on floating-point fulminates against equality
testing. Seems like extra code and extra run-time for little benefit.
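
A small Python 3 sketch of both halves of that complaint (the
dictionary names are made up): the documented 1 versus 1.0
interchangeability does hold, but a float key that comes out of
arithmetic can easily miss.

```python
numbers = {1: 'one'}

print(hash(1) == hash(1.0))   # True
print(numbers[1.0])           # one

# A computed float rarely lands exactly on the literal you expect.
key = 0.1 + 0.2
prices = {0.3: 'thirty cents'}

print(key == 0.3)             # False
print(key in prices)          # False
```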