[Python-Dev] Hash values and comparing objects

M.-A. Lemburg mal@lemburg.com
Thu, 06 Jul 2000 23:53:59 +0200

Ka-Ping Yee wrote:
> On Thu, 6 Jul 2000, M.-A. Lemburg wrote:
> > Previously, Unicode used UTF-8 as basis for calculating the
> > hash value
> Right, and i was trying to suggest (in a previous message)
> that the hash value should be calculated from the actual
> Unicode character values themselves.  Then for any case where
> it's possible for an 8-bit string to be == to a Unicode
> string, they will have the same hash.  Doesn't this solve the
> problem?  Have i misunderstood?

Not really, since the default encoding doesn't necessarily
have anything to do with a Unicode subset; take one of the
many Windows code pages, for example.
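
A minimal sketch of that point, assuming Python 3, where bytes
stands in for the 8-bit string type and str for the Unicode type
(the thread itself predates this split). It only shows that byte
values and code points agree under Latin-1 but not under a Windows
code page, so hashing the raw character values cannot match in
general:

```python
# Python 3 sketch: bytes stands in for the 8-bit string type.
raw = b"\x80\xe4"
byte_values = list(raw)

# Latin-1 maps every byte to the code point with the same number,
# so it behaves like a subset of Unicode.
latin1_points = [ord(ch) for ch in raw.decode("latin-1")]

# Windows code page 1252 maps byte 0x80 to U+20AC (EURO SIGN) instead.
cp1252_points = [ord(ch) for ch in raw.decode("cp1252")]

print(byte_values == latin1_points)   # True
print(byte_values == cp1252_points)   # False
```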

> > How serious is the need for objects which compare equal to
> > have the same hash value ?
> For basic, immutable types like strings -- quite serious indeed,
> i would imagine.

What I meant was: would it do much harm if
hash(unicode) == hash(string) were only guaranteed for ASCII-only
values, even though unicode may compare equal to string in other cases?
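
To illustrate what is at stake, here is a rough Python 3 sketch.
The Text class and its toy hash are hypothetical and exist only for
this example; they mimic a string type whose hash is computed from
an encoded form while equality ignores the encoding:

```python
class Text:
    """Hypothetical string wrapper; `encoding` only affects the hash."""

    def __init__(self, value, encoding):
        self.value = value
        self.encoding = encoding

    def __eq__(self, other):
        # Equality looks at the characters only.
        return isinstance(other, Text) and self.value == other.value

    def __hash__(self):
        # Toy hash: sum of the encoded byte values.
        return sum(self.value.encode(self.encoding))


# ASCII-only text encodes identically, so the hashes agree.
ascii_hit = Text("abc", "utf-8") in {Text("abc", "latin-1"): 1}

# Non-ASCII text compares equal but hashes differently.
non_ascii_hit = Text("\xe4", "utf-8") in {Text("\xe4", "latin-1"): 1}

print(ascii_hit)       # True
print(non_ascii_hit)   # False
```

The second lookup fails although the two keys compare equal, which
is the kind of harm a dictionary user would see if the guarantee
only held for ASCII values.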

> > 2. In some locales '' == u'' is true, while in others this is
> >    not the case. If they do compare equal, the hash values
> >    must match.
> This sounds very bad.  I thought we agreed that attempting to
> compare (or add) a Unicode string and an 8-bit string containing
> non-ASCII characters (as in your example) should raise an exception.

Only if the default encoding is ASCII. If Python runs in a
different locale environment, that encoding can change, e.g.
to Latin-1 or one of the available code pages (this is to
enhance Python's compatibility with the underlying environment).
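
A small Python 3 sketch of how the outcome depends on the encoding.
The compare() helper and its default_encoding parameter are
hypothetical; they only imitate an implicit conversion of 8-bit
data before the comparison:

```python
def compare(raw, text, default_encoding):
    """Hypothetical helper: decode 8-bit data with an assumed
    default encoding, then compare it with the text."""
    return raw.decode(default_encoding) == text


raw = b"\xe4"
text = "\xe4"  # U+00E4

print(compare(raw, text, "latin-1"))  # True
print(compare(raw, text, "cp437"))    # False (0xE4 is U+03A3 there)

try:
    compare(raw, text, "ascii")
    ascii_raised = False
except UnicodeDecodeError:
    # With ASCII as the default, the comparison raises instead.
    ascii_raised = True

print(ascii_raised)  # True
```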

> Such an attempt constitutes an ambiguous request -- you haven't
> specified how to turn the 8-bit bytes into Unicode, and it's better
> to be explicit than to have the interpreter guess (and guess
> differently depending on the environment!!)

The interpreter doesn't guess; it uses the locale setting
as defined by the user.

If the programmer wants to be sure about what encoding is
actually used, he will have to be explicit about it. That's
what I would recommend for most applications, BTW. The
auto-conversion magic is mainly meant for simplifying
integration of Unicode into existing systems.
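
Being explicit could look roughly like this in Python 3 (a sketch;
the sample bytes and the assumption that they are Latin-1 are made
up for the example):

```python
# Bytes from a source assumed to be Latin-1 encoded.
raw = b"Gr\xfc\xdfe"

# Name the encoding at the boundary instead of relying on a default.
text = raw.decode("latin-1")
print(text == "Gr\xfc\xdfe")  # True

# Encode explicitly again when the data leaves the program.
roundtrip = text.encode("utf-8")
```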

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/