[Python-Dev] Encodings

M.-A. Lemburg mal@lemburg.com
Mon, 10 Jul 2000 12:19:21 +0200


Guido van Rossum wrote:
> 
> > Instead of tossing things we should be *constructive* and come
> > up with a solution to the hash value problem, e.g. I would
> > like to make the hash value be calculated from the UTF-16
> > value in a way that is compatible with ASCII strings.
> 
> I think you are proposing to drop the following rule:
> 
>   if a == b then hash(a) == hash(b)
> 
> or also
> 
>   if hash(a) != hash(b) then a != b
> 
> This is very fundamental for dictionaries! 

The rule is fine for situations where a and b have the same
type, but you can't expect coercion to be consistent with
it.
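
To make the point concrete, here is a minimal sketch in current
Python. The Key class is hypothetical and exists only for
illustration: it compares equal to a plain string (standing in for
coercion) but hashes a different representation, so a dictionary
lookup misses even though the == test succeeds.

```python
# Hypothetical key type, for illustration only: it compares equal to a
# plain str but computes its hash from a different representation.
class Key:
    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if isinstance(other, Key):
            return self.text == other.text
        if isinstance(other, str):
            # "Coercion": treat a plain string with the same text as equal.
            return self.text == other
        return NotImplemented

    def __hash__(self):
        # Deliberately not hash(self.text), so the rule
        # "a == b implies hash(a) == hash(b)" is broken across types.
        return hash(("Key", self.text))


d = {"abc": 1}
k = Key("abc")

equal = (k == "abc")                  # the comparison succeeds
same_hash = (hash(k) == hash("abc"))  # the hashes will almost surely differ
found = (k in d)                      # the lookup follows the hash, not ==
```

The dictionary only calls == on keys whose hashes already match, so
found tracks same_hash rather than equal.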

> Note that it is currently
> broken:
> 
>   >>> d = {'\200':1}
>   >>> d['\200']
>   1
>   >>> u'\200' == '\200'
>   1
>   >>> d[u'\200']
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in ?
>   KeyError: ?
>   >>>

That's because hash(unicode) currently gets calculated using
the UTF-8 encoding as its basis, while the compare uses the
default encoding -- this needs to be changed, of course.
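
A rough sketch of the mismatch, written for current Python where
8-bit strings are bytes objects. It assumes the default encoding in
the session quoted above behaved like Latin-1 (the quoted comparison
returning 1 suggests this, but that is an assumption); the variable
names are made up for the example.

```python
raw = b"\x80"    # the 8-bit string '\200'
text = "\x80"    # the Unicode string u'\200'

# What the comparison effectively looks at (assumed Latin-1 default):
as_latin1 = text.encode("latin-1")

# What the hash is calculated from (UTF-8):
as_utf8 = text.encode("utf-8")

compares_equal = (as_latin1 == raw)      # same byte sequence
hash_basis_matches = (as_utf8 == raw)    # two bytes versus one
```

Since the bytes fed to the hash differ from the bytes used in the
compare, the two strings can compare equal and still land in
different hash slots.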

> While you could fix this with a variable encoding, it would be very
> hard, probably involving converting the string to Unicode before
> taking its hash,
> and this would slow down the hash calculation for 8-bit strings
> considerably (and these are fundamental for the speed of the
> language!).
> 
> So I am for restoring ASCII as the one and only fixed encoding.  (Then
> you can fix your hash much easier!)
>
> Side note: the KeyError handling is broken.  The bad key should be run
> through repr() (probably when the error is raised rather than when it
> is displayed).

Agreed.
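
For reference, a small sketch of the requested behaviour as it can be
observed in current Python, where the text of a KeyError is the
repr() of the missing key. The dictionary and variable names are
only for illustration.

```python
d = {"a": 1}
missing = "\x80"

try:
    d[missing]
except KeyError as exc:
    message = str(exc)        # text shown in the traceback
    stored_key = exc.args[0]  # the original key object is kept as-is
```

Here message equals repr(missing), so a non-printable key shows up
as an escape sequence instead of a bare "?".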

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/