[Python-Dev] Encodings
M.-A. Lemburg
mal@lemburg.com
Mon, 10 Jul 2000 12:19:21 +0200
Guido van Rossum wrote:
>
> > Instead of tossing things we should be *constructive* and come
> > up with a solution to the hash value problem, e.g. I would
> > like to make the hash value be calculated from the UTF-16
> > value in a way that is compatible with ASCII strings.
>
> I think you are proposing to drop the following rule:
>
> if a == b then hash(a) == hash(b)
>
> or also
>
> if hash(a) != hash(b) then a != b
>
> This is very fundamental for dictionaries!
The rule is fine for situations where a and b have the same
type, but you can't expect coercion to be consistent with
it.
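
To make the rule concrete, here is a minimal sketch of what happens to
dictionary lookups when equal objects hash differently. It is written
for present-day Python, not the interpreter under discussion, and the
class names BadKey and GoodKey are hypothetical, chosen only for
illustration.

```python
import itertools

_serial = itertools.count()  # hands out a distinct number per BadKey


class BadKey:
    """Hypothetical key: equal by value, but every instance hashes differently."""

    def __init__(self, value):
        self.value = value
        self._serial = next(_serial)

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

    def __hash__(self):
        # Violates the rule: a == b no longer implies hash(a) == hash(b).
        return self._serial


class GoodKey:
    """Hypothetical key: equality and hash are both derived from the value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, GoodKey) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


bad = {BadKey("x"): 1}
good = {GoodKey("x"): 1}

# An equal BadKey is not found, because the dictionary compares the
# stored hash before it ever calls __eq__.
found_bad = BadKey("x") in bad
found_good = GoodKey("x") in good
```

With the rule broken, found_bad is False even though the two keys
compare equal; with the rule kept, found_good is True.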
> Note that it is currently
> broken:
>
> >>> d = {'\200':1}
> >>> d['\200']
> 1
> >>> u'\200' == '\200'
> 1
> >>> d[u'\200']
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> KeyError: ?
> >>>
That's because hash(unicode) is currently calculated using
the UTF-8 encoding as its basis, while the comparison uses the
default encoding -- this needs to be changed, of course.
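
The mismatch can be sketched roughly as follows, again in present-day
Python. The function toy_hash is a stand-in and not the interpreter's
real string hash, all function names are hypothetical, and Latin-1 is
assumed as the default encoding purely for illustration. Bytes objects
play the role of 8-bit strings here.

```python
def toy_hash(units):
    """Toy polynomial hash over a sequence of integers (not CPython's algorithm)."""
    h = 0
    for u in units:
        h = (h * 1000003 + u) & 0xFFFFFFFF
    return h


def hash_8bit(raw):
    """Hash an 8-bit string byte by byte (iterating bytes yields integers)."""
    return toy_hash(raw)


def hash_via_utf8(text):
    """Hash a Unicode string from its UTF-8 bytes, as described above."""
    return toy_hash(text.encode("utf-8"))


def hash_via_code_units(text):
    """Hash a Unicode string from its UTF-16 code units instead."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return toy_hash(units)


# Pure ASCII: all three schemes see the same integers and agree.
ascii_agrees = (
    hash_8bit(b"abc") == hash_via_utf8("abc") == hash_via_code_units("abc")
)

# U+0080 is two bytes in UTF-8 but a single byte in Latin-1, so the
# UTF-8 based hash no longer matches the hash of the 8-bit string.
utf8_agrees = hash_via_utf8("\x80") == hash_8bit(b"\x80")

# Hashing code units keeps the two in step under the Latin-1 assumption.
units_agree = hash_via_code_units("\x80") == hash_8bit(b"\x80")
```

Under these assumptions ascii_agrees and units_agree are True while
utf8_agrees is False, which is the same shape as the failing lookup in
the session quoted above.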
> While you could fix this with a variable encoding, it would be very
> hard, probably involving converting the string to Unicode before
> taking its hash, and this would slow down the hash calculation for
> 8-bit strings considerably (and these are fundamental for the speed
> of the language!).
>
> So I am for restoring ASCII as the one and only fixed encoding. (Then
> you can fix your hash much easier!)
>
> Side note: the KeyError handling is broken. The bad key should be run
> through repr() (probably when the error is raised rather than when it
> is displayed).
Agreed.
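
For comparison, a small sketch of the behaviour being asked for, as it
can be observed in present-day Python, where the message of a KeyError
with a single argument is the repr() of the missing key. The helper
name describe_missing is hypothetical.

```python
def describe_missing(mapping, key):
    """Return the KeyError message for a failed lookup, or None if the key exists."""
    try:
        mapping[key]
    except KeyError as exc:
        # str() of a one-argument KeyError is the repr() of that argument.
        return str(exc)
    return None


message = describe_missing({}, "\x80")
shows_repr = message == repr("\x80")
```

The unprintable key then shows up in escaped form instead of as "?".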
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/