[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

Ralf Schmitt ralf at brainbot.com
Thu Aug 3 19:17:27 CEST 2006


M.-A. Lemburg wrote:
> Ralf Schmitt wrote:
>> Ralf Schmitt wrote:
>>> Still trying to port our software. here's another thing I noticed:
>>>
>>> d = {}
>>> d[u'm\xe1s'] = 1
>>> d['m\xe1s'] = 1
>>> print d
>>>
>>> With python 2.4 I can add those two keys to the dictionary and get:
>>> $ python2.4 t2.py
>>> {u'm\xe1s': 1, 'm\xe1s': 1}
>>>
>>> With python 2.5 I get:
>>>
>>> $ python2.5 t2.py
>>> Traceback (most recent call last):
>>>    File "t2.py", line 3, in <module>
>>>      d['m\xe1s'] = 1
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: 
>>> ordinal not in range(128)
>>>
>>> Is this intended behaviour? I guess this might break lots of programs 
>>> and the way python 2.4 works looks right to me.
>>> I think it should be possible to mix str/unicode keys in dicts and let 
>>> non-ascii strings compare not-equal to any unicode string.
>> Also this behaviour makes your programs break randomly, that is, it will 
>> break when the string you add hashes to the same value that the unicode 
>> string has (at least that's what I guess..)
> 
> This is because Unicode and 8-bit string keys only work
> in the same way if and only if they are plain ASCII.

This is okay. But when one of them is not ASCII, I would prefer that 
they simply compare as not equal instead of raising a UnicodeError.
I know it's too late to change this, ...

> 
> The reason lies in the hash function used by Unicode: it is
> crafted to make hash(u) == hash(s) for all ASCII s, such
> that s == u.
> 
> For non-ASCII strings, there are no guarantees as to the
> hash value of the strings or whether they match or not.
> 
> This has been like that since Unicode was introduced, so it's
> not new in Python 2.5.
> 

...but in the case of dictionaries this behaviour has changed: in 
prior versions of Python, dictionaries did work as I expected them to.
Now they don't.

When working with unicode strings and (accidentally) mixing them with 
str strings, things might seem to work until the first non-ASCII string
is given to some code and one gets that UnicodeDecodeError (e.g. when 
comparing them).
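
To make concrete what I mean by "compare not equal", here is a rough
sketch. loose_equal is a made-up helper name, not anything in Python,
and I have written it for an interpreter where the two types are
spelled str (text) and bytes (8-bit); both of those are assumptions
for illustration only.

```python
def loose_equal(a, b):
    """Compare a text string and a byte string without raising.

    Hypothetical helper: a byte string that is not pure ASCII is
    treated as "not equal" to any text string instead of triggering
    a UnicodeDecodeError.
    """
    # Arrange the pair so that a mixed comparison always has the
    # byte string in b.
    if isinstance(a, bytes) and not isinstance(b, bytes):
        a, b = b, a
    if isinstance(b, bytes) and not isinstance(a, bytes):
        try:
            b = b.decode("ascii")
        except UnicodeDecodeError:
            # Non-ASCII bytes: unequal by definition, no exception.
            return False
    return a == b


print(loose_equal(u"abc", b"abc"))
print(loose_equal(u"m\xe1s", b"m\xe1s"))
```

The plain-ASCII pair still compares equal, while the non-ASCII pair
just comes back unequal.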

If one mixes unicode strings and str strings as keys in a dictionary, 
things might seem to work far longer, until one tries to put in some 
non-ASCII string with the "wrong" hash value and suddenly things go boom.
I'd rather keep the pre-2.5 behaviour.
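
Short of that, one possible way to sidestep the problem is to never
let the two types meet in the same dictionary, by converting every key
to unicode before it goes in. A minimal sketch follows; as_text is a
hypothetical helper, and the latin-1 default is purely an assumption
that happens to fit the 'm\xe1s' example. Note that this is not the
2.4 behaviour: the two spellings end up as one key instead of two.

```python
def as_text(key, encoding="latin-1"):
    """Return key as a text string.

    Hypothetical helper; the default encoding is an assumption and
    has to match however the byte strings were actually produced.
    """
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


d = {}
d[as_text(u"m\xe1s")] = 1
d[as_text(b"m\xe1s")] = 2   # same key after decoding, so it overwrites

print(len(d))
```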

- Ralf


