[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

Thu Aug 3 19:03:08 CEST 2006

On Aug 3, 2006, at 9:51 AM, M.-A. Lemburg wrote:

> Ralf Schmitt wrote:
>> Ralf Schmitt wrote:
>>> Still trying to port our software. Here's another thing I noticed:
>>>
>>> d = {}
>>> d[u'm\xe1s'] = 1
>>> d['m\xe1s'] = 1
>>> print d
>>>
>>> With Python 2.4 I can add those two keys to the dictionary and get:
>>> $ python2.4 t2.py
>>> {u'm\xe1s': 1, 'm\xe1s': 1}
>>>
>>> With Python 2.5 I get:
>>>
>>> $ python2.5 t2.py
>>> Traceback (most recent call last):
>>>    File "t2.py", line 3, in <module>
>>>      d['m\xe1s'] = 1
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)
>>>
>>> Is this intended behaviour? I guess this might break lots of programs,
>>> and the way Python 2.4 works looks right to me.
>>> I think it should be possible to mix str/unicode keys in dicts and let
>>> non-ASCII strings compare not-equal to any unicode string.
>>
>> Also, this behaviour makes your programs break randomly; that is, it will
>> break when the string you add hashes to the same value that the unicode
>> string has (at least that's what I guess...)
>
> This is because Unicode and 8-bit string keys work
> the same way if and only if they are plain ASCII.
>
> The reason lies in the hash function used by Unicode: it is
> crafted to make hash(u) == hash(s) for all ASCII s, such
> that s == u.
>
> For non-ASCII strings, there are no guarantees as to the
> hash value of the strings or whether they match or not.
>
> This has been like that since Unicode was introduced, so it's
> not new in Python 2.5.

What is new is that the exception raised on "u == s" after a hash
collision is no longer silently swallowed.
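
A rough Python 3 sketch of why this only bites on a hash collision (it is
not the Python 2.5 code path itself): the dict compares keys with == only
when their hashes match, so a key whose comparison raises goes in quietly
unless it collides with an existing key. RaisingKey, try_insert and the
forced hash values are hypothetical names made up for illustration; the
ValueError stands in for the UnicodeDecodeError above.

```python
class RaisingKey:
    """Hypothetical stand-in for a key whose == comparison fails."""

    def __init__(self, forced_hash):
        self.forced_hash = forced_hash

    def __hash__(self):
        # Return a caller-chosen hash so a collision can be forced.
        return self.forced_hash

    def __eq__(self, other):
        # Stand-in for the failed implicit ASCII decode in "u == s".
        raise ValueError("comparison failed")


def try_insert(d, key):
    """Insert key into d; report whether the comparison error surfaced."""
    try:
        d[key] = 1
    except ValueError:
        return "raised"
    return "inserted"


existing = "m\xe1s"
d = {existing: 1}

same_hash = hash(existing)
# Any value that differs from same_hash (and is not -1, which is reserved).
other_hash = 1 if same_hash != 1 else 2

# Different hash: the dict never calls ==, so nothing is raised.
no_collision = try_insert(d, RaisingKey(other_hash))

# Same hash: the dict must call == to tell the keys apart, and the
# exception propagates instead of being swallowed.
collision = try_insert(d, RaisingKey(same_hash))

print(no_collision, collision)  # inserted raised
```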

-bob