[Python-Dev] unicode hell/mixing str and unicode as dictionary keys
M.-A. Lemburg
mal at egenix.com
Thu Aug 3 19:39:15 CEST 2006
Ralf Schmitt wrote:
>>>> Still trying to port our software. here's another thing I noticed:
>>>>
>>>> d = {}
>>>> d[u'm\xe1s'] = 1
>>>> d['m\xe1s'] = 1
>>>> print d
>>>>
>>>> With python 2.5 I get:
>>>>
>>>> $ python2.5 t2.py
>>>> Traceback (most recent call last):
>>>>   File "t2.py", line 3, in <module>
>>>>     d['m\xe1s'] = 1
>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1:
>>>> ordinal not in range(128)
>>>>
>> This is because Unicode and 8-bit string keys work
>> in the same way if and only if they are plain ASCII.
>
> This is okay. But in the case where one is not ASCII I would prefer to
> be able to compare them (not equal) instead of getting a UnicodeError.
> I know it's too late to change this, ...

It is too late to change this, since it was always like this ;-)

Seriously, Unicode is doing the right thing here: you should
really always get an exception if you compare apples and
oranges, rather than reverting to comparing the ids of apples
and oranges as a fall-back solution.

I believe that Py3k will implement this.

>> The reason lies in the hash function used by Unicode: it is
>> crafted to make hash(u) == hash(s) for all ASCII s, such
>> that s == u.
>>
>> For non-ASCII strings, there are no guarantees as to the
>> hash value of the strings or whether they match or not.
>>
>> This has been like that since Unicode was introduced, so it's
>> not new in Python 2.5.
>>
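
To illustrate the mechanism quoted above, here is a minimal sketch rather than the actual str/unicode implementation. It is written for current Python 3, where the two 2.x string types can no longer be mixed this way, so a hypothetical `Key` class stands in for them: both kinds hash alike, and comparing across kinds raises an error, much as the implicit ASCII decode does. The class name, the `kind` labels and the use of ValueError are all assumptions made for the illustration.

```python
class Key(object):
    """Hypothetical stand-in for a string key of one of two kinds."""

    def __init__(self, text, kind):
        self.text = text
        self.kind = kind  # "str" or "unicode"; labels only

    def __hash__(self):
        # Both kinds hash alike, so equal text gives an equal hash.
        return hash(self.text)

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        if self.kind != other.kind:
            # Stand-in for the failed implicit ASCII decode.
            raise ValueError("cannot compare keys of different kinds")
        return self.text == other.text


d = {}
d[Key(u"m\xe1s", "unicode")] = 1
try:
    # Same hash as the stored key, so the dict has to compare the two
    # keys, and the error raised by __eq__ propagates to the caller.
    d[Key(u"m\xe1s", "str")] = 1
    outcome = "stored"
except ValueError as exc:
    outcome = "raised: %s" % exc
print(outcome)
```

Two keys whose hashes differ and that land in different slots are never compared at all, which is why mixed keys can appear to work for a long time before the error shows up.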
>
> ...but in the case of dictionaries this behaviour has changed and in
> prior versions of python dictionaries did work as I expected them to.
> Now they don't.
Let's put it this way: Python 2.5 uncovered a bug in your
application that has always been there. It's better to
fix your application than to argue for covering up the bug again.
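
One possible shape for such a fix, as a sketch only: convert every key to a single text type before it reaches the dictionary. The helper name `as_text` is hypothetical, and the Latin-1 default is purely an assumption; the application has to know which encoding its byte strings really use. The code is written for Python 3, where `bytes` and `str` play the roles of the 2.x `str` and `unicode` types.

```python
def as_text(key, encoding="latin-1"):
    """Return key as text, decoding byte strings explicitly.

    The default encoding is an assumption for this example only.
    """
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


d = {}
d[as_text(u"m\xe1s")] = 1
d[as_text(b"m\xe1s")] = 2  # decodes to the same text key as above
print(d)
```

With the decode done explicitly at the boundary, both spellings end up as the same key and the dictionary never has to compare a byte string with a text string.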
> When working with unicode strings and (accidentally) mixing with str
> strings, things might seem to work until the first non-ascii string
> is given to some code and one gets that UnicodeDecodeError (e.g. when
> comparing them).
>
> If one mixes unicode strings and str strings as keys in a dictionary
> things might seem to work far longer until one tries to put in some
> non-ASCII string with the "wrong" hash value and suddenly things go boom.
> I'd rather keep the pre 2.5 behaviour.
It's actually a good preparation for Py3k where 1 == u'abc' will
(likely) also raise an exception.
--
Marc-Andre Lemburg
eGenix.com