Unicode and dictionaries

Sat Jan 16 21:43:48 EST 2010

On Jan 16, 5:38 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> On Jan 16, 3:58 pm, Steven D'Aprano <st... at REMOVE-THIS-
> cybersource.com.au> wrote:
> > On Sat, 16 Jan 2010 15:35:05 -0800, gizli wrote:
> > > Hi all,
>
> > > I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I ran
> > > into this issue yesterday and wanted  to check to see if this is a
> > > python bug. It seems that there is an inconsistency between lists and
> > > dictionaries in the way that unicode objects are handled. Take a look at
> > > the following example:
>
> > >>>> test_dict = {u'öğe':1}
> > >>>> u'öğe' in test_dict.keys()
> > > True
> > >>>> 'öğe' in test_dict.keys()
> > > True
>
> > I can't reproduce your result, at least not in 2.6.1:
>
> > >>> test_dict = {u'öğe':1}
> > >>> u'öğe' in test_dict.keys()
> > True
> > >>> 'öğe' in test_dict.keys()
>
> > __main__:1: UnicodeWarning: Unicode equal comparison failed to convert
> > both arguments to Unicode - interpreting them as being unequal
> > False
>
> The OP changed his default encoding.  I was able to confirm the
> behavior after setting the default encoding to latin-1.
>
> This is most definitely a bug in Python.

I've thought it over, and I'm no longer so sure it's a bug, but the
behavior is highly questionable.  Here is a more detailed explanation.
The following session shows the problem; my terminal is UTF-8.

Python 2.5.4 (r254:67916, Nov 19 2009, 19:46:21)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys) # get sys.setdefaultencoding back
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> u'öğe' == 'öğe'
True
>>> test_dict = {u'öğe':1}
>>> test_dict['öğe']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '\xc3\xb6\xc4\x9fe'

So the source encoding is UTF-8, and as you can see I've set the
default encoding to UTF-8 as well.  You'll notice that u'öğe' and 'öğe'
compare equal; this is entirely correct.  Given that UTF-8 is the
source encoding, the string 'öğe' is read as a byte string holding the
UTF-8 encoding of those Unicode characters.  And, given that UTF-8 is
also the default encoding, the byte string is decoded using UTF-8 for
the comparison, and so is equal to the Unicode string.
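
The implicit conversion behind that comparison can be spelled out by
hand.  Here is a minimal sketch, written so that it also runs on
Python 3 (3.3 or later, where the u prefix is accepted).  The variable
names are only illustrative, and the explicit decode() call stands in
for what Python 2 does implicitly with the default encoding.

```python
# -*- coding: utf-8 -*-
text = u'öğe'
raw = text.encode('utf-8')   # the bytes that 'öğe' holds in a UTF-8 source file

# Same bytes as in the KeyError message above.
assert raw == b'\xc3\xb6\xc4\x9fe'

# Decoding with UTF-8 gives back the original text, so the two compare equal.
assert raw.decode('utf-8') == text

# Decoding with a different codec succeeds but yields different text.
assert raw.decode('latin-1') != text
```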

Given that the two are equal, the correct behavior for dicts would be
to treat the two as the same key.  However, they don't.  In fact the
two objects don't even have the same hash code:

>>> hash(u'öğe')
1671320785
>>> hash('öğe')
-813744964

This ought to be a bug; objects that compare equal and are hashable
must have the same hash code.  However, given that it is crucially
important to be as fast as possible when calculating the hash code of
ASCII strings, I could imagine that this is deliberate.  (And if it
is, it should be documented as such; I looked briefly but did not find
it.)

I can imagine another buggy possibility as well.  test_dict['öğe'] = 2
will add a new key to the above example, but it could overwrite the
existing key if the two hash codes happen to collide, because the
objects compare equal.
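
Both effects can be reproduced without any Unicode at all, using a
class whose __eq__ and __hash__ disagree.  What follows is a minimal
sketch, not what the str and unicode types actually do internally.
The class Key and its hash_value parameter are made up for the
illustration, and the lookup results assume CPython's dict, which
compares stored hash codes before it calls __eq__.

```python
class Key(object):
    """Hypothetical key: equality looks at name, the hash is set by hand."""

    def __init__(self, name, hash_value):
        self.name = name
        self.hash_value = hash_value

    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name

    def __ne__(self, other):  # needed on Python 2, harmless on Python 3
        return not self == other

    def __hash__(self):
        return self.hash_value


first = Key('x', 1)
same_name_other_hash = Key('x', 2)
same_name_same_hash = Key('x', 1)

table = {first: 1}

# Equal objects, different hashes: lookup misses, assignment adds a second entry.
assert first == same_name_other_hash
assert same_name_other_hash not in table
table[same_name_other_hash] = 2
assert len(table) == 2

# Equal objects, equal hashes: assignment overwrites the existing entry.
table[same_name_same_hash] = 3
assert len(table) == 2
assert table[first] == 3
```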

All in all, it's a mighty mess.  The best advice is to avoid it
altogether and leave the default encoding alone.
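
In practice that means converting byte strings to Unicode explicitly,
with a named encoding, before they are used as keys.  A minimal
sketch, assuming the incoming bytes are UTF-8; to_text is a
hypothetical helper, not something from the standard library:

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as Unicode text, decoding byte strings."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2.6+
        return value.decode(encoding)
    return value


test_dict = {to_text(u'öğe'): 1}

# Byte-string and Unicode spellings now reach the same entry.
assert test_dict[to_text(b'\xc3\xb6\xc4\x9fe')] == 1
assert test_dict[to_text(u'öğe')] == 1
```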

Thankfully Python 3 does away with all this nonsense.

Carl Banks