[Python-Dev] Dicts are broken Was: unicode hell/mixing str and unicode as dictionary keys

Fri Aug 4 17:43:50 CEST 2006

Terry Reedy wrote:
> "Michael Hudson" <mwh at python.net> wrote in message 
> news:2m3bccwopj.fsf at starship.python.net...
>> Michael Chermside <mcherm at mcherm.com> writes:
>>
>>> I'm changing the subject line because I want to convince everyone that
>>> the problem being discussed in the "unicode hell" thread has nothing
>>> to do with unicode and strings. It's all about dicts.
>> I'd say it's more to do with __eq__.  It's a strange __eq__ method
>> that raises an Exception, IMHO.
> 
> I agree; a == b should always work, certainly unless explicitly programmed 
> otherwise in Python for a particular class. 

... which this is.

> So I think the proper solution 
> is fix the buggy __eq__ method to return False instead.  If a byte string 
> does not have a default (ascii) text interpretation, then it obviously is 
> not equal to any particular unicode text.
> 
> The fundamental axiom of sets and hence of dict keys is that any 
> object/value either is or is not a member (at any given time for 'mutable' 
> set collections).  This requires that testing an object for possible 
> membership by equality give a clean True or False answer.
> 
>> Please do realize that the motivation for this change was hours and
>> hours of tortuous debugging caused by a buggy __eq__ method making keys
>> "impossibly" seem to not be in dictionaries.
> 
> So why not fix the buggy __eq__ method?

Because it's not buggy.

Python simply doesn't know the encoding of the 8-bit string, so it
can't make any assumptions about it. As a result, it raises an
exception to inform the programmer.

It is quite possible that the string uses an encoding under which
it is indeed equal to the Unicode string, e.g.

>>> s = 'trärää'
>>> u = u'trärää'
>>> s == u
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 2:
ordinal not in range(128)
>>> hash(s)
673683206
>>> hash(u)
673683206

Here, the encoding that creates the match is Latin-1.
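
To make the point about unknown encodings concrete, here is a minimal
sketch. It assumes the 8-bit string above was typed on a Latin-1
terminal; the helper name equal_under is made up for illustration and
is not part of Python. The same bytes are equal to the Unicode string
under one encoding, unequal under another, and undecodable under the
rest:

```python
# The Unicode string from the session above, written with escapes.
u = u'tr\xe4r\xe4\xe4'
# Assumption: the 8-bit string holds the Latin-1 bytes of the same text.
s = u.encode('latin-1')

def equal_under(encoding, raw, text):
    """Hypothetical helper: True/False if raw decodes and matches or
    differs from text, None if raw cannot be decoded at all."""
    try:
        return raw.decode(encoding) == text
    except UnicodeDecodeError:
        return None

# Try a few candidate encodings for the same bytes.
results = dict((enc, equal_under(enc, s, u))
               for enc in ('ascii', 'latin-1', 'utf-8', 'cp437'))
```

Only the 'latin-1' entry comes out True, which is why the interpreter
cannot pick an answer on the programmer's behalf.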

>>> 2.4, fails in 2.5, and arguably ought to work fine. I think we should
>>> restore the behavior of dicts that when they compare keys for
>>> equality they suppress exceptions (treating the objects as unequal),
>>> or at LEAST retain the behavior for one more release making it a
>>> warning this time.
>> Please no.  Here's just one piece of evidence that the 2.4 semantics
>> are pretty silly too:
> 
> We mostly agreed half a decade ago that 1/2 should be .5 instead of 0, but 
> to avoid breaking code, have (or Guido has) so far refrained from making 
> the change the default.  To me, the same principle applies here at least 
> as strongly.

I think that's a different category of semantic change: the integer
division change will cause applications to return wrong data (if not
fixed properly). The exception will just let the application refuse
to continue.
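
As a rough illustration of the behaviour under discussion, the sketch
below uses a made-up key class (Touchy is hypothetical, standing in for
the str/unicode comparison) whose __eq__ always raises. With the 2.5
semantics the exception reaches the caller of the dict lookup instead
of being turned silently into a KeyError:

```python
class Touchy(object):
    """Hypothetical key type whose comparison always fails loudly."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # Same hash for every instance, so lookups must fall back to __eq__.
        return 42

    def __eq__(self, other):
        raise ValueError("cannot compare %r" % (self.value,))

d = {Touchy('a'): 1}

try:
    d[Touchy('b')]          # hash collision forces a key comparison
    outcome = 'no exception'
except ValueError:
    outcome = 'ValueError propagated'
except KeyError:
    outcome = 'KeyError (comparison error swallowed)'
```

The application stops at the lookup with a visible error rather than
carrying on with a key that "impossibly" seems to be missing.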

How about generating a warning instead, and then going for the
exception in 2.6?

-- 
Marc-Andre Lemburg
eGenix.com
