[Python-Dev] Dicts are broken Was: unicode hell/mixing str and unicode as dictionary keys

Tue Aug 8 09:25:44 CEST 2006

Martin v. Löwis wrote:
> M.-A. Lemburg schrieb:
>> Python just doesn't know the encoding of the 8-bit string, so it can't
>> make any assumptions about it. As a result, it raises an exception to
>> inform the programmer.
> 
> Oh, Python does make an assumption what the encoding is: it assumes
> it is the system encoding (i.e. "ascii"). Then invoking the ascii
> codec raises an exception, because the string clearly isn't ascii.

Right, and as a consequence, Python raises an exception to let the
programmer correct the problem.

The subsequent solution to the problem may result in the
string being decoded into Unicode and the two resulting Unicode
objects being unequal, or it may also result in them being equal.
Python doesn't have this knowledge, so always returning false
is clearly wrong.
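
To illustrate the point, here is a minimal sketch written for Python 3,
where bytes and str stand in for the 8-bit string and Unicode types
discussed in this thread. The helper try_decode and the sample values
are made up for the example. The same bytes compare equal to the text
under one encoding, unequal under another, and cannot be decoded at all
under ASCII:

```python
raw = b"M\xfcller"        # bytes whose encoding is not known in advance
text = "M\u00fcller"      # the text "Müller"


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_ascii = try_decode(raw, "ascii")      # 0xfc is not valid ASCII
as_latin1 = try_decode(raw, "latin-1")   # 0xfc is u-umlaut in Latin-1
as_cp437 = try_decode(raw, "cp437")      # 0xfc is a different character here

print(as_ascii is None)
print(as_latin1 == text)
print(as_cp437 == text)
```

Only the programmer knows which of these encodings is the right one, so
the failed ASCII decode by itself says nothing about equality.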

Hiding programmer errors is not making life easier in the
long run, so I'm -1 on having the equality comparison return
False.

Instead we should generate a warning in Python 2.5 and introduce
the exception in Python 2.6.

>> Note that you do have to interpret the string as characters
>> if you compare it to Unicode and there's nothing wrong with
>> that.
> 
> Consider this:
> py> int(3+4j)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: can't convert complex to int; use int(abs(z))
> py> 3 == 3+4j
> False
>
> So even though the conversion raises an exception, the
> values are determined to be not equal. Again, because int
> is a nearly true subset of complex, the conversion goes
> the other way, but *if* it would use the complex->int
> conversion, then the TypeError should be taken as
> a guarantee that the objects don't compare equal.

In the above example, you clearly know that the two are
unequal, because a complex number with a non-zero imaginary
part can never be equal to an integer.

The same is true for the overflow case:

>>> 2**10000 == 1.23
False
>>> float(2**10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to float

(Note that in Python 2.3 this used to raise an exception as well.)

However, this is not the case for 8-bit string vs. Unicode:
no such extra knowledge is available once you find that the ASCII
encoding assumption doesn't match the string in question.

> Expanding this view to Unicode should mean that a unicode
> string U equals a byte string B if
> U.encode(system_encoding) == B or B.decode(system_encoding) == U,
> and that they don't equal otherwise 

Agreed.

Note that Python always coerces to the "bigger" type. As a result,
the second option is what is actually implemented in Python.
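
As a rough sketch of that second option (Python 3 syntax, with bytes
standing in for the 8-bit string type; the names SYSTEM_ENCODING and
text_equals_bytes are made up for this example and are not interpreter
internals), the comparison decodes the byte string and then compares
two Unicode objects, letting any decoding error propagate:

```python
SYSTEM_ENCODING = "ascii"  # assumption: the default encoding discussed here


def text_equals_bytes(u, b, encoding=SYSTEM_ENCODING):
    """Coerce the byte string b to text, then compare it with u.

    A UnicodeDecodeError from the decode step is deliberately not caught.
    """
    return b.decode(encoding) == u


print(text_equals_bytes("abc", b"abc"))
print(text_equals_bytes("abd", b"abc"))

try:
    text_equals_bytes("M\u00fcller", b"M\xfcller")
    outcome = "compared"
except UnicodeDecodeError:
    outcome = "decode error"
print(outcome)
```

The first two calls give an ordinary True/False answer; the third one
cannot be answered without knowing the real encoding of the bytes.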

> (e.g. if the conversion
> fails with a "not convertible" exception). 

I disagree with this part.

Failure to decode a string doesn't imply inequality. It implies
that the programmer needs to step in and correct the problem by
making an explicit and conscious decision.

The alternative would be to decide that equality comparisons should never
be allowed to raise exceptions and instead have the equality comparison
return False. In which case, we'd have to revert the dict patch
altogether and instead silence all exceptions that
are generated during the equality comparison (not only in the dict
implementation), replacing them with a False return value.
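
The difference between the two policies can be sketched with a toy key
class (Python 3 syntax; the class Key, its lenient flag and the lookup
function are hypothetical and only serve the illustration). With the
lenient policy the failed decode turns into a plain KeyError, which
hides the encoding problem; with the strict policy the decoding error
reaches the programmer:

```python
class Key:
    """Hypothetical wrapper holding either text (str) or raw bytes."""

    def __init__(self, value, lenient):
        self.value = value
        self.lenient = lenient

    def __hash__(self):
        return 0  # constant hash, so the dict always has to call __eq__

    def _text(self):
        # Decode raw bytes with the assumed default encoding (ASCII).
        if isinstance(self.value, bytes):
            return self.value.decode("ascii")
        return self.value

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        try:
            return self._text() == other._text()
        except UnicodeDecodeError:
            if self.lenient:
                return False  # silence the error: "not equal"
            raise             # strict: let the programmer see it


def lookup(lenient):
    """Look up a text key in a dict that holds an undecodable bytes key."""
    table = {Key(b"M\xfcller", lenient): "found"}
    try:
        return table[Key("M\u00fcller", lenient)]
    except KeyError:
        return "KeyError"
    except UnicodeDecodeError:
        return "UnicodeDecodeError"


print(lookup(lenient=True))
print(lookup(lenient=False))
```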

-- 
Marc-Andre Lemburg
eGenix.com
