[Python-Dev] Dicts are broken Was: unicode hell/mixing str and unicode as dictionary keys

"Martin v. Löwis" martin at v.loewis.de
Tue Aug 8 09:56:53 CEST 2006


M.-A. Lemburg wrote:
> Hiding programmer errors is not making life easier in the
> long run, so I'm -1 on having the equality comparison return
> False.

There is no error to hide here. The objects are unequal, period.

> Instead we should generate a warning in Python 2.5 and introduce
> the exception in Python 2.6.

A warning about what? That you can't put byte strings and Unicode
strings into the same dictionary (as keys)? Will we next disallow
putting numbers and strings into the same dictionary, because there
is no conversion defined between them?
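
A minimal sketch of that point, assuming Python 3 (where the byte string
and text types are bytes and str); the dictionary and its values are made
up for illustration. Keys of types with no defined conversion between
them simply compare unequal and live side by side:

```python
# Keys of unrelated types compare unequal, so each one gets its own entry.
d = {}
d[3] = "integer key"
d["3"] = "text key"
d[b"3"] = "byte-string key"

# Three distinct keys, and each lookup finds only its own value.
print(len(d))
print(d[3], d["3"], d[b"3"], sep=" | ")
```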

> In the above example, you clearly know that the two are
> unequal due to the relationship between complex numbers
> having an imaginary part and integers..

Right. And I know the same when the byte string does not convert to
Unicode.

> However, this is not the case for 8-bit string vs. Unicode,
> since you cannot use such extra knowledge if you find that ASCII
> encoding assumption obviously doesn't match the string
> in question.

The question is not "Could there be a conversion under which
they are equal?" If that were the question, then

py> "3"==3
False

should raise an exception, because there exists a conversion under
which these objects are equal:

py> int("3")==3
True

It's just that, under the conversion Python applies, the byte
string and the Unicode string are not equal.
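
To make "the conversion Python applies" concrete, here is a small sketch,
assuming Python 3 syntax and an arbitrarily chosen sample byte (0xE9).
The helper name equal_under is hypothetical. Whether the two objects come
out equal depends entirely on which conversion is used; Python 2 applied
the default encoding, which was ASCII:

```python
raw = b"\xe9"      # one byte, value 0xE9
text = "\u00e9"    # the character U+00E9 (e with acute accent)

def equal_under(encoding):
    """Compare raw and text after decoding raw with the given encoding."""
    try:
        return raw.decode(encoding) == text
    except UnicodeDecodeError:
        # The bytes have no character interpretation under this encoding.
        return False

results = {enc: equal_under(enc) for enc in ("latin-1", "ascii", "utf-8")}
print(results)
```

Only the Latin-1 conversion makes the two equal; under ASCII and UTF-8
the byte does not decode at all.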

> Note that Python always coerces to the "bigger" type. As a result,
> the second option is what is actually implemented in Python.
[which is decode-to-unicode]

It might be debatable which of the types is the "bigger" type. It's
not that byte strings are a true subset of Unicode strings, under
some conversion, since there are byte strings which have no Unicode
equivalent (because they are not characters, and don't convert under
the encoding), and there are Unicode strings that have no byte string
equivalent.

For example, if the system encoding is UTF-8, then byte string is
the bigger type (all Unicode strings convert to byte strings, but
not all byte strings convert to Unicode strings).
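
A rough illustration of the UTF-8 case, assuming Python 3; the sample
strings and the helper name decodes_as_utf8 are made up for this sketch,
and a handful of samples is of course not a proof:

```python
# Sample Unicode strings all convert to UTF-8 byte strings and back.
samples = ["abc", "\u00e9", "\u20ac", "\U0001f600"]
round_trips = [s.encode("utf-8").decode("utf-8") == s for s in samples]

def decodes_as_utf8(data):
    """Return True if the byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Some byte strings have no Unicode equivalent under UTF-8.
print(round_trips)
print(decodes_as_utf8(b"abc"), decodes_as_utf8(b"\xff"), decodes_as_utf8(b"\xc3"))
```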

However, this is a red herring: Python has, for whatever reason,
chosen to convert byte->unicode, and nobody is questioning that
choice.

> I disagree with this part.
> 
> Failure to decode a string doesn't imply inequality.

If the failure is "these bytes don't have a meaningful character
interpretation", then the bytes are *clearly* not equal to
some character string.

> It implies
> that the programmer needs to step in and correct the problem by
> making an explicit and conscious decision.

There is no problem to correct. The strings *are* unequal.

> The alternative would be to decide that equal comparisons should never
> be allowed to raise exceptions and instead have the equal comparison
> return False.

There are many reasons why comparison could raise an exception.
It could be out of memory, it could be that there is an
internal/programming error in the codec being used, it could be
that the codec is not found (likewise for other comparisons).

However, if the codec is working properly, and clearly determines
that the byte string has no character string equivalent, then
it can't be equal to some character (unicode) string.
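
The distinction can be sketched as follows, assuming Python 3; the
function bytes_equal_text and its encoding parameter are hypothetical and
only model the behaviour argued for here, not what any Python release
actually does. A decode failure reported by a working codec means "not
equal", while other failures, such as a missing codec, still propagate:

```python
def bytes_equal_text(data, text, encoding="ascii"):
    """Compare a byte string with a text string under one conversion."""
    try:
        decoded = data.decode(encoding)
    except UnicodeDecodeError:
        # The codec worked and found no character equivalent: unequal.
        return False
    # LookupError (unknown codec), MemoryError and the like are not
    # caught here, so they still reach the caller.
    return decoded == text

print(bytes_equal_text(b"abc", "abc"))
print(bytes_equal_text(b"\xff", "abc"))

try:
    bytes_equal_text(b"abc", "abc", encoding="no-such-codec")
except LookupError:
    codec_missing = True
else:
    codec_missing = False
print(codec_missing)
```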

Regards,
Martin

