[Python-Dev] decoding errors when comparing strings

Wed, 26 Jul 2000 02:09:48 +0200

(revisiting an old thread on mixed string comparisons)

summary: the current interpreter throws an "ASCII decoding
error" exception if you compare 8-bit and unicode strings, and
the 8-bit string happens to contain a character in the 128-255
range.

this is not only confusing for users, it also confuses the hell
out of Python itself.  for example:

>>> a = u"ä"
>>> b = "ä"
>>> hash(a)
-283818363
>>> hash(b)
-283818363
>>> a == b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
>>> d = {}
>>> d[a] = "a"
>>> d[b] = "b"
>>> len(d)
UnicodeError: ASCII decoding error: ordinal not in range(128)

oops.
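
the mechanism can be reproduced in a current Python 3 interpreter with a
stand-in class.  this is only a sketch: `Key` and `insert_both` are
hypothetical names, and the class simply imitates two keys that hash
alike but cannot be compared.  since both keys hash to the same value,
the dictionary has to call == to decide whether they are the same key,
and that call is what raises:

```python
class Key:
    """Hypothetical stand-in: hashes by content, but refuses to compare."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return hash(self.text)

    def __eq__(self, other):
        # imitate the failing mixed comparison from the session above
        raise UnicodeError("ASCII decoding error: ordinal not in range(128)")


def insert_both():
    a, b = Key("\xe4"), Key("\xe4")
    d = {}
    d[a] = "a"      # empty dictionary: no comparison is needed
    try:
        d[b] = "b"  # same hash as a, so the dictionary must call ==
    except UnicodeError as exc:
        return len(d), str(exc)
    return len(d), None
```

in Python 3 the error surfaces at the second insertion and the
dictionary keeps a single entry, rather than showing up later at an
unrelated call as in the session above.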

:::

it's clear that we should do something about this, but it's
not entirely clear what to do.

quoting from the earlier thread:

[paul]
> As soon as you find a character out of the ASCII range in one of the
> strings, I think that you should report that the two strings are
> unequal.

[me]
> sounds reasonable -- but how do you flag "unequal" in cmp?  which
> value is "larger" if all that we know is that they're different...

[moshe]
> We can say something like "beyond the ASCII range, every unicode character
> is larger than any regular 8-bit character", and compare
> lexicographically.

[mal]
> The usual method in the Python compare logic is to revert to
> the type name for compares in case coercion fails... I think
> this is the right description in this case: decoding fails and
> thus coercion becomes impossible.
>
> PyObject_Compare() has the logic, we'd just have to reenable
> it for Unicode which currently is handled as special case to
> pass through the decoding error.
>
> Note that Unicode objects which don't coerce would then always
> compare larger than 8-bit strings ("unicode" > "string").

:::

having digested this for a week or two, I'm leaning towards
moshe's proposal.
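
one possible reading of moshe's rule, as a minimal sketch for a current
Python 3 interpreter, where `bytes` stands in for the 8-bit string and
`str` for the unicode string.  the name `mixed_cmp` and the three-level
ranking are assumptions made for illustration; neither comes from the
thread or from the interpreter source:

```python
def mixed_cmp(byte_str, uni_str):
    """Return -1, 0 or 1 when comparing an 8-bit string to a unicode string.

    Assumed ranking (one interpretation of the proposal):
      rank 0: ASCII characters, comparable across both types
      rank 1: non-ASCII 8-bit characters (128-255)
      rank 2: non-ASCII unicode characters, which always sort last
    """
    # iterating over bytes yields integers, over str yields characters
    left = [(0, b) if b < 128 else (1, b) for b in byte_str]
    right = [(0, ord(c)) if ord(c) < 128 else (2, ord(c)) for c in uni_str]
    # lists of tuples compare lexicographically, element by element
    return (left > right) - (left < right)
```

under these assumptions `mixed_cmp(b"abc", "abc")` gives 0, while
`mixed_cmp(b"\xe4", "\xe4")` gives -1: the two strings are reported as
unequal, with a well-defined order and without raising an exception.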

even if mal's proposal will give the same result in practice, I'm
not entirely happy with the idea that the actual contents of a
variable (and not just its type) should determine whether the
"last resort" type name comparison should be used.
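
to make that concern concrete, here is a rough sketch of the type-name
fallback, again for a current Python 3 interpreter with `bytes` standing
in for the 8-bit string.  `fallback_cmp` is a hypothetical helper, not
the logic in PyObject_Compare(); the type names "string" and "unicode"
are taken from mal's note:

```python
def fallback_cmp(byte_str, uni_str):
    """Compare by content when ASCII decoding works, else by type name."""
    try:
        decoded = byte_str.decode("ascii")
    except UnicodeDecodeError:
        # coercion failed: fall back to the type names,
        # so the 8-bit string always sorts first ("string" < "unicode")
        left, right = "string", "unicode"
        return (left > right) - (left < right)
    # coercion worked: ordinary content comparison
    return (decoded > uni_str) - (decoded < uni_str)
```

with this sketch `fallback_cmp(b"z", "a")` gives 1, because the contents
decide, while `fallback_cmp(b"\xe4", "a")` gives -1, because the type
names decide.  which rule applies depends on the bytes in the string,
not on its type.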

a third alternative would be to keep the exception, and make
the dictionary code exception proof.  having looked at the code,
I'm afraid this might be easier said than done...

:::

comments?

</F>