[Python-Dev] decoding errors when comparing strings

M.-A. Lemburg mal@lemburg.com
Wed, 26 Jul 2000 10:41:06 +0200


Fredrik Lundh wrote:
> 
> (revisiting an old thread on mixed string comparisons)
> 
> summary: the current interpreter throws an "ASCII decoding
> error" exception if you compare 8-bit and unicode strings, and
> the 8-bit string happens to contain a character in the 128-255
> range.
> 
> this is not only confusing for users, it also confuses the hell
> out of Python itself.  for example:
> 
> >>> a = u"ä"
> >>> b = "ä"
> >>> hash(a)
> -283818363
> >>> hash(b)
> -283818363
> >>> a == b
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)
> >>> d = {}
> >>> d[a] = "a"
> >>> d[b] = "b"
> >>> len(d)
> UnicodeError: ASCII decoding error: ordinal not in range(128)
> 
> oops.

This is due to the fact that the Python dictionary lookup
implementation doesn't properly handle exceptions raised
during object compares -- this should probably be changed
(e.g. the lookup procedure could clear the error just before
returning to the caller).

The Unicode implementation only makes this bug in the core
visible: before Unicode, raising exceptions during compares
was not a common thing to do.

> :::
> 
> it's clear that we should do something about this, but it's
> not entirely clear what to do.
> 
> quoting from the earlier thread:
> 
> [paul]
> > As soon as you find a character out of the ASCII range in one of the
> > strings, I think that you should report that the two strings are
> > unequal.
> 
> [me]
> > sounds reasonable -- but how do you flag "unequal" in cmp?  which
> > value is "larger" if all that we know is that they're different...
> 
> [moshe]
> > We can say something like "beyond the ASCII range, every unicode character
> > is larger than any regular 8-bit character", and compare
> > lexicographically.
> 
> [mal]
> > The usual method in the Python compare logic is to revert to
> > the type name for compares in case coercion fails... I think
> > this is the right description in this case: decoding fails and
> > thus coercion becomes impossible.
> >
> > PyObject_Compare() has the logic, we'd just have to re-enable
> > it for Unicode, which is currently handled as a special case
> > to pass through the decoding error.
> >
> > Note that Unicode objects which don't coerce would then always
> > compare larger than 8-bit strings ("unicode" > "string").
> 
> :::
> 
> having digested this for a week or two, I'm leaning towards
> moshe's proposal.
> 
> even if mal's proposal would give the same result in practice, I'm
> not entirely happy with the idea that the actual contents of a
> variable (and not just its type) should determine whether the
> "last resort" type name comparison should be used.

This has been the standard behaviour for years: if coercion
fails (for whatever reason), go ahead and fall back to the
type name compare.
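
For reference, here's roughly how that fallback already behaves for
other mixed-type compares (Python 2.x; once coercion fails, objects
are ordered by their type names as described above -- untested
snippet, shown for illustration only):

    >>> [] < "abc"       # no coercion possible: 'list' < 'str'
    1
    >>> {} < []          # 'dict' < 'list'
    1
    >>> "abc" < ()       # 'str' < 'tuple'
    1

A non-coercible Unicode object would simply join this scheme:
'str' < 'unicode', so it always compares larger than any 8-bit
string, regardless of the characters involved.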
 
> a third alternative would be to keep the exception, and make
> the dictionary code exception-proof.  having looked at the code,
> I'm afraid this might be easier said than done...

Right. Plus we'd have to be *very* careful about not introducing
a performance problem here instead (after all, dict lookups
are at the heart of what makes Python so cool).

Note that we should look into this independently of the
Unicode discussion: user code may very well raise exceptions
during compares too, and the result would be the same kind of
lingering exception state as in the example above.
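
Here's a rough, hypothetical sketch of that situation without any
Unicode in sight (untested; the class is made up for illustration,
and the exact statement where the stale exception surfaces may vary):

    class Cranky:
        """All instances hash alike but refuse to be compared."""
        def __hash__(self):
            return 42          # force a hash collision in the dict
        def __cmp__(self, other):
            raise ValueError("refusing to compare")

    d = {}
    d[Cranky()] = 1
    d[Cranky()] = 2    # collision -> __cmp__ raises inside the lookup;
                       # the error is swallowed but left pending
    print len(d)       # a later, unrelated statement may now report
                       # the stale ValueError

The fix is the same in both cases: either have the lookup code clear
(or properly propagate) the error, or make sure compares of this kind
don't raise in the first place.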
 
> :::
> 
> comments?

I'd say: mask the coercion error during compare in the
standard way and remove the special casing for Unicode
in PyObject_Compare().

Then, as a second step: rethink coercion altogether and possibly
fix the situation in the compare operator of either strings
or Unicode.
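
With both changes in place, the example at the top of this thread
should presumably come out like this (expected behaviour only, not
tested against a patched interpreter):

    >>> a = u"ä"
    >>> b = "ä"
    >>> a == b          # coercion fails; fall back to type name compare
    0
    >>> d = {}
    >>> d[a] = "a"
    >>> d[b] = "b"
    >>> len(d)          # two distinct keys, no lingering exception
    2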

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/