[Python-Dev] Unicode and comparisons

Guido van Rossum guido@python.org
Tue, 04 Apr 2000 07:51:42 -0400


> Fredrik's bug report made me dive a little deeper into compares
> and contains tests.
> 
> Here is a snapshot of what my current version does:
> 
> >>> '1' == None
> 0
> >>> u'1' == None
> 0
> >>> '1' == 'aäöü'
> 0
> >>> u'1' == 'aäöü'
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: invalid data
> 
> >>> '1' in ('a', None, 1)
> 0
> >>> u'1' in ('a', None, 1)
> 0
> >>> '1' in (u'aäöü', None, 1)
> 0
> >>> u'1' in ('aäöü', None, 1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: invalid data
> 
> The decoding errors occur because 'aäöü' is not a valid
> UTF-8 string (Unicode comparisons coerce both arguments
> to Unicode by interpreting normal strings as UTF-8
> encodings of Unicode).
> 
> Question: is this behaviour acceptable or should I go
> even further and mask decoding errors during compares
> and contains tests too?

I think this is right -- I expect it will catch more errors than it
will cause.
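
As a rough illustration of why the error is raised (a sketch in
present-day Python 3, not the code under discussion): the bytes below
are assumed to be the Latin-1 encoding of the non-ASCII string in the
quoted session, and compare_as_utf8 is a hypothetical helper that
mimics "interpret the 8-bit string as UTF-8, then compare".

```python
# Assumed Latin-1 bytes for the non-ASCII string in the quoted session.
RAW = b'a\xe4\xf6\xfc'


def compare_as_utf8(text, data):
    """Hypothetical helper: decode 8-bit data as UTF-8, then compare."""
    return text == data.decode('utf-8')


# Pure ASCII data is valid UTF-8, so the comparison just works.
ascii_result = compare_as_utf8('1', b'1')

# The Latin-1 bytes are not a valid UTF-8 sequence, so decoding fails
# before any comparison happens.
try:
    compare_as_utf8('1', RAW)
    outcome = 'no error'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

print(ascii_result, outcome)
```

Letting the exception propagate, rather than silently answering
"not equal", is what exposes the mismatched encoding to the caller.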

This made me go out and see what happens if you compare an instance of
a numeric class (one that defines __int__) to an int -- it doesn't
even call the __int__ method!  This should be fixed in 1.7 when we do
the smart comparisons and rich coercions (or was it the other way
around? :-).
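
A small sketch of the effect, written for present-day Python 3 rather
than the interpreter discussed here, so the details differ from 1.x.
Num and RichNum are hypothetical classes made up for the example; the
point is only that == never consults __int__, while an explicit __eq__
(the rich comparison hook that later Python versions provide) does get
called.

```python
class Num:
    """Hypothetical numeric class that only defines __int__."""

    def __init__(self, value):
        self.value = value
        self.int_calls = 0

    def __int__(self):
        self.int_calls += 1
        return self.value


class RichNum(Num):
    """Same class, plus an explicit rich comparison."""

    def __eq__(self, other):
        try:
            return self.value == int(other)
        except (TypeError, ValueError):
            return NotImplemented


n = Num(3)
plain_equal = (n == 3)            # comparison does not call __int__
calls_after_compare = n.int_calls
explicit_equal = (int(n) == 3)    # explicit conversion does call it

rich_equal = (RichNum(3) == 3)    # __eq__ is consulted by ==

print(plain_equal, calls_after_compare, explicit_equal, rich_equal)
```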

--Guido van Rossum (home page: http://www.python.org/~guido/)