[ python-Bugs-1564763 ] Unicode comparison change in 2.4 vs. 2.5

Tue Sep 26 13:13:29 CEST 2006

Bugs item #1564763, was opened at 2006-09-25 01:43
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1564763&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: Joe Wreschnig (piman)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Unicode comparison change in 2.4 vs. 2.5

Initial Comment:
Python 2.5 changed the behavior of unicode comparisons
in a significant way from Python 2.4, causing a test
case failure in a module of mine. All tests passed with
an earlier version of 2.5, though unfortunately I don't
know what version in particular it started failing with.

The following code prints out all True on Python 2.4;
the strings are compared case-insensitively, whether
they are my lowerstr class, real strs, or unicodes. On
Python 2.5, the comparison between lowerstr and unicode
is false, but only in one direction.

If I make lowerstr inherit from unicode rather than
str, all comparisons are true again. So at the very
least, this is internally inconsistent. I also think
changing the behavior between 2.4 and 2.5 constitutes a
serious bug.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2006-09-26 13:13

Message:
Logged In: YES 
user_id=38388

In any case, the introduction of the Unicode tp_richcompare
slot is the likely cause of this behavior:

$ python2.5 lowerstr.py
u'baR' == l'Bar'?       False
$ python2.4 lowerstr.py
u'baR' == l'Bar'?       True

Note that in both Python 2.4 and 2.5, the lowerstr.__eq__()
method is not even called. This is probably due to the fact
that Unicode can compare itself to strings, so the
w.__eq__(v) part of the rich comparison is never tried.

Now, the Unicode .__eq__() converts the string to Unicode,
so the right hand side becomes u'Bar' in both cases.

I guess a debugger session is due...

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-09-26 12:55

Message:
Logged In: YES 
user_id=38388

Ah, wrong track: Py_TPFLAGS_HAVE_RICHCOMPARE is set via
Py_TPFLAGS_DEFAULT.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-09-26 12:39

Message:
Logged In: YES 
user_id=38388

Armin, is it possible that the missing
Py_TPFLAGS_HAVE_RICHCOMPARE type flag in the Unicode type is
causing this?

I just had a look at the code and it appears that the
comparison code checks the flag rather than just looking at
the slot itself (didn't even know there was such a type flag).

----------------------------------------------------------------------

Comment By: Armin Rigo (arigo)
Date: 2006-09-25 23:33

Message:
Logged In: YES 
user_id=4771

Sorry, I missed your comment: if lowerstr inherits from
unicode then it just works.  The reason is that
'abc'.__eq__(u'abc') returns NotImplemented, but
u'abc'.__eq__('abc') returns True.

This is only inconsistent because of the asymmetry between
strings and unicodes: strings can be transparently turned
into unicodes but not the other way around -- so
unicode.__eq__(x) can accept a string as the argument x
and convert it to a unicode transparently, but str.__eq__(x)
does not try to convert x to a string if it is a unicode.

It's not a completely convincing explanation, but I think it
shows at least how we arrived at the current situation in Python
2.5.
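
The asymmetry can be sketched with two plain classes.  This is
only an illustration under stated assumptions: Narrow and Wide
are hypothetical stand-ins for str and unicode, and the calls
list exists purely to record which __eq__ methods run.  Narrow
refuses to compare itself to a Wide by returning NotImplemented,
while Wide accepts a Narrow.

```python
calls = []  # records which __eq__ methods run, in order


class Wide(object):
    """Hypothetical stand-in for unicode: accepts a Narrow operand."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        calls.append("Wide.__eq__")
        if isinstance(other, (Wide, Narrow)):
            return self.value == other.value
        return NotImplemented


class Narrow(object):
    """Hypothetical stand-in for str: only compares to other Narrows."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        calls.append("Narrow.__eq__")
        if isinstance(other, Narrow):
            return self.value == other.value
        # Decline, so the interpreter tries the other operand's __eq__.
        return NotImplemented


def compare(a, b):
    """Return (a == b, list of __eq__ methods that were called)."""
    del calls[:]
    result = a == b
    return result, list(calls)
```

With these definitions, compare(Narrow("abc"), Wide("abc"))
reports that Narrow.__eq__ ran first and declined, after which
Wide.__eq__ produced True; compare(Wide("abc"), Narrow("abc"))
never reaches Narrow.__eq__ at all.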

----------------------------------------------------------------------

Comment By: Armin Rigo (arigo)
Date: 2006-09-25 23:11

Message:
Logged In: YES 
user_id=4771

This is an artifact of the change in the unicode class, which
now has the proper __eq__, __ne__, __lt__, etc. methods
instead of the semi-deprecated __cmp__.  The mixture of
__cmp__ and the other methods is not very well-defined.  This
is why your code worked in 2.4: a bit by chance.

Indeed, in theory it should not, according to the language
reference.  So what I am saying is that although it is a
behavior change from 2.4 to 2.5, I would argue that it is not
a bug but a bug fix...

The reason is that if we ignore the __eq__ vs __cmp__ issues,
the operation 'a == b' is defined as: Python tries
a.__eq__(b); if this returns NotImplemented, then Python
tries b.__eq__(a).  As an exception, if type(b) is a strict
subclass of type(a), then Python tries in the other order. 
This is why you get the 2.5 behavior: if lowerstr inherits
from str, it is not a subclass of unicode, so u'abc' ==
lowerstr() tries u'abc'.__eq__(), which works immediately. 
On the other hand, if lowerstr inherits from unicode, then
Python tries first lowerstr().__eq__(u'abc').
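
The ordering rule can be sketched with plain classes.  This is
not the code from the report, which is not reproduced in this
thread: Text, LowerUnrelated and LowerText are hypothetical
stand-ins for unicode, a lowerstr derived from str, and a
lowerstr derived from unicode, and the calls list only records
which __eq__ actually runs.

```python
calls = []  # records which __eq__ methods run, in order


class Text(object):
    """Hypothetical stand-in for unicode: case-sensitive comparison."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        calls.append("Text.__eq__")
        if hasattr(other, "value"):
            return self.value == other.value
        return NotImplemented


class LowerUnrelated(object):
    """Case-insensitive, but NOT a subclass of Text."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        calls.append("LowerUnrelated.__eq__")
        return self.value.lower() == other.value.lower()


class LowerText(Text):
    """Case-insensitive and a strict subclass of Text."""

    def __eq__(self, other):
        calls.append("LowerText.__eq__")
        return self.value.lower() == other.value.lower()


def compare(a, b):
    """Return (a == b, list of __eq__ methods that were called)."""
    del calls[:]
    result = a == b
    return result, list(calls)
```

Here compare(Text("baR"), LowerUnrelated("Bar")) gives False
after calling only Text.__eq__, because the left operand answers
immediately.  compare(Text("baR"), LowerText("Bar")) gives True
after calling only LowerText.__eq__, because the right operand
is a strict subclass of the left operand's type and is therefore
tried first.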

This part of the Python object model - when to reverse the
order or not - is a bit obscure and not completely helpful...
Subclassing built-in types generally only works up to a point.  In
your situation you should use a regular class that behaves in
a string-like fashion, with an __eq__() method doing the
case-insensitive comparison... if you can at all - there are
places where you need a real string, so this "solution" might
not be one either, but I don't see a better one :-(
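
A minimal sketch of such a regular class, assuming a wrapper is
acceptable where no real string is required.  The name
CaseInsensitive and its value attribute are hypothetical.
Because the class does not derive from str or unicode, a string
on the left-hand side cannot compare itself to it and returns
NotImplemented, so the wrapper's own __eq__() gets its turn.

```python
class CaseInsensitive(object):
    """Hypothetical string-like wrapper; not a str/unicode subclass."""

    def __init__(self, value):
        self.value = value

    def _key(self, other):
        # Accept another wrapper or anything offering lower();
        # return None for operands we cannot compare against.
        if isinstance(other, CaseInsensitive):
            return other.value.lower()
        if hasattr(other, "lower"):
            return other.lower()
        return None

    def __eq__(self, other):
        key = self._key(other)
        if key is None:
            return NotImplemented
        return self.value.lower() == key

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result

    def __hash__(self):
        # Keep hashing consistent with case-insensitive equality.
        return hash(self.value.lower())
```

Both CaseInsensitive("Bar") == "baR" and "baR" ==
CaseInsensitive("Bar") evaluate to True with this sketch.  It
does not help in places that insist on a real string.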

----------------------------------------------------------------------
