[Python-Dev] cmp(x,x)

Tue May 18 04:57:33 EDT 2004

Hello,

A minor semantic change that creeped in some time ago was an implicit
assumption that any object x should "reasonably" be expected to compare equal
to itself.  The arguments are summarized below (should this be documented,
inserted in NEWS, turned in a mini-PEP a posteriori ... ?).

The point is that comparisons now behave differently if they are issued by C
or Python code.  The expression 'x == y' will always call x.__eq__(y), because
in some settings (e.g. Numeric) the result is not just True or False but an
arbitrary object (e.g. a Numeric array of zeroes and ones).  You cannot just
say that 'x == x' should return True, because of that.  So the behavior of
PyObject_RichCompare() didn't change.

On the other hand, when C code does a comparison it uses
PyObject_RichCompareBool(), meaning it is only interested in a 1 or 0
answer.  So PyObject_RichCompareBool() is where the shortcut about comparing
an object with itself has been inserted.

The result is the following: say x has a method __cmp__() that always returns
-1.

>>> x == x             #   x.__cmp__() is called
False
>>> x < x              #   x.__cmp__() is called
True
>>> cmp(x,x)           #   x.__cmp__() is NOT called
0
>>> [x] == [x]         #   x.__cmp__() is NOT called
True

The only way to explain the semantic is that the expression 'x == x' always
call the special methods, but built-in functions are allowed to assume that
any object compares equal to itself.  In other words, C code usually checks
that objects are "identical or equal", which is equalivalent to the Python
expression 'x is y or x == y'.  For example, the equality of lists now works
as follows:

def __eq__(lst1, lst2):
    if len(lst1) != len(lst2):
        return False
    for x, y in zip(lst1, lst2):
        if not (x is y or x == y):
            return False
    else:
        return True

Should any of this be documented?

An alternative behavior would have been to leave PyObject_RichCompareBool()
alone and only insert the short-cut on specific object types' comparison
methods on a case-by-case basis.  For example, identical lists would just
compare equal, without going through the loop comparing each element.  This
would remove the surprize of x.__cmp__(x) being not always called.  The
semantics would be easier to explain too: two lists are equal if they are the
same list or if elements compare pairwise equal.  We would have (with x as
above):

>>> cmp(x,x)                # x.__cmp__(x) is called
-1
>>> [x] == [x]
False                       # because the lists contain a non-equal element
>>> lst = [x]; lst == lst
True                        # because it is the same list

Finally, whatever the final semantic is, we should make sure that existing
built-in objects behave in a consistent way.  Of course I'm thinking about
floats: on my Linux box,

>>> f = float('nan')
>>> cmp(f,f)
0                    # because f is f
>>> f == f
False                # because float.__eq__() is called

Note that as discussed below the following behavior is *expected* and in
accordance with standards:

>>> float('nan') is float('nan')
False
>>> float('nan') == float('nan')
False                             # not the same object

Unless there are serious objections I suggest to (i.e. I plan to) remove the
short-cut in PyObject_RichCompareBool() -- performance is probably not an
issue here -- and then review all built-in comparison methods and make sure
that they return "equal" for identical objects.

-+-  summary of arguments that lead to the original change (in cvs head).

The argument in favor of the change is to remove the complex and not useful
code trying to compare self-recursive structures: for example, in some setting
comparing __builtin__.__dict__ with itself would trigger recursive comparison
of __builtin__.__dict__ with itself in an endless loop.  The complex algorithm
was able to spot that.  The new semantics immediately assume
__builtin__.__dict__ to be equal to itself.  Removing the complex algorithm
means that you will now get an endless loop when comparing two *non-identical*
self-recursive structures with the same shape, which is most probably not a
problem in practice (on the contrary this implicit algorithm did hide a bug in
one of my program).

The argument _against_ used to be that e.g. it makes sense that the float
value 'nan' should be different from itself, as various standards require.  
This argument does not apply in Python: these standards are about comparing
*values*, not objects, so it makes perfect sense to say that even if x is the
result of a computation that yielded an unknown answer 'nan', this answer is
still equal to itself; what it is probably not equal to is *another* 'nan'
which was obtained differently.  In other words *two* float objects both
containing 'nan' should be different, but *one* 'nan' object is still equal to
itself.  This is sane as long as no code considers 'nan' as a singleton, or
tries to reuse 'nan' objects for different 'nan' values.

-+-

Armin