Rich Comparisons Gotcha

Rasmus Fogh rhf22 at mole.bio.cam.ac.uk
Mon Dec 8 09:24:59 EST 2008


Robert Kern wrote:
>James Stroud wrote:
>> Steven D'Aprano wrote:
>>> On Sun, 07 Dec 2008 13:57:54 -0800, James Stroud wrote:

>>>> Rasmus Fogh wrote:

>>>>>>>> ll1 = [y,1]
>>>>>>>> y in ll1
>>>>> True
>>>>>>>> ll2 = [1,y]
>>>>>>>> y in ll2
>>>>> Traceback (most recent call last):
>>>>>   File "<stdin>", line 1, in <module>
>>>>> ValueError: The truth value of an array with more than one element is
>>>>> ambiguous. Use a.any() or a.all()
>>>> I think you could be safe calling this a bug with numpy.

>>> Only in the sense that there are special cases where the array
>>> elements are all true, or all false, and numpy *could* safely return a
>>> bool. But special cases are not special enough to break the rules.
>>> Better for the numpy caller to write this:

>>> a.all() # or any()

>>> instead of:

>>> try:
>>>     bool(a)
>>> except ValueError:
>>>     a.all()

>>> as they would need to do if numpy sometimes returned a bool and
>>> sometimes raised an exception.

>> I'm missing how a.all() solves the problem Rasmus describes, namely that
>> the order of a python *list* affects the results of containment tests by
>> numpy.array. E.g. "y in ll1" and "y in ll2" evaluate to different
>> results in his example. It still seems like a bug in numpy to me, even
>> if too much other stuff is broken if you fix it (in which case it
>> apparently becomes an "issue").

> It's an issue, if anything, not a bug. There is no consistent
> implementation of bool(some_array) that works in all cases. numpy's
> predecessor Numeric used to implement this as returning True if at least
> one element was non-zero. This works well for bool(x!=y) (which is
> equivalent to (x!=y).any()) but does not work well for bool(x==y) (which
> should be (x==y).all()), but many people got confused and thought that
> bool(x==y) worked. When we made numpy, we decided to explicitly not allow
> bool(some_array) so that people will not write buggy code like this
> again.

You are so right, Robert:

> The deficiency is in the feature of rich comparisons, not numpy's
> implementation of it. __eq__() is allowed to return non-booleans;
> however, there are some parts of Python's implementation like
> list.__contains__() that still expect the return value of __eq__() to be
> meaningfully cast to a boolean.

One might argue whether this is a deficiency in rich comparisons or rather a
bug in list, set and dict. Certainly numpy is following the rules; in fact,
numpy should be applauded for raising an error rather than returning a
misleading value.
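
To make the distinction Robert describes concrete, here is a minimal numpy
session (assuming numpy is installed; the arrays are of course just examples):

import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2, 4])

# the rich comparison returns an elementwise array, not a single bool
print(x == y)            # [ True  True False]

# "all elements equal?" and "any element different?" are different questions
print((x == y).all())    # False
print((x != y).any())    # True

# numpy refuses to guess which of the two bool() should mean
try:
    bool(x == y)
except ValueError as exc:
    print(exc)           # The truth value of an array ... is ambiguous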

For my personal problem I could indeed wrap all objects in a wrapper with
whatever 'correct' behaviour I want (thanks, TJR). It does seem a bit
much, though, just to get code like this to work as intended:
  alist.append(x)
  print ('x is present: ', x in alist)
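
For the record, this is roughly the kind of wrapper I mean - a sketch only;
the class name is mine and np.array_equal is just one plausible choice of
value test:

import numpy as np

class EqWrapper(object):
    """Hypothetical wrapper that forces __eq__ to return a plain bool."""

    def __init__(self, obj):
        self.obj = obj

    def __eq__(self, other):
        if isinstance(other, EqWrapper):
            other = other.obj
        if isinstance(self.obj, np.ndarray) or isinstance(other, np.ndarray):
            # collapse the elementwise comparison to a single bool
            return bool(np.array_equal(self.obj, other))
        return bool(self.obj == other)

    def __ne__(self, other):
        return not self.__eq__(other)

y = np.zeros(3)
alist = [EqWrapper(1), EqWrapper(y)]
print(EqWrapper(y) in alist)    # True, whatever the order of the list

It works, but every object going into the collection has to be wrapped and
unwrapped again, which is what I meant by 'a bit much'.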

So, I would much prefer a language change. I am not competent to even
propose one properly, but I'll try.

First, to clear the air:
Rich comparisons, the ability to overload '==', and the constraints (or
lack of them) on __eq__ must stay unchanged. There are reasons for their
current behaviour - IEEE 754 is particularly convincing - and anyway they
are not going to change. No point in trying.

There remains the problem that __eq__ is used inside Python 'collections'
(list, set, dict etc.), and that the kind of overloading used (quite
legitimately) in numpy etc. breaks the collection behaviour. It seems that
proper behaviour of the collections requires an equality test that
satisfies:
1) x equal x
2) x equal y => y equal x
3) x equal y and y equal z => x equal z
4) (x equal y) is a boolean
5) (x equal y) is defined (and will not throw an error) for all x,y
6) x unequal y == not(x equal y) (by definition)

Note to TJR: 5) does not mean that Python should magically shield me from
errors. All I am asking is that programmers design their equal() function
to avoid raising errors, and that errors raised from equal() clearly
count as bugs.
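
Purely to spell out what 1)-5) demand (6 is definitional), here is a small
spot-check helper - my own illustration, not part of the proposal itself:

import itertools

def check_equal_properties(equal, samples):
    # 1) reflexive
    for x in samples:
        assert equal(x, x) is True
    # 2), 4), 5): symmetric, boolean, and defined without raising
    for x, y in itertools.product(samples, repeat=2):
        r = equal(x, y)
        assert isinstance(r, bool)
        assert r == equal(y, x)
    # 3) transitive
    for x, y, z in itertools.product(samples, repeat=3):
        if equal(x, y) and equal(y, z):
            assert equal(x, z)

# plain identity, for instance, passes on any sample
check_equal_properties(lambda a, b: a is b, [1, 'a', [1, 2], float('NaN')])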

I cannot imagine getting the collections to work in a simple and intuitive
manner without an equality test that satisfies 1)-6). Maybe somebody else
can. Instead I would propose adding an __equal__ special method for the
purpose.

It looks like the current collections use the following test, at least in part:

def oldCollectionTest(x, y):
    # identity is checked first, then the rich comparison result
    if x is y:
        return True
    else:
        return x == y

I would propose adding a new __equal__ method that satisfies 2) - 6)
above.

We could then define

def newCollectionTest(x, y):
    if x is y:
        # the identity check takes care of satisfying 1)
        return True
    elif hasattr(x, '__equal__'):
        return x.__equal__(y)
    elif hasattr(y, '__equal__'):
        return y.__equal__(x)
    else:
        # neither side defines value semantics: fall back to identity
        return False

The implementations for list, set and dict would then behave according to
newCollectionTest. We would also want an equal() built-in with the same
behaviour.
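
A sketch of what that built-in, and a membership test based on it, might
look like - equal() and proposed_contains() are of course hypothetical
names, and numpy appears only to revisit the original example:

import numpy as np

def equal(x, y):
    """Proposed built-in: identity first, then __equal__ if either side has one."""
    if x is y:
        return True
    elif hasattr(x, '__equal__'):
        return x.__equal__(y)
    elif hasattr(y, '__equal__'):
        return y.__equal__(x)
    else:
        return False

def proposed_contains(collection, item):
    """Roughly what list.__contains__ would amount to under the proposal."""
    return any(equal(element, item) for element in collection)

y = np.zeros(3)
print(proposed_contains([y, 1], y))   # True
print(proposed_contains([1, y], y))   # True - order no longer matters, no error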

In plain words, the default behaviour would be identity semantics. Objects
that wanted value semantics could implement an __equal__ function with the
correct behaviour. Wherever possible __equal__ would be the same as
__eq__.  This function may deviate from 'proper' behaviour in some cases.
All I claim for it is that it makes collections work as intended, and that
it is clear, explicit, and reasonably intuitive.

Backwards compatibility should not be a big problem. The only behaviour
change would be that list, set, and dict would now behave the way they were
always assumed to - and the way the documentation says they
should. On the minus side there would be the difference between
'__equal__' and '__eq__' to confuse people. On the plus side the behaviour
of objects inside collections would now be explicitly defined, and __eq__
and __equal__ would be so similar that most people could ignore the
distinction.

Some examples:

# NaN:
# For floats, __equal__ would be the same as __eq__. For NaN this gives
>>> x = float('NaN')
>>> y = float('NaN')
>>> x == x
False
>>> equal(x,x)
True
>>> equal(x,y)
False
# It may be problematical mathematically, but computationally it makes
# perfect sense that looking in a given storage location will give you the
# same value every time, even if the actual value happens to be undefined.
# The behaviour is simple to describe, and indeed NaN does behave this way
# in collections at the moment. All we are doing is documenting it clearly.

# numpy:
Numpy arrays would have no __equal__ method, so we would get pure identity
semantics - 'equal(x,y)' would be the same as 'x is y'.
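
Concretely, since arrays would define no __equal__, the proposed test
reduces to plain identity for them. The following is runnable today, with
'is' standing in for the hypothetical equal():

import numpy as np

x = np.zeros(3)
y = np.zeros(3)       # equal-valued, but a distinct object

print(x is x)         # True  - what equal(x, x) would return
print(x is y)         # False - equal-valued but distinct arrays would not match

That last line is the flip side of identity semantics: an equal-valued array
stored elsewhere would not be found by 'in'.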

# Ordinary numbers:
Any Python object with value semantics would need an __equal__ function
with the correct behaviour.
Mark Dickinson pointed out the thread "Comparing float and decimal", which
shows that comparisons between float and decimal numbers do not currently
satisfy 3). It would not be attractive to have __equal__ and __eq__ behave
differently for ordinary numbers, so if the relevant __eq__ cannot be
fixed, that is a problem for my proposal.
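
For reference, the kind of failure that thread describes looks like this
(the False result is the behaviour of the Python versions current at the
time, roughly 2.5/2.6; later versions changed Decimal/float comparison):

from decimal import Decimal

a, b, c = 2.0, 2, Decimal(2)

print(a == b)   # True
print(b == c)   # True
print(a == c)   # False at the time - so 3) (transitivity) fails for ==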

At this point I shall try to retire gracefully. Regrettably I am not
competent to discuss whether this can be done, how it could be done, or how
much work would be required.

Rasmus

---------------------------------------------------------------------------
Dr. Rasmus H. Fogh                  Email: r.h.fogh at bioc.cam.ac.uk
Dept. of Biochemistry, University of Cambridge,
80 Tennis Court Road, Cambridge CB2 1GA, UK.     FAX (01223)766002


