[Python-Dev] PyObject_RichCompareBool identity shortcut

Thu Apr 28 18:55:37 CEST 2011

[This is a mega-reply, combining responses to several messages in this
thread. I may be repeating myself a bit, but I think I am being
consistent. :-)]

On Wed, Apr 27, 2011 at 10:12 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Thu, Apr 28, 2011 at 2:54 PM, Guido van Rossum <guido at python.org> wrote:
>>> Well, I didn't say that. If Python changed its behavior for (float('nan') ==
>>> float('nan')), we'd have to seriously consider some changes.
>>
>> Ah, but I'm not proposing anything of the sort! float('nan') returns a
>> new object each time and two NaNs that are not the same *object* will
>> still follow the IEEE std. It's just when comparing a NaN-valued
>> *object* to *itself* (i.e. the *same* object) that I would consider
>> following the lead of Python's collections.
>
> The reason this possibility bothers me is that it doesn't mesh well
> with the "implementations are free to cache and reuse immutable
> objects" rule. Although, if the updated NaN semantics were explicit
> that identity was now considered part of the value of NaN objects
> (thus ruling out caching them at the implementation layer), I guess
> that objection would go away.

The rules for float could be expanded to disallow NaN caching.

But even if we didn't change any rules, reusing immutable objects
could currently make computations undefined, because container
comparisons use the "identity wins" rule. E.g. if we didn't change the
rule for nan==nan, but we did change float("nan") to always return a
specific singleton, comparisons like [float("nan")] == [float("nan")]
would change in outcome. (Note that not all NaNs could be the same
object, since there are multiple bit patterns meaning NaN; IIUC this
is different from Inf.)
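
For illustration, here is a small sketch of how the "identity wins"
rule shows up in current CPython. The variable names are made up for
the example; the only assumption is that float("nan") returns a fresh
object on each call, as it does today.

```python
nan = float("nan")

# A bare comparison follows IEEE 754: NaN never equals itself.
scalar_eq = (nan == nan)

# List comparison checks identity before calling ==, so the
# same object on both sides makes the lists compare equal.
same_object_eq = ([nan] == [nan])

# Two separately created NaN objects are distinct, so here the
# comparison falls through to == and the lists differ.
distinct_objects_eq = ([float("nan")] == [float("nan")])

# Membership tests take the same shortcut.
contains = (nan in [nan])

print(scalar_eq, same_object_eq, distinct_objects_eq, contains)
# prints: False True False True
```

If float("nan") ever returned a singleton, the third result would
flip to True without any change to the nan==nan rule.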

All this makes me realize that there would be another issue, one that
I wouldn't know how to deal with: a JITting interpreter could
translate code involving floats into machine code, at which point
object identity would be lost (presumably the machine code would use
IEEE value semantics for NaN).

This also reminds me that the current "identity wins" rules for
containers, combined with the "NaN==NaN is always False" for
non-container contexts, theoretically also might pose constraints on
the correctness of certain JIT optimizations. I don't know if PyPy
optimizes any code involving tuples or lists of floats, so I don't
know if it is a problem in practice, but it does seem to pose a
complex constraint in theory.

TBH, whatever Raymond may say, I have never been a fan of the "identity
wins" rules for containers given that we don't have a corresponding
rule requiring __eq__ to return True for x.__eq__(x).

On Wed, Apr 27, 2011 at 10:27 PM, Alexander Belopolsky
<alexander.belopolsky at gmail.com> wrote:
> Note that ctypes' floats already behave this way:
>
>>>> x = c_double(float('nan'))
>>>> x == x
> True

But ctypes floats are not numbers. I don't think this provides any
evidence (except possibly of a shortcut in the ctypes implementation
for == :-).
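
A quick check, as a sketch: it assumes the ctypes result comes from
the default identity-based comparison of objects rather than from
anything NaN-specific, which the second comparison below seems to
support.

```python
from ctypes import c_double

x = c_double(float("nan"))
y = c_double(1.0)
z = c_double(1.0)

# Same object on both sides: compares equal.
self_eq = (x == x)

# Equal values in two different objects: not equal, which
# suggests == here is comparing objects, not numbers.
value_eq = (y == z)

# .value hands back ordinary Python floats, and those follow
# the IEEE rule again.
inner_eq = (x.value == x.value)

print(self_eq, value_eq, inner_eq)
# prints: True False False
```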

> Before we go down this path, I would like to discuss another
> peculiarity of NaNs:
>
>>>> float('nan') < 0
> False
>>>> float('nan') > 0
> False
>
> This property in my experience causes much more trouble than nan ==
> nan being false.  The problem is that common sorting or binary search
> algorithms may degenerate into infinite loops in the presence of nans.
>  This may even happen when searching for a finite value in a large
> array that contains a single nan.  Errors like this do happen in the
> wild, and after chasing a bug like this programmers tend to avoid
> nans at all costs.  Oftentimes this leads to using "magic"
> placeholders such as 1e300 for missing data.
>
> Since py3k has already made None < 0 an error, it may be reasonable
> for float('nan') < 0 to raise an error as well (probably ValueError
> rather than TypeError).  This will not make lists with nans sortable
> or searchable using binary search, but will make associated bugs
> easier to find.

Hmm... It feels like a much bigger can of worms and I'm not at all
sure that it is going to work out any better than the current behavior
(which can be coarsely characterized as "tough shit, float + {NaN} do
not form a total ordering" :-). Remember when some string comparisons
would raise exceptions if "uncomparable" Unicode and non-Unicode
values were involved? That was a major pain and we gladly killed that
in Py3k. (Though it was for ==/!=, not for < etc.)
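
To make the ordering problem concrete, here is a minimal sketch using
the standard bisect module. The data list is invented for the example:
it is sorted except for a single NaN in the middle.

```python
from bisect import bisect_left

nan = float("nan")

# Both ordered comparisons against NaN are False.
less = (nan < 0)
greater = (nan > 0)

# Sorted data with one NaN in it.
data = [1.0, 2.0, nan, 3.0, 4.0]

# Binary search for a finite value that is present in the list.
i = bisect_left(data, 4.0)
found = i < len(data) and data[i] == 4.0

# A linear scan has no trouble finding it.
present = 4.0 in data

print(less, greater, i, found, present)
# prints: False False 2 False True
```

The very first probe lands on the NaN, the comparison is False, and
the search discards the half of the list that holds the answer.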

Basically I think the IEEE std has probably done a decent job of
defining how NaNs should behave, with the exception of object identity
-- because the IEEE std does not deal with objects, only with values.
The only other thing that could perhaps work would be to disallow NaN
from ever being created, instead always raising an exception if NaN
would be produced. Like we do with division by zero. But that would be
a *huge* incompatible change to Python's floating point capabilities
and I'm not interested in going there. The *only* point where I think
we might have a real problem is the discrepancy between individual NaN
comparisons and container comparisons involving NaN (which take
identity into account in a way that individual comparisons don't).

On Wed, Apr 27, 2011 at 10:53 PM, Alexander Belopolsky
<alexander.belopolsky at gmail.com> wrote:
> On Thu, Apr 28, 2011 at 12:24 AM, Guido van Rossum <guido at python.org> wrote:
>> So do new masks get created when the outcome of an elementwise
>> operation is a NaN?  Because that's the only reason why one should have
>> NaNs in one's data in the first place.
>
> If this is the case, why does Python almost never produce NaNs as
> the IEEE standard prescribes?
>
>>>> 0.0/0.0
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> ZeroDivisionError: float division

Even the IEEE std, AFAIK, lets you separately control what happens on
zero division and on NaN-producing operations. Python has chosen to
always raise an exception on zero division, and I don't think this
violates the IEEE std.
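
A short sketch of the distinction, using only built-in floats and the
math module: zero division raises, while other invalid operations
quietly produce a NaN.

```python
import math

inf = float("inf")

# Zero division always raises in Python.
try:
    0.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True

# Other invalid operations return NaN without complaint.
quiet = inf - inf
also_quiet = inf * 0.0

print(raised, math.isnan(quiet), math.isnan(also_quiet))
# prints: True True True
```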

>> -- not to indicate missing values!
>
> Sometimes you don't have a choice.  For example when your data comes
> from a database that uses NaNs for missing values.

I would choose to call that a bug in the database. It should use None, not NaN.

On Wed, Apr 27, 2011 at 11:07 PM, Greg Ewing
<greg.ewing at canterbury.ac.nz> wrote:
> Guido van Rossum wrote:
>>
>> Currently NaN is not violating
>> any language rules -- it is just violating users' intuition, in a much
>> worse way than Inf does.
>
> If it's to be an official language non-rule (by which I mean
> that types are officially allowed to compare non-reflexively)
> then any code assuming that identity implies equality for
> arbitrary objects is broken and should be fixed.

Only if there's a use case for passing it NaNs.

On Wed, Apr 27, 2011 at 11:51 PM, Alexander Belopolsky
<alexander.belopolsky at gmail.com> wrote:
> On Thu, Apr 28, 2011 at 2:20 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> ..
>> In that bug, Nick, you mention that reflexive equality is something that
>> container classes rely on in their implementation.  Such reliance seems to
>> me to be a bug, or an inappropriate optimization, ..
>
> An alternative interpretation would be that it is a bug to use NaN
> values in lists.

This would be bad; the list shouldn't care what kind of objects can be
stored in it.

> It is certainly nonsensical to use NaNs as keys in
> dictionaries

But somehow it works, if you consider each NaN *object* as a different
value. :-)
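
A sketch of what "it works" means here, with made-up keys and values:
each NaN object gets its own dict entry and can be looked up again
through that same object, but not through a freshly created NaN.

```python
a = float("nan")
b = float("nan")

# Two distinct NaN objects become two distinct keys.
d = {a: "first", b: "second"}
size = len(d)

# Lookup by the original object succeeds, because the dict
# checks identity before falling back to ==.
via_a = d[a]
via_b = d[b]

# A brand-new NaN matches neither key.
try:
    d[float("nan")]
    fresh_found = True
except KeyError:
    fresh_found = False

print(size, via_a, via_b, fresh_found)
# prints: 2 first second False
```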

> and that reportedly led Java designers to forgo the
> nonreflexivity of nans:
>
> """
> A "NaN" value is not equal to itself. However, a "NaN" Java "Float"
> object is equal to itself. The semantic is defined this way, because
> otherwise "NaN" Java "Float" objects cannot be retrieved from a hash
> table.
> """ - http://www.concentric.net/~ttwang/tech/javafloat.htm

That is exactly the change I am proposing (currently with a strength
of +0) for Python, because Python's containers (at least the built-in
ones) have already decided to follow this rule even if the float type
itself has not yet.

> With the status quo in Python, it may only make sense to store NaNs in
> array.array, but not in a list.

I do not see how this follows.

On Thu, Apr 28, 2011 at 12:57 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Because this assertion is an assertion about the behaviour of
> comparisons that violates IEEE754, while the assertions I list are all
> assertions about the behaviour of containers that can be made true
> *regardless* of IEEE754 by checking identity explicitly.
>
> The correct assertion under Python's current container semantics is:
>
>  if list(c1) == list(c2):  # Make ordering assumption explicit
>    assert all(x is y or x == y for x,y in zip(c1, c2))  # Enforce reflexivity

That does not apply to all containers and does not make much sense for
any containers except those we call sequences (although there are
different but similar rules for other categories of containers). And I
think you meant it backwards: the second line is actually the
(current) *definition* of sequence equality, it does not just follow
from sequence equality.

However, Python *used* to define sequence equality as plain
elementwise equality, meaning that if nan==nan is always False,
[nan]==[nan] would likewise be False.

Raymond strongly believes that containers must be allowed to use the
modified definition, I believe purely for performance reasons.
(Without this rule, a list or tuple could not even cut short being
compared to *itself*.) It seems you are in that camp too.
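
The two definitions can be sketched side by side in pure Python. The
helper names elementwise_eq and identity_first_eq are hypothetical,
written only for this comparison; the real list comparison is done in
C and is not implemented this way.

```python
def elementwise_eq(s1, s2):
    """Older definition: plain == on each pair of elements."""
    return len(s1) == len(s2) and all(
        x == y for x, y in zip(s1, s2))


def identity_first_eq(s1, s2):
    """Current definition: identity short-circuits ==."""
    return len(s1) == len(s2) and all(
        x is y or x == y for x, y in zip(s1, s2))


nan = float("nan")
lst = [nan]

old_rule = elementwise_eq(lst, lst)
new_rule = identity_first_eq(lst, lst)
builtin = (lst == lst)

print(old_rule, new_rule, builtin)
# prints: False True True
```

Under the older definition a list holding a NaN is not even equal to
itself, which is what rules out the cheap self-comparison shortcut.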

I think that if the rule for containers is really that important, we
should take the logical consequence and make a rule that a
well-behaved type defines __eq__ and __ne__ to let object identity
overrule whatever definition of value equality it has, and we should
change float and decimal to follow this rule. (The "well-behaved"
qualifier is intended to clarify that the language doesn't actually
try to enforce this rule, similar to the existing rule about
correspondence between __hash__ and __eq__.)
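
As a rough sketch of what such a "well-behaved" type could look like,
here is a hypothetical float subclass (ReflexiveFloat is an invented
name, not a proposal for the actual implementation) whose __eq__ and
__ne__ let object identity win and otherwise defer to float.

```python
class ReflexiveFloat(float):
    """Hypothetical float whose == lets identity win."""

    def __eq__(self, other):
        if self is other:
            return True
        return float.__eq__(self, other)

    def __ne__(self, other):
        if self is other:
            return False
        return float.__ne__(self, other)

    # Defining __eq__ clears the inherited __hash__; restore it.
    __hash__ = float.__hash__


x = ReflexiveFloat("nan")
y = ReflexiveFloat("nan")

same_eq = (x == x)        # same object: identity wins
same_ne = (x != x)
other_eq = (x == y)       # different objects: IEEE rule
in_list = ([x] == [x])    # now agrees with x == x

print(same_eq, same_ne, other_eq, in_list)
# prints: True False False True
```

With a type like this, the scalar comparison and the container
comparison give the same answer for the same object.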

> Meyer is a purist - sticking with the mathematical definition of
> equality is the sort of thing that fits his view of the world and what
> Eiffel should be, even if it hinders interoperability with other
> languages and tools. Python tends to be a bit more pragmatic about
> things, in particular when it comes to interoperability, so it makes
> sense to follow IEEE754 and the decimal specification at the
> individual comparison level.

So what *does* Eiffel do when comparing two NaNs from different sources?

I would say that in this case, Python's approach started out as naive,
not pragmatic -- I was (and still mostly am) clueless about all issues
numeric. Augmenting float/decimal equality to let object identity win
would be an example of pragmatic.

> However, we can contain the damage to some degree by specifying that
> containers should enforce reflexivity where they need it. This is
> already the case at the implementation level (collections.Sequence
> aside), it just needs to be pushed up to the language definition
> level.

I think that when objects are involved, the word reflexivity does not
convey the right intuition.

>> Can you give examples of algorithms that would break if one of your
>> invariants is violated, but would still work if the data contains
>> NaNs?
>
> Sure, anything that cares more about objects than it does about
> values. The invariants are about making containers behave like
> containers as far as possible, even in the face of recalcitrant types
> like IEEE754 floating point.

TBH I think it's more about being allowed to take various shortcuts in
the implementation than about some abstract behavioral property. The
abstract behavioral property doesn't matter that much in itself, but
assuming that it holds enables the optimization, and the optimization
does matter. Another example of pragmatics.

On Thu, Apr 28, 2011 at 8:52 AM, Robert Kern <robert.kern at gmail.com> wrote:
> Smaller, certainly. But now it's a trilemma. :-)
>
> 1. Have just np.float64 and np.complex128 scalars follow the Python float
> semantics since they subclass Python float and complex, respectively.
> 2. Have all np.float* and np.complex* scalars follow the Python float
> semantics.
> 3. Keep the current IEEE-754 semantics for all float scalar types.

*If* my proposal gets accepted, there will be a blanket rule that no
matter how exotic a type's __eq__ is defined, self.__eq__(self)
(i.e., __eq__ called with the same *object* argument) must return True
if the type's __eq__ is to be considered well-behaved; and Python
containers may assume (for the purpose of optimizing their own
comparison operations) that their elements have a well-behaved __eq__.

-- 
--Guido van Rossum (python.org/~guido)