float("nan") in set or as key

Fri Jun 3 00:23:10 EDT 2011

On Fri, 03 Jun 2011 11:17:17 +1200, Gregory Ewing wrote:

> Steven D'Aprano wrote:
> 
>> def kronecker(x, y):
>>     if x == y: return 1
>>     return 0
>> 
>> This will correctly consume NAN arguments. If either x or y is a NAN,
>> it will return 0.
> 
> I'm far from convinced that this result is "correct". For one thing, the
> Kronecker delta is defined on integers, not reals, so expecting it to
> deal with NaNs at all is nonsensical. 

Fair point. Call it an extension of the Kronecker Delta to the reals then.

> For another, this function as
> written is numerically suspect, since it relies on comparing floats for
> exact equality.

Well, it is a throwaway function demonstrating a principle, not
battle-hardened production code.

But it's hard to say exactly what alternative there is, if you're going 
to accept floats. Should you compare them using an absolute error? If so, 
you're going to run into trouble if your floats get large. It is very 
amusing when people feel all virtuous for avoiding equality and then 
inadvertently do something like this:

y = 2.1e12
if abs(x - y) <= 1e-9:
    # x is equal to y, within exact tolerance
    ...

Apart from being slower and harder to read, how is this different from 
the simpler, more readable x == y?
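
To make that concrete, here is a small sketch, assuming Python 3.9 or later
for math.ulp and math.nextafter. The helper name within_tolerance is made up
for illustration. Near 2.1e12 adjacent floats are about 2.4e-4 apart, so a
tolerance of 1e-9 can only ever accept x == y:

```python
import math

y = 2.1e12

def within_tolerance(x):
    # The "virtuous" absolute-error comparison from the snippet above.
    return abs(x - y) <= 1e-9

# Spacing between adjacent floats at this magnitude (one ULP).
gap = math.ulp(y)
print(gap)  # 0.000244140625, i.e. 2**-12

# The nearest representable neighbours of y are already outside the tolerance.
above = math.nextafter(y, math.inf)
below = math.nextafter(y, -math.inf)
print(within_tolerance(above), within_tolerance(below))  # False False
print(within_tolerance(y))                               # True
```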

What about a relative error? Then you'll get into trouble when the floats 
are very small. And how much error should you accept? What's good for 
your application may not be good for mine.
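
A sketch of the small-value problem, assuming math.isclose (available in
Python 3.5 and later) as the relative-error comparison. The tolerance of 1e-9
is an arbitrary choice for illustration, not a recommendation:

```python
import math

# A relative tolerance behaves sensibly at ordinary magnitudes...
print(math.isclose(1000.0, 1000.0000001, rel_tol=1e-9))  # True

# ...but it shrinks along with the values, so nothing except zero itself
# is ever "close" to zero, however tiny it is.
print(math.isclose(1e-300, 0.0, rel_tol=1e-9))     # False
print(math.isclose(1e-300, 2e-300, rel_tol=1e-9))  # False
```

Whether 1e-300 should count as equal to zero depends entirely on the
application, which is the point.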

Even if you define your equality function to accept some limited error 
measured in Units in Last Place (ULP), "equal to within 2 ULP" (or any 
other fixed tolerance) is no better, or safer, than exact equality, and 
very likely worse.

In practice, either the function needs some sort of "how to decide 
equality" parameter, so the caller can decide what counts as equal in 
their application, or you use exact floating point equality and leave it 
up to the caller to make sure the arguments are correctly rounded so that 
values which should compare equal do compare equal.
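
The first option might look something like this. It is only a sketch: the
eq parameter is a hypothetical addition to the kronecker function quoted
above, defaulting to exact equality, and math.isclose (Python 3.5+) stands
in for whatever comparison the caller actually wants:

```python
import math
import operator

def kronecker(x, y, eq=operator.eq):
    """Return 1 if eq(x, y) is true, otherwise 0.

    eq is any two-argument callable; the default is exact equality.
    """
    return 1 if eq(x, y) else 0

nan = float("nan")
print(kronecker(0.1 + 0.2, 0.3))                   # 0, exact equality
print(kronecker(0.1 + 0.2, 0.3, eq=math.isclose))  # 1, caller's tolerance
print(kronecker(nan, 0.3, eq=math.isclose))        # 0, NAN equals nothing
```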

> But the most serious problem is, given that
> 
>> NAN is a sentinel for an invalid operation. NAN + NAN returns a NAN
>> because it is an invalid operation,
> 
> if kronecker(NaN, x) or kronecker(x, NaN) returns anything other than
> NaN or some other sentinel value, then you've *lost* the information
> that an invalid operation occurred somewhere earlier in the computation.

If that's the most serious problem, then I'm laughing, because of course 
I haven't lost anything.

x = result_of_some_computation(a, b, c)  # may return NAN
y = kronecker(x, 42)

How have I lost anything? I still have the result of the computation in 
x. If I throw that value away, it is because I no longer need it. If I do 
need it, it is right there, where it always was.
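
A runnable version of the same idea, as a sketch. Here inf - inf stands in
for result_of_some_computation, which is not defined anywhere in this
thread:

```python
import math

def kronecker(x, y):
    if x == y:
        return 1
    return 0

# Stand-in for a computation that went wrong: inf - inf is an invalid
# operation and yields a NAN.
x = float("inf") - float("inf")
y = kronecker(x, 42)

print(y)              # 0
print(math.isnan(x))  # True: the NAN is still available to inspect
```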

You seem to have fallen for the myth that NANs, once they appear, may 
never disappear. This is a common misapprehension, e.g.:

    "NaN is like a trap door that once you have fallen in you cannot
     come back out. Otherwise, the possibility exists that a calculation
     will have gone off course undetectably."

http://www.rhinocerus.net/forum/lang-fortran/94839-fortran-ieee-754-maxval-inf-nan-2.html#post530923

Certainly if you, the function writer, have any reasonable doubt about the 
validity of a NAN input, you should return a NAN. But that doesn't mean 
that NANs are "trap doors". It is fine for them to disappear *if they 
don't matter* to the final result of the calculation. I quote:

    "The key result of these rules is that once you get a NaN during 
     a computation, the NaN has a STRONG TENDENCY [emphasis added] to
     propagate itself throughout the rest of the computation..."

http://www.savrola.com/resources/NaN.html

Another couple of good examples:

- from William Kahan, and the C99 standard: hypot(INF, x) is always INF 
regardless of the value of x, hence hypot(INF, NAN) returns INF.

- since pow(x, 0) is always 1 regardless of the value of x, pow(NAN, 0) 
is also 1.
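
Both cases can be checked from Python, whose math module follows the C99
behaviour here. A short sketch:

```python
import math

inf = float("inf")
nan = float("nan")

# The NAN disappears because it cannot affect the result.
print(math.hypot(inf, nan))  # inf
print(math.pow(nan, 0))      # 1.0
print(nan ** 0)              # 1.0

# Where the NAN does matter, it propagates as usual.
print(math.isnan(nan + 1))   # True
```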

In the case of the real-valued Kronecker delta, I argue that the NAN 
doesn't matter, and it is reasonable to allow it to disappear.

Another standard example where NANs get thrown away is the max and min 
functions. The 2008 revision of IEEE-754 defines minNum and maxNum 
operations that return the other operand when exactly one argument is a 
quiet NAN.
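
A sketch of a NAN-ignoring maximum in that spirit. The name max_num is a
hypothetical helper, not something from the standard library, and it makes
no attempt to cover signalling NANs or signed zeros. Note that Python's
built-in max makes no such promise: its result depends on argument order.

```python
import math

def max_num(x, y):
    """Return the larger of x and y, ignoring a single NAN argument."""
    if math.isnan(x):
        return y  # also returns NAN when both arguments are NAN
    if math.isnan(y):
        return x
    return max(x, y)

nan = float("nan")
print(max_num(nan, 1.0), max_num(1.0, nan))  # 1.0 1.0
print(max(nan, 1.0), max(1.0, nan))          # nan 1.0
```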

> You can't get a valid result from data produced by an invalid
> computation. Garbage in, garbage out.

Of course you can. Here's a trivial example:

def f(x):
    return 1

It doesn't matter what value x takes, the result of f(x) should be 1. 
What advantage is there in having f(NAN) return NAN? 

>> not because NANs are magical goop that spoil everything they touch.
> 
> But that's exactly how they *have* to behave if they truly indicate an
> invalid operation.
> 
> SQL has been mentioned in relation to all this. It's worth noting that
> the result of comparing something to NULL in SQL is *not* true or false
> -- it's NULL!

I'm sure they have their reasons for that. Whether they are good reasons 
or not, I don't know. I do know that the 1999 SQL standard defined *four* 
results for boolean comparisons, true/false/unknown/null, but allowed 
implementations to treat unknown and null as the same.

-- 
Steven