NaN comparisons - Call For Anecdotes
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Jul 18 19:57:03 CEST 2014
On Fri, 18 Jul 2014 01:36:24 +1000, Chris Angelico wrote:
> On Fri, Jul 18, 2014 at 1:12 AM, Johann Hibschman <jhibschman at gmail.com>
> wrote:
>> Well, I just spotted this thread. An easy example is, well, pretty
>> much any case where SQL NULL would be useful. Say I have lists of
>> borrowers, the amount owed, and the amount they paid so far.
>>
>> nan = float("nan")
>> borrowers = ["Alice", "Bob", "Clem", "Dan"]
>> amount_owed = [100.0, nan, 200.0, 300.0]
>> amount_paid = [100.0, nan, nan, 200.0]
>> who_paid_off = [b for (b, ao, ap) in
>>                 zip(borrowers, amount_owed, amount_paid)
>>                 if ao == ap]
>>
>> I want to just get Alice from that list, not Bob. I don't know how
>> much Bob owes or how much he's paid, so I certainly don't know that
>> he's paid off his loan.
>>
>>
> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data. I would advise using either None or a
> dedicated singleton (something like `unknown = object()` would work, or
> you could make a custom type with a more useful repr).
Hmmm, there's something to what you say there, but IEEE-754 NANs seem to
have been designed to do quadruple (at least!) duty with multiple
meanings, including:
- Missing values ("I took a reading, but I can't read my handwriting").
- Data known only qualitatively, not quantitatively (e.g. windspeed =
  "fearsome").
- Inapplicable values, e.g. the average depth of the oceans on Mars.
- The result of calculations which are mathematically indeterminate,
  such as 0/0.
- The result of real-valued calculations which are invalid due to
  domain errors, such as sqrt(-1) or acos(2.5).
- The result of calculations which are conceptually valid, but are
  unknown due to limitations of floats, e.g. you have two finite
  quantities which have both overflowed to INF, the difference
  between them ought to be finite, but there's no way to tell what
  it should be.
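
As a rough illustration of the calculation-related cases, here is a minimal
Python sketch. It assumes CPython's usual behaviour, where float
multiplication overflows silently to INF but 0.0/0.0 and the math module's
domain errors raise exceptions instead of returning a NAN. The helper name
`attempt` is made up for this example.

```python
import math

# Two finite quantities that both overflow to INF. Their true difference
# would be finite, but the float result is NAN.
a = 1e308 * 10       # overflows to inf
b = 1e308 * 100      # also overflows to inf
diff = a - b         # inf - inf gives nan


def attempt(func):
    """Hypothetical helper: return func()'s result, or the name of the
    exception it raised."""
    try:
        return func()
    except (ZeroDivisionError, ValueError) as exc:
        return type(exc).__name__


# Python reports these as exceptions rather than quietly producing a NAN.
zero_div = attempt(lambda: 0.0 / 0.0)        # "ZeroDivisionError"
neg_sqrt = attempt(lambda: math.sqrt(-1))    # "ValueError"
bad_acos = attempt(lambda: math.acos(2.5))   # "ValueError"

print(math.isnan(diff), zero_div, neg_sqrt, bad_acos)
```

So in plain Python, the overflow case is the one most likely to hand you a
NAN without warning; the others have to be converted to NANs deliberately.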
It seems to me that the way you treat a NAN will often depend on which
category it falls under. E.g. when taking the average of a set of values,
missing values ought to be skipped over, while actual indeterminate NANs
ought to carry through:

    average([1, 1, 1, Missing, 1]) => 1
    average([1, 1, 1, 0/0, 1]) => NAN

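
A minimal sketch of that distinction in Python, assuming missing values are
marked with a hypothetical `MISSING` sentinel (here simply None) rather than
with a NAN. Since 0/0 raises ZeroDivisionError in Python, float("nan") stands
in for the indeterminate value. This `average` is illustrative only, not an
existing library function.

```python
import math

MISSING = None  # hypothetical sentinel for "no reading available"


def average(values):
    """Mean of the values, skipping MISSING but letting NANs propagate."""
    present = [v for v in values if v is not MISSING]
    if not present:
        # Nothing to average: the result is itself indeterminate.
        return float("nan")
    return sum(present) / len(present)


nan = float("nan")
skipped = average([1, 1, 1, MISSING, 1])   # 1.0, the missing value is ignored
carried = average([1, 1, 1, nan, 1])       # nan, the NAN carries through

print(skipped, carried)
```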
I know that R distinguishes between NA and IEEE-754 NANs, although I'm
not sure how complete its support for NANs is. But many (most?) R
functions take an argument controlling whether or not to ignore NA values.
In principle, you can encode the different meanings into NANs using the
payload. There are 9007199254740990 possible Python float NANs. About half
of these are signalling NANs, the rest are quiet NANs. Ignoring the sign bit
leaves us with 2251799813685247 distinct sNANs and 2251799813685248 qNANs. That's
enough to encode a *lot* of different meanings.
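
To make the payload idea concrete, here is a hedged sketch using the struct
module to build a quiet NAN carrying a chosen payload and to read it back.
The names `make_qnan`, `nan_payload` and the mask constants are hypothetical.
The bit layout assumes IEEE-754 binary64 with the most significant fraction
bit acting as the quiet flag, which is the common convention but not a
universal one, and whether a payload survives arithmetic depends on the
platform.

```python
import math
import struct

EXP_MASK = 0x7FF0000000000000      # all eleven exponent bits set
QUIET_BIT = 0x0008000000000000     # top fraction bit, assumed to mean "quiet"
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF  # the remaining 51 fraction bits


def make_qnan(payload):
    """Build a quiet NAN whose low 51 fraction bits hold `payload`."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    bits = EXP_MASK | QUIET_BIT | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def nan_payload(x):
    """Extract the low 51 fraction bits of a NAN."""
    if not math.isnan(x):
        raise ValueError("not a NAN")
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK


# Hypothetical meaning codes, chosen only for this example.
MISSING_READING = 1
INAPPLICABLE = 2

x = make_qnan(MISSING_READING)
print(math.isnan(x), nan_payload(x))
```

Only the round trip through struct is relied on here; nothing is assumed
about what happens to the payload once the NAN takes part in a calculation.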
[Aside: I find myself perplexed why IEEE-754 says that the sign bit of
NANs should be ignored, but then specifies that another bit is to be used
to distinguish signalling from quiet NANs. Why not just say that NANs
with the sign bit set are signalling, and those with it clear are quiet?]
--
Steven