NaN comparisons - Call For Anecdotes
Johann Hibschman
jhibschman at gmail.com
Thu Jul 17 14:49:15 EDT 2014
Chris Angelico <rosuav at gmail.com> writes:
> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data.
Regardless of whether this is the right design, it's still an example of
NaN comparisons in actual use.
As to the design, using NaN to implement NA is a hack with a long
history; see
http://www.numpy.org/NA-overview.html
for some color. Using NaN gets us a hardware-accelerated implementation
with just about the right semantics. In a real example, these lists are
numpy arrays with tens of millions of elements, so this isn't a trivial
benefit. (Technically, that's what's in the database; a given analysis
may look at a sample of 100k or so.)
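To make that concrete, here's a minimal sketch of the pattern; the
values and sizes are illustrative, not from the real data set:

import numpy as np

# NaN marks the "missing" entries directly in the float array, so no
# separate mask array has to be carried around.
values = np.array([1.5, np.nan, 2.5, np.nan, 4.0])

missing = np.isnan(values)       # -> [False  True False  True False]
observed = values[~missing]      # -> [1.5  2.5  4.0]

# NaN-aware reductions ignore the missing entries, and everything runs
# as vectorized loops, which is what matters at tens of millions of
# elements.
mean_of_observed = np.nanmean(values)   # -> 2.666...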
> You have a special business case here (the need to
> record information with a "maybe" state), and you need to cope with
> it, which means dedicated logic and planning and design and code.
Yes, in principle. In practice, everyone is used to the semantics of
R-style missing data, which are reasonably well matched by NaN. In
principle, (NA == 1.0) should be an NA (missing) truth value, as should
(NA == NA), but in practice having it be False is more useful. As an
example, indexing R vectors by a boolean vector containing NA yields NA
results, which is a feature that I never want.
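For example, with made-up values (the semantics, not the numbers, are
the point):

import numpy as np

na = float("nan")    # NaN standing in for NA

print(na == 1.0)     # False; strict NA semantics would call this missing
print(na == na)      # False as well

# Comparing against a NaN-as-NA array yields a plain boolean mask, so
# indexing simply drops the missing entries instead of propagating NA
# into the result the way R does.
x = np.array([0.5, np.nan, 2.0])
print(x > 1.0)       # -> [False False  True]
print(x[x > 1.0])    # -> [2.]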
Cheers,
Johann