[Python-ideas] NAN handling in the statistics module
Tim Peters
tim.peters at gmail.com
Tue Jan 8 23:55:57 EST 2019
I'd like to see internal consistency across the central-tendency
statistics in the presence of NaNs. What happens now:
mean: the code appears to guarantee that a NaN will be returned if a
NaN is in the input.
median: as recently detailed, just about anything can happen,
depending on how undefined behaviors in .sort() interact.
mode: while NaN != NaN at the Python level, internally dicts use an
identity shortcut so that, effectively, "is" takes precedence over
`__eq__`. So a given NaN object will be recognized as repeated if it
appears more than once, but distinct NaN objects remain distinct. So,
e.g.,
>>> from math import inf, nan
>>> import statistics
>>> statistics.mode([2, 2, nan, nan, nan])
nan
That's NOT "NaN-in, NaN-out", it's "a single NaN object is the object
that appeared most often". Make those 3 distinct NaN objects (inf -
inf results) instead, and the mode changes:
>>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
2
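The identity shortcut can be seen directly with a plain dict (a minimal illustration of the mechanism, not the statistics module's internals):

```python
from math import inf, nan

# NaN compares unequal to itself at the Python level:
print(nan == nan)  # False

# But dict lookup checks identity before __eq__, so the *same* NaN
# object is treated as a repeated key:
counts1 = {}
for x in [nan, nan, nan]:
    counts1[x] = counts1.get(x, 0) + 1
print(counts1)  # {nan: 3}

# Distinct NaN objects (each `inf - inf` creates a fresh one) fail both
# the identity check and __eq__, so each becomes its own key:
counts2 = {}
for x in [inf - inf, inf - inf, inf - inf]:
    counts2[x] = counts2.get(x, 0) + 1
print(len(counts2))  # 3
```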
Since the current behavior of `mean()` is the only one that's sane,
that should probably become the default for all of them (NaN in ->
NaN out).
"NaN in -> exception" and "pretend NaNs in the input don't exist" are
the other possibly useful behaviors.
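The three behaviors could be sketched as a wrapper like the following (`mean_with_nan_policy` and its `nan_policy` parameter are hypothetical names for illustration, not a proposed API):

```python
import math
import statistics

def mean_with_nan_policy(data, nan_policy="propagate"):
    """Hypothetical sketch of the three behaviors discussed:

    'propagate' -- NaN in -> NaN out (what mean() effectively does now)
    'raise'     -- NaN in -> exception
    'ignore'    -- pretend NaNs in the input don't exist
    """
    data = list(data)
    def is_nan(x):
        return isinstance(x, float) and math.isnan(x)
    if any(is_nan(x) for x in data):
        if nan_policy == "propagate":
            return math.nan
        if nan_policy == "raise":
            raise ValueError("NaN in input")
        # 'ignore': drop the NaNs before computing
        data = [x for x in data if not is_nan(x)]
    return statistics.mean(data)

print(mean_with_nan_policy([1.0, 3.0, math.nan], "ignore"))  # 2.0
```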
About median speed, I wouldn't worry. Long ago I tried many
variations of QuickSelect, and it required very large inputs for a
Python-coded QuickSelect to run faster than a straightforward
.sort()+index. It's bound to be worse now:
- Current Python .sort() is significantly faster on one-type lists
because it figures out the single type-specific comparison routine
needed once at the start, instead of enduring N log N full-blown
PyObject_RichCompareBool calls.
- And the current .sort() can be very much faster than older ones on
data with significant order. In the limit, .sort()+index will run
faster than any QuickSelect variant on already-sorted or
already-reverse-sorted data. QuickSelect variants aren't adaptive in
any sense, except that a "fat pivot" version (3-way partition, into <
pivot, == pivot, and > pivot regions) is very effective on data with
many equal values.
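For concreteness, a fat-pivot QuickSelect of the kind described might look like this (a sketch for comparison, not the module's code; `quickselect` and its 0-based `k` are my own naming):

```python
import random

def quickselect(data, k):
    """Return the k-th smallest element (0-based) of data.

    Uses a "fat pivot": each pass does a 3-way partition into
    < pivot, == pivot, and > pivot regions, so runs of equal
    values are resolved in a single pass.
    """
    lst = list(data)
    while True:
        pivot = random.choice(lst)
        less = [x for x in lst if x < pivot]
        equal = [x for x in lst if x == pivot]
        if k < len(less):
            lst = less                      # answer is in the < region
        elif k < len(less) + len(equal):
            return pivot                    # answer == pivot
        else:
            k -= len(less) + len(equal)     # answer is in the > region
            lst = [x for x in lst if x > pivot]

# e.g. the median of an odd-length list:
print(quickselect([5, 3, 1, 4, 2], len([5, 3, 1, 4, 2]) // 2))  # 3
```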
In Python 3.7.2, for randomly ordered random-ish floats I find that
median() is significantly faster than mean() even on lists with
millions of elements, even though the former sorts and the latter
doesn't.
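A quick way to check this on your own machine (the exact numbers vary by hardware and Python version; mean()'s internal exact-rational summation is presumably what makes it lose despite doing no sort):

```python
import random
import statistics
import timeit

data = [random.random() for _ in range(100_000)]

# number=1 keeps the run short; this just shows how such a
# comparison can be made, not a definitive benchmark.
t_median = timeit.timeit(lambda: statistics.median(data), number=1)
t_mean = timeit.timeit(lambda: statistics.mean(data), number=1)
print(f"median: {t_median:.3f}s  mean: {t_mean:.3f}s")
```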