[Python-ideas] NAN handling in the statistics module

Tim Peters tim.peters at gmail.com
Tue Jan 8 23:55:57 EST 2019


I'd like to see internal consistency across the central-tendency
statistics in the presence of NaNs.  What happens now:

mean:  the code appears to guarantee that a NaN will be returned if a
NaN is in the input.

median:  as recently detailed, just about anything can happen,
depending on how undefined behaviors in .sort() interact.

mode:  while NaN != NaN at the Python level, internally dicts use an
identity shortcut so that, effectively, "is" takes precedence over
`__eq__`.  So a given NaN object will be recognized as repeated if it
appears more than once, but distinct NaN objects remain distinct.  So,
e.g.,

>>> from math import inf, nan
>>> import statistics
>>> statistics.mode([2, 2, nan, nan, nan])
nan

That's NOT "NaN-in, NaN-out", it's "a single NaN object is the object
that appeared most often".  Make those 3 distinct NaN objects (inf -
inf results) instead, and the mode changes:

>>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
2

Since the current behavior of `mean()` is the only one that's sane,
that should probably become the default for all of them (NaN in ->
NaN out).

"NaN in -> exception" and "pretend NaNs in the input don't exist" are
the other possibly useful behaviors.

About median speed, I wouldn't worry.  Long ago I tried many
variations of QuickSelect, and it required very large inputs for a
Python-coded QuickSelect to run faster than a straightforward
.sort()+index.  It's bound to be worse now:

- Current Python .sort() is significantly faster on one-type lists
because it figures out the single type-specific comparison routine
needed once at the start, instead of enduring N log N full-blown
PyObject_RichCompareBool calls.

- And the current .sort() can be very much faster than older ones on
data with significant order.  In the limit, .sort()+index will run
faster than any QuickSelect variant on already-sorted or
already-reverse-sorted data.  QuickSelect variants aren't adaptive in
any sense, except that a "fat pivot" version (3-way partition, into <
pivot, == pivot, and > pivot regions) is very effective on data with
many equal values (sketched just below).
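
To make that concrete, a bare-bones fat-pivot QuickSelect in Python
might look like this (random pivot, not tuned at all, and it assumes
cleanly comparable data - a NaN would wreck it just as it wrecks
.sort()):

import random

def quickselect(data, k):
    # Return the k-th smallest element (0-based), repeatedly partitioning
    # into < pivot, == pivot, and > pivot regions ("fat pivot").
    xs = list(data)
    while True:
        pivot = random.choice(xs)
        less = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        if k < len(less):
            xs = less
        elif k < len(less) + len(equal):
            return pivot          # the == region needs no further work
        else:
            k -= len(less) + len(equal)
            xs = [x for x in xs if x > pivot]

>>> quickselect([5, 1, 4, 1, 5, 9, 2, 6], 3)  # sorted: [1, 1, 2, 4, 5, 5, 6, 9]
4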

In Python 3.7.2, for randomly ordered random-ish floats I find that
median() is significantly faster than mean() even on lists with
millions of elements, despite that the former sorts and the latter
doesn't.
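
Easy enough to check on your own box, along these lines (the absolute
numbers will of course vary with machine and build):

from random import random
from statistics import mean, median
from timeit import timeit

data = [random() for _ in range(10**6)]
print(timeit(lambda: median(data), number=3))
print(timeit(lambda: mean(data), number=3))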

