I'd like to see internal consistency across the central-tendency statistics in the presence of NaNs. What happens now:

mean: the code appears to guarantee that a NaN will be returned if a NaN is in the input.

median: as recently detailed, just about anything can happen, depending on how undefined behaviors in .sort() interact.

mode: while NaN != NaN at the Python level, internally dicts use an identity shortcut so that, effectively, "is" takes precedence over `__eq__`. So a given NaN object will be recognized as repeated if it appears more than once, but distinct NaN objects remain distinct. So, e.g.,
>>> from math import inf, nan
>>> import statistics
>>> statistics.mode([2, 2, nan, nan, nan])
nan
That's NOT "NaN-in, NaN-out", it's "a single NaN object is the object that appeared most often". Make those 3 distinct NaN objects (inf - inf results) instead, and the mode changes:
>>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
2
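For what it's worth, the identity shortcut is easy to see directly with a plain dict (which is what mode's counting uses under the covers), no statistics needed:

>>> from math import nan
>>> d = {}
>>> d[nan] = 1            # same NaN object again: found by identity,
>>> d[nan] = 2            # so this overwrites rather than adding a key
>>> len(d)
1
>>> d[float("nan")] = 3   # a distinct NaN object: a new key
>>> len(d)
2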
Since the current behavior of `mean()` is the only one that's sane, that should probably become the default for all of them (NaN in -> NaN out). "NaN in -> exception" and "pretend NaNs in the input don't exist" are the other possibly useful behaviors (a rough sketch of all three is at the end of this message).

About median speed, I wouldn't worry. Long ago I tried many variations of QuickSelect, and it required very large inputs for a Python-coded QuickSelect to run faster than a straightforward .sort()+index. It's bound to be worse now:

- Current Python .sort() is significantly faster on one-type lists because it figures out the single type-specific comparison routine needed once at the start, instead of enduring N log N full-blown PyObject_RichCompareBool calls.

- And the current .sort() can be very much faster than older ones on data with significant order. In the limit, .sort()+index will run faster than any QuickSelect variant on already-sorted or already-reverse-sorted data.

QuickSelect variants aren't adaptive in any sense, except that a "fat pivot" version (3-way partition, into < pivot, == pivot, and > pivot regions) is very effective on data with many equal values (a sketch follows below).

In Python 3.7.2, for randomly ordered random-ish floats I find that median() is significantly faster than mean() even on lists with millions of elements, even though the former sorts and the latter doesn't.
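For concreteness, here's a minimal fat-pivot QuickSelect of the kind I mean. The function name and the random pivot choice are just for illustration, not a proposal for the module:

import random

def quickselect(xs, k):
    # Return the k'th smallest element of xs (0-based), using a
    # "fat pivot": each pass 3-way partitions into < pivot, == pivot,
    # and > pivot regions, so all duplicates of the pivot are disposed
    # of in a single pass.  Requires 0 <= k < len(xs).
    xs = list(xs)
    while True:
        pivot = xs[random.randrange(len(xs))]
        less = [x for x in xs if x < pivot]
        greater = [x for x in xs if x > pivot]
        num_equal = len(xs) - len(less) - len(greater)
        if k < len(less):
            xs = less                    # answer is in the < region
        elif k < len(less) + num_equal:
            return pivot                 # answer is a copy of the pivot
        else:
            k -= len(less) + num_equal   # answer is in the > region
            xs = greater

Then quickselect(xs, len(xs) // 2) picks out a median element. Note that it isn't adaptive: it does essentially the same work whether the input is randomly ordered or already sorted, which is why .sort()+index wins on data with significant order. And feed it a NaN and, as with sort-based median, all bets are off.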
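Back to NaN handling, here's the promised rough sketch of the three behaviors, assuming a hypothetical wrapper with a nan_policy string argument (nothing like this exists in the module today, and it only looks for float NaNs, ignoring the Decimal wrinkle):

import math
import statistics

def central(func, data, nan_policy="poison"):
    # Hypothetical: apply func (mean/median/mode) to data under a
    # NaN policy.
    #   "poison" - NaN in -> NaN out (what mean() effectively does now)
    #   "raise"  - NaN in -> exception
    #   "ignore" - pretend NaNs in the input don't exist
    def isnan(x):
        return isinstance(x, float) and math.isnan(x)
    data = list(data)
    if any(isnan(x) for x in data):
        if nan_policy == "poison":
            return math.nan
        if nan_policy == "raise":
            raise statistics.StatisticsError("NaN in input")
        if nan_policy == "ignore":
            data = [x for x in data if not isnan(x)]
    return func(data)

So central(statistics.median, [1.0, math.nan, 3.0], "ignore") returns 2.0, and under any one policy all three statistics behave the same way no matter how many distinct NaN objects appear in the input.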