[Python-ideas] NAN handling in the statistics module

Sun Jan 6 22:41:08 EST 2019

[David Mertz <mertz at gnosis.cx>]
> I have to say though that the existing behavior of `statistics.median[_low|_high|]`
> is SURPRISING if not outright wrong.  It is the behavior in existing Python,
> but it is very strange.
>
> The implementation simply does whatever `sorted()` does, which is an
> implementation detail.  In particular, NaN's being neither less than nor
> greater than any floating point number, just stay where they are during
> sorting.

I expect you inferred that from staring at a handful of examples, but
it's illusion.  Python's sort uses only __lt__ comparisons, and if
those don't implement a total ordering then _nothing_ is defined about
sort's result (beyond that it's some permutation of the original
list).

There's nothing special about NaNs in this.  For example, if you sort
a list of sets, then "<" means subset inclusion, which doesn't define
a total ordering among sets in general either (unless for every pair
of sets in a specific list, one is a proper subset of the other - in
which case the list of sets will be sorted in order of increasing
cardinality).

> But that's a particular feature of TimSort.  Yes, we are guaranteed that sorts
> are stable; and we have rules about which things can and cannot be compared
> for inequality at all.  But beyond that, I do not think Python ever promised that
> NaNs would remain in the same positions after sorting

We don't promise it, and it's not true.  For example,

>>> import math
>>> nan = math.nan
>>> xs = [0, 1, 2, 4, nan, 5, 3]
>>> sorted(xs)
[0, 1, 2, 3, 4, nan, 5]

The NaN happened to move "one place to the right" there.  There's no
point to analyzing "why" - it's purely an accident deriving from the
pattern of __lt__ outcomes the internals happened to invoke.  FYI, it
goes like so:

    is 1 < 0?  No, so the first two are already sorted.
    is 2 < 1?  No, so the first three are already sorted.
    is 4 < 2?  No, so the first four are already sorted
    is nan < 4?  No, so the first five are already sorted
    is 5 < nan?  No, so the first six are already sorted
    is 3 < 5?  Yes!

At that point a binary insertion is used to move 3 into place.

And none of timsort's "fancy" parts even come into play for lists so
small.  The patterns of comparisons the fancy parts invoke can be much
more involved.

At no point does the algorithm have any idea that there are NaNs in
the list - it only looks at boolean __lt__ outcomes.

So, certainly, if you want median to be predictable in the presence of
NaNs, sort's behavior in the presence of NaNs can't be relied on in
any respect.

>>> sorted([6, 5, nan, 4, 3, 2, 1])
[1, 2, 3, 4, 5, 6, nan]

>>> sorted([9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6])
[9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6]