Re: [Python-ideas] NAN handling in the statistics module

Jan. 6, 2019

      [David Mertz <mertz@gnosis.cx>]
...
I have to say though that the existing behavior of `statistics.median[_low|_high|]`
is SURPRISING if not outright wrong.  It is the behavior in existing Python,
but it is very strange.
The implementation simply does whatever `sorted()` does, which is an
implementation detail.  In particular, NaN's being neither less than nor
greater than any floating point number, just stay where they are during
sorting.
I expect you inferred that from staring at a handful of examples, but
it's illusion.  Python's sort uses only __lt__ comparisons, and if
those don't implement a total ordering then _nothing_ is defined about
sort's result (beyond that it's some permutation of the original
list).

There's nothing special about NaNs in this.  For example, if you sort
a list of sets, then "<" means subset inclusion, which doesn't define
a total ordering among sets in general either (unless for every pair
of sets in a specific list, one is a proper subset of the other - in
which case the list of sets will be sorted in order of increasing
cardinality).
...
But that's a particular feature of TimSort.  Yes, we are guaranteed that sorts
are stable; and we have rules about which things can and cannot be compared
for inequality at all.  But beyond that, I do not think Python ever promised that
NaNs would remain in the same positions after sorting
We don't promise it, and it's not true.  For example,
...
...
...
import math
nan = math.nan
xs = [0, 1, 2, 4, nan, 5, 3]
sorted(xs)
[0, 1, 2, 3, 4, nan, 5]
The NaN happened to move "one place to the right" there.  There's no
point to analyzing "why" - it's purely an accident deriving from the
pattern of __lt__ outcomes the internals happened to invoke.  FYI, it
goes like so:

    is 1 < 0?  No, so the first two are already sorted.
    is 2 < 1?  No, so the first three are already sorted.
    is 4 < 2?  No, so the first four are already sorted
    is nan < 4?  No, so the first five are already sorted
    is 5 < nan?  No, so the first six are already sorted
    is 3 < 5?  Yes!

At that point a binary insertion is used to move 3 into place.

And none of timsort's "fancy" parts even come into play for lists so
small.  The patterns of comparisons the fancy parts invoke can be much
more involved.

At no point does the algorithm have any idea that there are NaNs in
the list - it only looks at boolean __lt__ outcomes.

So, certainly, if you want median to be predictable in the presence of
NaNs, sort's behavior in the presence of NaNs can't be relied on in
any respect.
...
...
...
sorted([6, 5, nan, 4, 3, 2, 1])
[1, 2, 3, 4, 5, 6, nan]
...
...
...
sorted([9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6])
[9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6]

Re: [Python-ideas] NAN handling in the statistics module

Tim Peters