Re: [Python-ideas] NAN handling in the statistics module

7 Jan 2019


      [... apologies if this is dup, got a bounce ...]
...
[David Mertz <mertz@gnosis.cx>]
...
I have to say though that the existing behavior of
`statistics.median[_low|_high|]` is SURPRISING if not outright wrong.  It is
the behavior in existing Python, but it is very strange.
The implementation simply does whatever `sorted()` does, which is an
implementation detail.  In particular, NaNs, being neither less than nor
greater than any floating point number, just stay where they are during
sorting.
I expect you inferred that from staring at a handful of examples, but
it's an illusion.  Python's sort uses only __lt__ comparisons, and if
those don't implement a total ordering then _nothing_ is defined about
sort's result (beyond that it's some permutation of the original
list).
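
A small sketch of that point, for anyone who wants to poke at it.  It only
checks the two things stated above: that __lt__ is always False when a NaN
is involved, and that the sorted result is a permutation of the input.  The
variable names are made up for illustration, and the final arrangement is
printed rather than asserted because nothing promises what it will be.

```python
import math

nan = float("nan")

# Every ordering comparison involving NaN is False, so __lt__ does not
# define a total ordering on a list that contains one.
comparisons = (nan < 1.0, 1.0 < nan, nan == nan)
print(comparisons)  # (False, False, False)

data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
result = sorted(data)

# The one property that still holds: result is a permutation of data.
same_length = len(result) == len(data)
same_reals = (sorted(x for x in result if not math.isnan(x))
              == sorted(x for x in data if not math.isnan(x)))
same_nan_count = sum(math.isnan(x) for x in result) == 1
print(same_length and same_reals and same_nan_count)  # True

# Where the NaN lands, and how the other values are arranged around it,
# is unspecified; print it rather than assume it.
print(result)
```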
Thanks Tim for clarifying.  Is it even the case that sorts are STABLE in
the face of non-total orderings under __lt__?  A couple quick examples
don't refute that, but what I tried was not very thorough, nor did I
think much about TimSort itself.
...
So, certainly, if you want median to be predictable in the presence of
NaNs, sort's behavior in the presence of NaNs can't be relied on in
any respect.
Playing with Tim's examples, this suggests that statistics.median() is
simply outright WRONG.  I can think of absolutely no way to characterize
these as reasonable results:

Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
Out[4]: 1
In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
Out[5]: nan
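
One way to make the two calls above predictable is to decide what to do with
NaNs before sort ever sees them.  The following is only a sketch under that
assumption: `median_with_nan_policy` and its `on_nan` parameter are
hypothetical names, not part of the statistics module, and only float NaNs
are detected (Decimal and Fraction inputs are out of scope here).

```python
import math
import statistics


def _is_nan(x):
    # Sketch only: checks float NaNs, nothing else.
    return isinstance(x, float) and math.isnan(x)


def median_with_nan_policy(values, on_nan="raise"):
    """Hypothetical wrapper around statistics.median with explicit NaN handling."""
    if on_nan not in ("raise", "ignore"):
        raise ValueError(f"unknown on_nan policy: {on_nan!r}")
    values = list(values)
    if any(_is_nan(x) for x in values):
        if on_nan == "raise":
            raise ValueError("NaN in input to median")
        # on_nan == "ignore": drop NaNs so the remaining values are totally ordered.
        values = [x for x in values if not _is_nan(x)]
    return statistics.median(values)


nan = float("nan")
print(median_with_nan_policy([9, 9, 9, nan, 1, 2, 3, 4, 5], on_nan="ignore"))  # 4.5
print(median_with_nan_policy([9, 9, 9, nan, 1, 2, 3, 4], on_nan="ignore"))     # 4
```

With the NaN removed, the remaining values do have a total ordering, so the
result no longer depends on where sort happens to leave the NaN.  Whether
ignoring or raising is the better default is a separate question.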