[Python-ideas] NAN handling in the statistics module

Sun Jan 6 23:05:39 EST 2019

[... apologies if this is dup, got a bounce ...]

> [David Mertz <mertz at gnosis.cx>]
>> I have to say though that the existing behavior of
>> `statistics.median[_low|_high|]` is SURPRISING if not outright wrong.
>> It is the behavior in existing Python, but it is very strange.
>>
>> The implementation simply does whatever `sorted()` does, which is an
>> implementation detail.  In particular, NaNs, being neither less than nor
>> greater than any floating point number, just stay where they are during
>> sorting.
>
> I expect you inferred that from staring at a handful of examples, but
> it's an illusion.  Python's sort uses only __lt__ comparisons, and if
> those don't implement a total ordering then _nothing_ is defined about
> sort's result (beyond that it's some permutation of the original
> list).
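
To make that point concrete, here is a minimal sketch of what can and cannot be checked about sorting a list that contains a NaN. The helper `count_nans` is hypothetical and exists only for this illustration. The only property tested is the one stated above: the result is some permutation of the input.

```python
import math

nan = float("nan")

# Every ordered comparison involving NaN is False (and NaN != NaN),
# so __lt__ does not define a total ordering on floats.
comparisons = [nan < 1.0, nan > 1.0, 1.0 < nan, 1.0 > nan, nan == nan]


def count_nans(values):
    """Count NaN entries (hypothetical helper, for illustration only)."""
    return sum(1 for v in values if isinstance(v, float) and math.isnan(v))


data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
result = sorted(data)

# The only thing to rely on: `result` is a permutation of `data`.
same_length = len(result) == len(data)
same_nan_count = count_nans(result) == count_nans(data)
# `v == v` is False only for NaN, so this compares the non-NaN values.
same_non_nans = (sorted(v for v in result if v == v)
                 == sorted(v for v in data if v == v))

print(comparisons)
print(result)  # the order is unspecified, so no expected output is given
```

Where the NaN ends up, and whether the other values come out in ascending order, is deliberately not asserted here.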

Thanks, Tim, for clarifying.  Is it even the case that sorts are STABLE in
the face of non-total orderings under __lt__?  A couple of quick examples
don't refute that, but what I tried was not very thorough, nor did I
think much about TimSort itself.

> So, certainly, if you want median to be predictable in the presence of
> NaNs, sort's behavior in the presence of NaNs can't be relied on in
> any respect.

Playing with Tim's examples suggests that statistics.median() is
simply outright WRONG.  I can think of absolutely no way to characterize
these as reasonable results:

Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
Out[4]: 1
In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
Out[5]: nan
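
For comparison, here is a rough sketch of what a predictable result could look like. It is not the statistics module's API: the wrapper `median_nan_aware` and its `nan_policy` parameter are hypothetical names invented for this illustration, and only float NaNs are detected (Decimal NaNs, for instance, are not handled).

```python
import math
import statistics


def _is_nan(value):
    """True for float NaN only; other numeric types are not checked."""
    return isinstance(value, float) and math.isnan(value)


def median_nan_aware(data, nan_policy="raise"):
    """Median with an explicit NaN policy (hypothetical sketch).

    nan_policy:
      "raise"     -> raise ValueError if any NaN is present
      "propagate" -> return NaN if any NaN is present
      "omit"      -> drop NaNs, then take the median of the rest
    """
    if nan_policy not in ("raise", "propagate", "omit"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    values = list(data)
    if any(_is_nan(v) for v in values):
        if nan_policy == "raise":
            raise ValueError("data contains NaN")
        if nan_policy == "propagate":
            return math.nan
        # "omit": filter NaNs out so sorted() sees a total ordering
        values = [v for v in values if not _is_nan(v)]
    return statistics.median(values)


nan = float("nan")
print(median_nan_aware([9, 9, 9, nan, 1, 2, 3, 4, 5], nan_policy="omit"))
print(median_nan_aware([9, 9, 9, nan, 1, 2, 3, 4], nan_policy="propagate"))
```

With any of the three policies the answer no longer depends on where the NaN happens to sit in the input.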