[Python-ideas] NAN handling in the statistics module
David Mertz
mertz at gnosis.cx
Sun Jan 6 23:05:39 EST 2019
[... apologies if this is dup, got a bounce ...]
> [David Mertz <mertz at gnosis.cx>]
>> I have to say though that the existing behavior of
>> `statistics.median[_low|_high|]` is SURPRISING if not outright wrong. It
>> is the behavior in existing Python, but it is very strange.
>>
>> The implementation simply does whatever `sorted()` does, which is an
>> implementation detail. In particular, NaNs, being neither less than nor
>> greater than any floating point number, just stay where they are during
>> sorting.
>
> I expect you inferred that from staring at a handful of examples, but
> it's an illusion. Python's sort uses only __lt__ comparisons, and if
> those don't implement a total ordering then _nothing_ is defined about
> sort's result (beyond that it's some permutation of the original
> list).
Thanks, Tim, for clarifying. Is it even the case that sorts are STABLE in
the face of non-total orderings under __lt__? A couple of quick examples
don't refute that, but what I tried was not very thorough, nor did I
think much about TimSort itself.
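For what it's worth, here is a minimal sketch of the premise, not an answer to the stability question. It only shows that every comparison against a NaN is False (so __lt__ is not a total ordering) and checks the one property Tim says is guaranteed, namely that the result is some permutation of the input. The helper names `comparisons` and `is_permutation` are made up for illustration and are not part of any library.

```python
import math

nan = float("nan")

def comparisons(a, b):
    """Return the tuple (a < b, a > b, a == b)."""
    return (a < b, a > b, a == b)

def is_permutation(original, result):
    """Hypothetical helper: compare two lists as multisets.

    NaNs are counted separately because nan != nan, so they cannot be
    matched up by equality.
    """
    def split(xs):
        nans = sum(1 for x in xs if isinstance(x, float) and math.isnan(x))
        rest = sorted(x for x in xs
                      if not (isinstance(x, float) and math.isnan(x)))
        return nans, rest
    return split(original) == split(result)

data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
result = sorted(data)

print(comparisons(nan, 1.0))         # (False, False, False)
print(comparisons(nan, nan))         # (False, False, False)
print(is_permutation(data, result))  # True
```

Where the NaN ends up in `result` is deliberately not checked, since nothing about that position is promised.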
> So, certainly, if you want median to be predictable in the presence of
> NaNs, sort's behavior in the presence of NaNs can't be relied on in
> any respect.
Playing with Tim's examples suggests that statistics.median() is
simply outright WRONG. I can think of absolutely no way to characterize
these as reasonable results:
Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
Out[4]: 1
In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
Out[5]: nan
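One way to make the behavior predictable would be to deal with NaNs before the data ever reaches sorted(). The following is only a rough sketch under my own assumptions: the function name `median_with_nan_policy` and the policy names "raise", "ignore", and "propagate" are hypothetical, this is not the statistics module's API, and only float NaNs are detected (not Decimal NaNs).

```python
import math
import statistics

def _is_float_nan(x):
    """Return True only for float NaN values (assumption: floats only)."""
    return isinstance(x, float) and math.isnan(x)

def median_with_nan_policy(data, policy="raise"):
    """Hypothetical wrapper around statistics.median with explicit NaN handling.

    policy="raise":     raise ValueError if any NaN is present
    policy="ignore":    drop NaNs, then take the median of the rest
    policy="propagate": return NaN if any NaN is present
    """
    if policy not in ("raise", "ignore", "propagate"):
        raise ValueError(f"unknown policy: {policy!r}")
    values = list(data)
    if any(_is_float_nan(x) for x in values):
        if policy == "raise":
            raise ValueError("data contains NaN")
        if policy == "propagate":
            return float("nan")
        # policy == "ignore": sort only values that are totally ordered
        values = [x for x in values if not _is_float_nan(x)]
    return statistics.median(values)

nan = float("nan")
print(median_with_nan_policy([9, 9, 9, nan, 1, 2, 3, 4, 5], "ignore"))  # 4.5
print(median_with_nan_policy([9, 9, 9, nan, 1, 2, 3, 4], "ignore"))     # 4
print(median_with_nan_policy([9, 9, 9, nan, 1, 2, 3, 4], "propagate"))  # nan
```

With NaNs filtered out or rejected up front, the result no longer depends on where the NaN happened to sit in the input.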