Re: [Python-ideas] NAN handling in the statistics module
[... apologies if this is dup, got a bounce ...]
[David Mertz <mertz@gnosis.cx>]
I have to say though that the existing behavior of `statistics.median[_low|_high|]` is SURPRISING if not outright wrong. It is the behavior in existing Python, but it is very strange.
The implementation simply does whatever `sorted()` does, which is an implementation detail. In particular, NaNs, being neither less than nor greater than any floating point number, just stay where they are during sorting.
I expect you inferred that from staring at a handful of examples, but it's an illusion. Python's sort uses only __lt__ comparisons, and if those don't implement a total ordering then _nothing_ is defined about sort's result (beyond that it's some permutation of the original list).
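To make the point about `__lt__` concrete, here is a minimal sketch (not part of the original exchange) showing that every ordering comparison involving a NaN is false, so `<` is not a total ordering on a list that contains one. The names `nan`, `data`, and `result` are only illustrative, and the sketch deliberately makes no claim about where the NaN ends up after sorting, since that is not defined.

```python
nan = float("nan")

# NaN is neither less than, greater than, nor equal to any float,
# so __lt__ cannot define a total ordering once a NaN is present.
print(nan < 1.0)   # False
print(1.0 < nan)   # False
print(nan == nan)  # False

data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
result = sorted(data)

# The only property that can be relied on: result is some permutation
# of data. The position of the NaN, and the order of the remaining
# elements, is not something to depend on.
print(len(result) == len(data))
```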
Thanks Tim for clarifying. Is it even the case that sorts are STABLE in the face of non-total orderings under __lt__? A couple quick examples don't refute that, but what I tried was not very thorough, nor did I think much about TimSort itself.
So, certainly, if you want median to be predictable in the presence of NaNs, sort's behavior in the presence of NaNs can't be relied on in any respect.
Playing with Tim's examples, this suggests that statistics.median() is simply outright WRONG. I can think of absolutely no way to characterize these as reasonable results:

    Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
    In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
    Out[4]: 1
    In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
    Out[5]: nan
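For anyone who needs a predictable answer today, one possible workaround is to strip the NaNs before the data ever reaches `sorted()`. The sketch below is only an illustration of that idea, not a proposal made in this thread and not how the statistics module behaves; `median_ignoring_nan` is a hypothetical helper name, and silently dropping NaNs is an assumption about what the caller wants (raising an error or returning NaN are equally defensible policies).

```python
import math
import statistics


def median_ignoring_nan(values):
    """Hypothetical helper: drop NaNs, then take the ordinary median.

    Assumes that ignoring NaNs is the desired policy.
    """
    # Only floats can be NaN; ints and other numeric types pass through.
    cleaned = [v for v in values
               if not (isinstance(v, float) and math.isnan(v))]
    # With the NaNs removed, < is a total ordering again, so the
    # sort inside statistics.median() is well defined.
    return statistics.median(cleaned)


nan = float("nan")
print(median_ignoring_nan([9, 9, 9, nan, 1, 2, 3, 4, 5]))  # 4.5
print(median_ignoring_nan([9, 9, 9, nan, 1, 2, 3, 4]))     # 4
```

Note that if every value is a NaN, the cleaned list is empty and `statistics.median()` raises `StatisticsError`, which at least fails loudly instead of returning an arbitrary element.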