
[David Mertz <mertz@gnosis.cx>]
I have to say though that the existing behavior of `statistics.median[_low|_high|]` is SURPRISING if not outright wrong. It is the behavior in existing Python, but it is very strange.
The implementation simply does whatever `sorted()` does, which is an implementation detail. In particular, NaN's being neither less than nor greater than any floating point number, just stay where they are during sorting.
I expect you inferred that from staring at a handful of examples, but it's illusion. Python's sort uses only __lt__ comparisons, and if those don't implement a total ordering then _nothing_ is defined about sort's result (beyond that it's some permutation of the original list). There's nothing special about NaNs in this. For example, if you sort a list of sets, then "<" means subset inclusion, which doesn't define a total ordering among sets in general either (unless for every pair of sets in a specific list, one is a proper subset of the other - in which case the list of sets will be sorted in order of increasing cardinality).
But that's a particular feature of TimSort. Yes, we are guaranteed that sorts are stable; and we have rules about which things can and cannot be compared for inequality at all. But beyond that, I do not think Python ever promised that NaNs would remain in the same positions after sorting
We don't promise it, and it's not true. For example,
import math nan = math.nan xs = [0, 1, 2, 4, nan, 5, 3] sorted(xs) [0, 1, 2, 3, 4, nan, 5]
The NaN happened to move "one place to the right" there. There's no point to analyzing "why" - it's purely an accident deriving from the pattern of __lt__ outcomes the internals happened to invoke. FYI, it goes like so: is 1 < 0? No, so the first two are already sorted. is 2 < 1? No, so the first three are already sorted. is 4 < 2? No, so the first four are already sorted is nan < 4? No, so the first five are already sorted is 5 < nan? No, so the first six are already sorted is 3 < 5? Yes! At that point a binary insertion is used to move 3 into place. And none of timsort's "fancy" parts even come into play for lists so small. The patterns of comparisons the fancy parts invoke can be much more involved. At no point does the algorithm have any idea that there are NaNs in the list - it only looks at boolean __lt__ outcomes. So, certainly, if you want median to be predictable in the presence of NaNs, sort's behavior in the presence of NaNs can't be relied on in any respect.
sorted([6, 5, nan, 4, 3, 2, 1]) [1, 2, 3, 4, 5, 6, nan]
sorted([9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6]) [9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6]