
On Thu, Aug 26, 2021 at 09:36:27AM +0200, Marc-Andre Lemburg wrote:
Indeed. The NAN handling in median() looks like a bug, more than anything else:
[slightly paraphrased]
l1 = [1,2,nan,4] l2 = [nan,1,2,4]
statistics.median(l1) nan statistics.median(l2) 1.5
Looks can be deceiving, it's actually a feature *wink* That behaviour is actually the direct consequence of NANs being unordered. The IEEE-754 standard requires that comparisons with NANs all return False (apart from not-equal, which returns True). So NANs are neither less than, equal to, or greater than other values. Which makes sense numerically, NANs do not appear on the number line and are not ordered with numbers. So when you sort a list containing NANs, they end up in some arbitrary position that depends on the sort implementation, the other values in the list, and their initial position. NANs can even throw out the order of other values:
sorted([3, nan, 4, 2, nan, 1]) [3, nan, 1, 2, 4, nan]
and *that* violates `median`'s assumption that sorting values actually puts them in sorted order, which is why median returns the wrong value. I don't think that Timsort is buggy here. I expect that every sort algorithm on the planet will require a Total Order to get sensible results, and NANs violate that expectation. https://eli.thegreenplace.net/2018/partial-and-total-orders/ If we define the less than operator `<` as "isn't greater than (or equal to)", then we can see that sorted is *locally* correct: * 3 isn't greater than nan; * nan isn't greater than 1; * 1 isn't greater than 2; * 2 isn't greater than 4; * and 4 isn't greater than nan. sorted() has correctly sorted the values in the sense that the invariant "a comes before b iff a isn't greater than b" is satisfied between each pair of consecutive values, but globally the order is violated because NAN's are unordered and mess up transitivity: 3 isn't greater than NAN, and NAN isn't greater than 1, but it is not true that 3 isn't greater than 1. In the general case of sorting elements, I think that the solution is "don't do that". If you have objects which don't form a total order, then you can't expect to get sensible results from sorting them. In the case of floats, it would be nice to have a totalOrder function as specified in the 2008 revision of IEEE-354: https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf Then we could sensibly do: sorted(floats_or_decimals, key=totalorder) and at least NANs would end up in a consistent place and everything else sorted correctly. -- Steve