
On 28.08.2021 05:32, Steven D'Aprano wrote:
On Thu, Aug 26, 2021 at 09:36:27AM +0200, Marc-Andre Lemburg wrote:
Indeed. The NAN handling in median() looks like a bug, more than anything else:
[slightly paraphrased]
l1 = [1,2,nan,4] l2 = [nan,1,2,4]
statistics.median(l1) nan statistics.median(l2) 1.5
Looks can be deceiving, it's actually a feature *wink*
That behaviour is actually the direct consequence of NANs being unordered. The IEEE-754 standard requires that comparisons with NANs all return False (apart from not-equal, which returns True). So NANs are neither less than, equal to, or greater than other values.
Which makes sense numerically, NANs do not appear on the number line and are not ordered with numbers.
So when you sort a list containing NANs, they end up in some arbitrary position that depends on the sort implementation, the other values in the list, and their initial position. NANs can even throw out the order of other values:
sorted([3, nan, 4, 2, nan, 1]) [3, nan, 1, 2, 4, nan]
and *that* violates `median`'s assumption that sorting values actually puts them in sorted order, which is why median returns the wrong value.
I don't think that Timsort is buggy here. I expect that every sort algorithm on the planet will require a Total Order to get sensible results, and NANs violate that expectation.
https://eli.thegreenplace.net/2018/partial-and-total-orders/
If we define the less than operator `<` as "isn't greater than (or equal to)", then we can see that sorted is *locally* correct:
* 3 isn't greater than nan; * nan isn't greater than 1; * 1 isn't greater than 2; * 2 isn't greater than 4; * and 4 isn't greater than nan.
sorted() has correctly sorted the values in the sense that the invariant "a comes before b iff a isn't greater than b" is satisfied between each pair of consecutive values, but globally the order is violated because NAN's are unordered and mess up transitivity:
3 isn't greater than NAN, and NAN isn't greater than 1, but it is not true that 3 isn't greater than 1.
In the general case of sorting elements, I think that the solution is "don't do that". If you have objects which don't form a total order, then you can't expect to get sensible results from sorting them.
In the case of floats, it would be nice to have a totalOrder function as specified in the 2008 revision of IEEE-354:
https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf
Then we could sensibly do:
sorted(floats_or_decimals, key=totalorder)
and at least NANs would end up in a consistent place and everything else sorted correctly.
Thanks for the analysis. To me, the behavior looked a lot like stripping NANs left and right from the list, but what you're explaining makes this appear even more as a bug in the implementation of median() - basically wrong assumptions about NANs sorting correctly. The outcome could be more or less random, it seems. In SQL NULL always sort smaller than anything else. Perhaps that would be a strategy to use here as well. The totalOrder predicate in the IEEE spec would make NANs get shifted to the left or right part of the sequence, depending on the NAN sign. In any case, +1 on anything which fixes this :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Aug 28 2021)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/