On Sun, Dec 29, 2019 at 06:22:49PM -0500, Richard Damon wrote:
> The way I see it is that median doesn't handle NaNs in a reasonable way, because sorted doesn't handle them, because it is easy and quick to not handle NaN, and to handle them you need to define an Official meaning for them, and there are multiple reasonable meanings.
There is already an official way to handle NAN comparisons, and Python follows that official standard, and so does sorted. It might be annoying, it might be inconvenient, it might be surprising to people's intuition (if they only think about values with a total order and not the possibility of a partial order), but the way NANs end up scattered seemingly at random throughout a sequence is a logical and correct consequence of the semantics of NANs and the way sorting operates. (The first snippet at the end of this post demonstrates those semantics.)

It might be surprising, but what are the alternatives? We can:

(1) Detect sets of values which make a partial order and raise an exception or warning. But that's *really* expensive: it is at least O(N**2), since it requires comparing every element with every other element. (A sketch of such a check appears after this list.)

(2) Don't bother exhaustively comparing every element to every other element, but when sorting, each time you compare two elements, check for a partial order between those two elements alone. But that too increases the cost of sorting, it has many false negatives (you miss a lot of partially ordered sets), and it breaks the invariant that sorting requires only the less-than operator. See https://bugs.python.org/issue36095 for more details.

(3) Don't try to detect all sets of values with a partial order, but single out NANs only. That violates the "special cases aren't special enough" koan in the Zen, and it will slow down sorting for everyone even if they don't have NANs, but okay, maybe this special case *is* special enough. So what do we do now?

(4) There's a NAN in your data, so sort() will raise. That's a bit extreme. It will annoy and frustrate people for whom the odd placement of NANs is not a problem they care about. E.g. if I want to process a set of numbers in order from smallest to largest, I don't care if a NAN happens to get placed between 3.5 and 4.25.

(5) Move all the NANs to the front of the sequence! No, the end of the sequence! Half to the front and half to the back! Order them according to their sign bit! No, order them according to the payload! Or both! That's already six ways to order NANs. Let's focus on the two obvious ones:

(6) Move all NANs to the start of the list. Or the end of the list. (The sort-key snippet below shows one way to do that without touching sort() itself.) Either way, you artificially decrease or increase the median:

    [NAN, NAN, NAN, NAN, 1, 2, 3, 4, 5, 6, 7]  # median 2
    [1, 2, 3, 4, 5, 6, 7, NAN, NAN, NAN, NAN]  # median 6

(7) No no no, once you have sorted the NANs to one end or the other, median can check for the existence of even a single NAN, and then raise or return NAN or strip them out or do whatever you like. Great, but we can do the same without changing sort in any way. Doing this in sort when it could be done elsewhere is the wrong place, and probably less efficient too. Doing it in median, rather than sort, lets you short-circuit as soon as you see a single NAN. (The median sketch below shows the idea.)

So changing the way sort() works doesn't seem very promising. All the alternatives have serious downsides that outweigh the benefits.

(8) How about fixing NANs? (For some definition of "fixing".) That would violate the standard, and would change the behaviour of any code that uses comparisons, with potentially far-reaching consequences. And it glosses over the question of *how* we should change NANs to order differently. Raise? Less than everything? Greater than everything? Sort between negatives and zero? Justify your choice.
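First, the comparison semantics. Every comparison involving a NAN is False, so sort() gets no useful information from a NAN and it ends up acting as a barrier between sorted runs:

    import math

    nan = math.nan
    print(nan < 1, nan > 1, nan == nan)  # False False False

    # The exact result depends on the order of comparisons the sort
    # algorithm happens to make, but it will not be fully sorted:
    print(sorted([3, nan, 1, nan, 2]))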
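Second, a sketch of the exhaustive check from alternative (1), just to show why it costs O(N**2). (has_unordered_pair is a hypothetical name, not anything in the stdlib.)

    import itertools

    def has_unordered_pair(values):
        # Compare every pair of elements: if neither <, > nor == holds
        # for some pair (as with any float and a NAN), the data is not
        # totally ordered.
        for a, b in itertools.combinations(values, 2):
            if not (a < b or a > b or a == b):
                return True
        return False

    print(has_unordered_pair([1.0, 2.0, float("nan")]))  # True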
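Third, for alternative (6), one way to force the NANs to one end is a key function; again, just a sketch, not a proposal for sort() itself:

    import math

    def sort_nans_last(values):
        # "Is this a NAN?" becomes the primary key, so False (ordinary
        # values) sorts before True (NANs). Use `not math.isnan(x)` as
        # the first key to push them to the front instead.
        return sorted(values, key=lambda x: (math.isnan(x), x))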
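And finally, alternative (7) done in median rather than sort, which is where the short-circuit pays off. (median_or_raise is a hypothetical wrapper, not the actual statistics implementation.)

    import math
    import statistics

    def median_or_raise(values):
        values = list(values)
        for x in values:
            # Bail out on the first NAN, before paying for the sort.
            if math.isnan(x):
                raise ValueError("NAN in data")
        return statistics.median(values)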
> One big reason to NOT fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median.
I just did some quick and dirty tests to try to estimate the likely cost of NAN processing on median(). I get a slowdown of between 4x and 8x by applying NAN processing. For mean(), I get a slowdown of a factor of between 1.4 and 1.7.
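For anyone who wants to try something similar, the shape of the test was roughly this. (A sketch, not my exact script; median_with_nan_check is a stand-in for whatever NAN-processing variant you want to measure, and don't expect it to reproduce my exact numbers.)

    import math
    import statistics
    import timeit

    data = [float(i) for i in range(10001)]

    def median_with_nan_check(values):
        # Stand-in NAN-processing variant for the comparison.
        if any(math.isnan(x) for x in values):
            raise ValueError("NAN in data")
        return statistics.median(values)

    for func in (statistics.median, median_with_nan_check):
        t = timeit.timeit(lambda: func(data), number=100)
        print(func.__name__, t)

-- Steven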