On Sun, Dec 29, 2019 at 06:22:49PM -0500, Richard Damon wrote:
> The way I see it is that median doesn't handle NaNs in a reasonable way, because sorted doesn't handle them, because it is easy and quick to not handle NaN, and to handle them you need to define an Official meaning for them, and there are multiple reasonable meanings.
There is already an official way to handle NAN comparisons, and Python follows that official standard, and so does sorted. It might be annoying, it might be inconvenient, it might be surprising to people's intuition (if they only think about values with a total order and not the possibility of a partial order), but the way NANs end up scattered seemingly at random throughout a sequence is a logical and correct consequence of the semantics of NANs and the way sorting operates. (The first snippet at the end of this post demonstrates those semantics.)

It might be surprising, but what are the alternatives? We can:

(1) Detect sets of values which make a partial order and raise an exception or warning. But that's *really* expensive: it is at least O(N**2), since it requires comparing every element with every other element. (A sketch of such a check appears after this list.)

(2) Don't bother exhaustively comparing every element to every other element, but when sorting, each time you compare two elements, check for a partial order between those two elements alone. But that too increases the cost of sorting, it has many false negatives (you miss a lot of partially ordered sets), and it breaks the invariant that sorting requires only the less-than operator. See https://bugs.python.org/issue36095 for more details.

(3) Don't try to detect all sets of values with a partial order, but single out NANs only. That violates the "special cases aren't special enough" koan in the Zen, and it will slow down sorting for everyone even if they don't have NANs, but okay, maybe this special case *is* special enough. So what do we do now?

(4) There's a NAN in your data, so sort() will raise. That's a bit extreme. It will annoy and frustrate people for whom the odd placement of NANs is not a problem they care about. E.g. if I want to process a set of numbers in order from smallest to largest, I don't care if a NAN happens to get placed between 3.5 and 4.25.

(5) Move all the NANs to the front of the sequence! No, the end of the sequence! Half to the front and half to the back! Order them according to their sign bit! No, order them according to the payload! Or both! That's already six ways to order NANs. Let's focus on the two obvious ones:

(6) Move all NANs to the start of the list. Or the end of the list. (The sort-key snippet below shows one way to do that without touching sort() itself.) Either way, you artificially decrease or increase the median:

    [NAN, NAN, NAN, NAN, 1, 2, 3, 4, 5, 6, 7]  # median 2
    [1, 2, 3, 4, 5, 6, 7, NAN, NAN, NAN, NAN]  # median 6

(7) No no no, once you have sorted the NANs to one end or the other, median can check for the existence of even a single NAN, and then raise or return NAN or strip them out or do whatever you like. Great, but we can do the same without changing sort in any way. Doing this in sort when it could be done elsewhere is the wrong place, and probably less efficient too. Doing it in median, rather than sort, lets you short-circuit as soon as you see a single NAN. (The median sketch below shows the idea.)

So changing the way sort() works doesn't seem very promising. All the alternatives have serious downsides that outweigh the benefits.

(8) How about fixing NANs? (For some definition of "fixing".) That would violate the standard, and would change the behaviour of any code that uses comparisons, with potentially far-reaching consequences. And it glosses over the question of *how* we should change NANs to order differently. Raise? Less than everything? Greater than everything? Sort between negatives and zero? Justify your choice.
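First, the comparison semantics. Every comparison involving a NAN is False, so sort() gets no useful information from a NAN and it ends up acting as a barrier between sorted runs:

    import math

    nan = math.nan
    print(nan < 1, nan > 1, nan == nan)  # False False False

    # The exact result depends on the order of comparisons the sort
    # algorithm happens to make, but it will not be fully sorted:
    print(sorted([3, nan, 1, nan, 2]))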
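Second, a sketch of the exhaustive check from alternative (1), just to show why it costs O(N**2). (has_unordered_pair is a hypothetical name, not anything in the stdlib.)

    import itertools

    def has_unordered_pair(values):
        # Compare every pair of elements: if neither <, > nor == holds
        # for some pair (as with any float and a NAN), the data is not
        # totally ordered.
        for a, b in itertools.combinations(values, 2):
            if not (a < b or a > b or a == b):
                return True
        return False

    print(has_unordered_pair([1.0, 2.0, float("nan")]))  # True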
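Third, for alternative (6), one way to force the NANs to one end is a key function; again, just a sketch, not a proposal for sort() itself:

    import math

    def sort_nans_last(values):
        # "Is this a NAN?" becomes the primary key, so False (ordinary
        # values) sorts before True (NANs). Use `not math.isnan(x)` as
        # the first key to push them to the front instead.
        return sorted(values, key=lambda x: (math.isnan(x), x))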
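And finally, alternative (7) done in median rather than sort, which is where the short-circuit pays off. (median_or_raise is a hypothetical wrapper, not the actual statistics implementation.)

    import math
    import statistics

    def median_or_raise(values):
        values = list(values)
        for x in values:
            # Bail out on the first NAN, before paying for the sort.
            if math.isnan(x):
                raise ValueError("NAN in data")
        return statistics.median(values)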
> One big reason to NOT fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median.
I just did some quick and dirty tests to try to estimate the likely cost of NAN processing on median(). I get a slowdown of between 4x and 8x by applying NAN processing. For mean(), I get a slowdown of a factor of between 1.4 and 1.7.
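For anyone who wants to try something similar, the shape of the test was roughly this. (A sketch, not my exact script; median_with_nan_check is a stand-in for whatever NAN-processing variant you want to measure, and don't expect it to reproduce my exact numbers.)

    import math
    import statistics
    import timeit

    data = [float(i) for i in range(10001)]

    def median_with_nan_check(values):
        # Stand-in NAN-processing variant for the comparison.
        if any(math.isnan(x) for x in values):
            raise ValueError("NAN in data")
        return statistics.median(values)

    for func in (statistics.median, median_with_nan_check):
        t = timeit.timeit(lambda: func(data), number=100)
        print(func.__name__, t)

-- Steven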