[Python-ideas] Re: Fix statistics.median()?

30 Dec 2019

      On 12/29/19 1:16 AM, Christopher Barker wrote:
...
OMG! This is fun and all, but:
On Sat, Dec 28, 2019 at 9:11 PM Richard Damon 
<Richard@damon-family.org> wrote:
... practicality beats purity.
And practically, everyone in this thread understands what a float is, 
and what a NaN is and is not.
Richard: I am honestly confused about what you think we should do. 
Sure, you can justify why the statistics module doesn’t currently 
handle NaN’s well, but that doesn’t address the question of what it 
should do.
As far as I can tell, the only reasons for the current approach are 
ease of implementation and performance. Those are fine reasons, and 
they are why it was done that way in the first place.
But there seems to be (mostly) a consensus that it would be good to 
better handle NaNs in the statistics module.
I think the thing to do is decide what we want NaNs to mean: should 
they be interpreted as missing values or, essentially, as errors?
You’ve made a good case that None is the “right” thing to use for 
missing values — and could be used with int and other types. So yes, 
if the statistics module were to grow support for missing values, that 
could be the way to do it.
Which means that NaNs should either raise an exception or return NaN 
as a result. Those are options that are better than the current state.
Nevertheless, I think there is a practical argument for NaN-as-missing 
value. Granted, it is widely used in other systems because it can be 
stored in a float data type, and that is not required for Python. But 
it is widely used, so is familiar to many.
But if we don’t go that route, it would be good to provide 
NaN-filtering routines in the statistics module — as the related 
thread shows, NaN detection is not trivial.
Frankly, I’m also confused as to why folks seem to think this is an 
issue to be addressed in the sort() functions — those are way too 
general and low level to be expected to solve this. And it would be a 
much heavier lift to make a change that central to Python anyway.
-CHB
The way I see it, median doesn't handle NaNs in a reasonable way 
because sorted doesn't handle them. It is easy and quick not to handle 
NaN, and to handle them you need to define an Official meaning for 
them, and there are multiple reasonable meanings. The reason to push 
most solutions to sorted is that, except for ignore, which can easily 
be implemented as a data filter on the input of the function, the exact 
same problem occurs in multiple functions (in the statistics module, 
that would include quantile). So by the principle of DRY, that is the 
logical place to implement the solution (if we don't implement the 
solution as an input filter).
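
To make the dependency on sorted concrete, here is a small illustration 
(my own example values, not taken from the thread). It only shows that 
ordering comparisons with NaN are always False and that the position of 
a NaN in the input can leak into the result; the exact medians printed 
may depend on the sort implementation, so none are claimed here.

```python
import statistics

nan = float("nan")

# Every ordering comparison involving NaN is False, so < cannot place it.
comparisons = (nan < 1.0, nan > 1.0, nan == nan)
print(comparisons)

# sorted() relies on <, so where a NaN ends up can depend on where it started.
a = [1.0, 2.0, nan, 3.0, 4.0]
b = [nan, 1.0, 2.0, 3.0, 4.0]
print(sorted(a), sorted(b))

# Same multiset of values, but the reported medians need not agree.
print(statistics.median(a), statistics.median(b))
```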

At its beginning, the statistics module disclaims being a complete, 
all-encompassing statistics package, and suggests using one if you need 
more advanced features, which I would consider most processing of NaN 
to be included in. One big reason NOT to fix the issue with NaNs in 
median is that such a fix likely has a measurable impact on the 
processing of the median. I suspect that the simplest solution, and one 
that doesn't impact other uses, would be simple filter functions (and 
perhaps median could be defined with an argument for what function to 
use, with a None option that would be fairly quick). One filter would 
remove NaNs (or None), one would throw an exception if there is a NaN, 
and another would just return the sequence [nan] if there are any NaNs 
in the input sequence (so the median would be nan). The same options 
could be added to other operations like quantile, which has a similar 
issue, and made available to the program for other uses.
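
A minimal sketch of the three filters described above, assuming float 
inputs only. The names ignore_nans, raise_on_nans, poison_on_nans and 
the nan_policy argument are hypothetical; none of them exist in the 
statistics module, and the wrapper below is not the module's median. 
NaN detection for other types (Decimal, for instance) is deliberately 
left out.

```python
import math
import statistics


def _is_nan(x):
    # Simplification: only float NaNs are detected in this sketch.
    return isinstance(x, float) and math.isnan(x)


def ignore_nans(data):
    """Drop NaNs (and None) before the statistic is computed."""
    return [x for x in data if x is not None and not _is_nan(x)]


def raise_on_nans(data):
    """Refuse to compute anything if a NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN in input data")
    return data


def poison_on_nans(data):
    """Collapse the input to [nan] so the statistic itself comes out NaN."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        return [math.nan]
    return data


def median(data, nan_policy=None):
    """Hypothetical wrapper: nan_policy=None skips filtering entirely."""
    if nan_policy is not None:
        data = nan_policy(data)
    return statistics.median(data)


print(median([1.0, math.nan, 3.0], nan_policy=ignore_nans))
print(median([1.0, math.nan, 3.0], nan_policy=poison_on_nans))
```

With nan_policy=None the call costs nothing extra, which is the "fairly 
quick" default; the same filters could be passed to other functions.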

There is one other option that might make it possible to fix sorted: 
the IEEE spec does define another comparison function that could be 
used by sorted to sort the numbers, called something like 
total_order(a, b), which returns true if a has a total order less than 
b. The total order is defined such that it acts like < for normal 
numbers, but also provides a total order for values that < doesn't work 
as well for (including equal values that have different 
representations, like -0 < +0 in the total_order, though they are == in 
the normal order). total_order defines positive NaNs to be greater than 
infinity (and negative NaNs less than negative infinity), with NaNs of 
differing representations ordered by their representation, which puts 
the quiet NaNs on the extremes, beyond the sNaNs.
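
For illustration, an ordering of this kind can be approximated in pure 
Python with a sort key that reinterprets the 64-bit pattern of a float 
as an integer. This is a sketch of the idea, not something sorted or 
float provides; total_order_key is a name made up here, and it assumes 
IEEE binary64 floats.

```python
import math
import struct


def total_order_key(x):
    """Map a float to an int whose natural order follows the bit-level total order."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    if bits >> 63:
        # Sign bit set: larger magnitude means smaller value, so flip all bits.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit so positives sort above all negatives.
    return bits | 0x8000000000000000


pos_nan = math.copysign(math.nan, 1.0)
values = [pos_nan, 1.0, -math.inf, math.inf, 0.0, -0.0, -1.0]
print(sorted(values, key=total_order_key))
```

In this key -0.0 sorts before +0.0, a positive NaN sorts after +inf, and 
a negative NaN sorts before -inf.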

To do this, float would need to define a dunder (maybe __ltto__) for 
total order compare, which sorted would use instead of __lt__ if it 
exists (and either sorted does the fallback, or object just defaults 
this new dunder to call __lt__ if not overridden). Having object do the 
fallback would allow classes like set to remove this new dunder so 
sorted generates an error if you try to sort them, since in general, 
sets don't provide anything close to a total order with <.
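
A rough pure-Python model of that protocol, to show the fallback 
behaviour. Everything here is hypothetical: __ltto__ is only the name 
floated above, and total_sorted and TotalFloat are stand-ins invented 
for this sketch (the real proposal would change sorted and float 
themselves). The variant modelled is the one where the sorting function 
does the fallback; the set case is not modelled.

```python
import functools
import math
import struct


def _bits_key(x):
    # Integer whose order follows the bit-level total order of binary64 floats.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits ^ 0xFFFFFFFFFFFFFFFF if bits >> 63 else bits | (1 << 63)


class TotalFloat(float):
    """Stand-in for a float that implements the hypothetical __ltto__."""

    def __ltto__(self, other):
        if not isinstance(other, float):
            return NotImplemented
        return _bits_key(self) < _bits_key(other)


def _less(a, b):
    # Prefer __ltto__ when the type defines it, otherwise fall back to <.
    method = getattr(type(a), "__ltto__", None)
    if method is not None:
        result = method(a, b)
        if result is not NotImplemented:
            return result
    return a < b


def total_sorted(iterable):
    """Stand-in for a sorted() that honours __ltto__."""
    def cmp(a, b):
        if _less(a, b):
            return -1
        if _less(b, a):
            return 1
        return 0
    return sorted(iterable, key=functools.cmp_to_key(cmp))


nan = TotalFloat(math.copysign(math.nan, 1.0))
print(total_sorted([nan, TotalFloat(2.0), TotalFloat(1.0)]))
print(total_sorted([3, 1, 2]))  # ints have no __ltto__, so < is used
```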

This would mean that sorted would work with NaNs, but for median most 
NaNs are treated as more positive than infinity, so the median is 
biased; at least you don't get absurd results. My expectation would be 
that, if written in C or assembly, the total order comparison of two 
floats would be fast: maybe not as fast as a simple compare, but just a 
couple of machine instructions, so small compared to the other code in 
the loop.
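
To show what "biased but not absurd" looks like, here is a sketch of a 
median computed over a total-order sort. The function names are made up 
for this example and the key assumes IEEE binary64 floats; it is not how 
statistics.median is implemented.

```python
import math
import struct


def total_order_key(x):
    # Integer whose order follows the bit-level total order of binary64 floats.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits ^ 0xFFFFFFFFFFFFFFFF if bits >> 63 else bits | (1 << 63)


def median_total_order(data):
    """Median taken after sorting with the total-order key."""
    ordered = sorted((float(x) for x in data), key=total_order_key)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


nan = math.copysign(math.nan, 1.0)
# Positive NaNs sort to the top, so the result no longer depends on input order.
print(median_total_order([nan, 1.0, 2.0, nan, 3.0]))  # 3.0
print(median_total_order([3.0, nan, 2.0, 1.0, nan]))  # 3.0
```

The median of the three real values is 2.0, so the two NaNs pull the 
answer up to 3.0, but the answer is the same for every ordering of the 
input.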

-- 
Richard Damon