On 12/29/19 1:16 AM, Christopher Barker wrote:
OMG! This is fun and all, but:
On Sat, Dec 28, 2019 at 9:11 PM Richard Damon <Richard@damon-family.org> wrote:
... practicality beats purity.
And practically, everyone in this thread understands what a float is, and what a NaN is and is not.
Richard: I am honestly confused about what you think we should do. Sure, you can justify why the statistics module doesn’t currently handle NaN’s well, but that doesn’t address the question of what it should do.
As far as I can tell, the only reasons for the current approach are ease of implementation and performance. Which are fine reasons, and why it was done that way in the first place.
But there seems to be (mostly) a consensus that it would be good to better handle NaNs in the statistics module.
I think the thing to do is decide what we want NaNs to mean: should they be interpreted as missing values or, essentially, errors?
You’ve made a good case that None is the “right” thing to use for missing values — and could be used with int and other types. So yes, if the statistics module were to grow support for missing values, that could be the way to do it.
Which means that NaNs should either raise an exception or return NaN as a result. Those are options that are better than the current state.
Nevertheless, I think there is a practical argument for NaN-as-missing value. Granted, it is widely used in other systems because it can be stored in a float data type, and that is not required for Python. But it is widely used, so is familiar to many.
But if we don’t go that route, it would be good to provide NaN-filtering routines in the statistics module — as the related thread shows, NaN detection is not trivial.
Frankly, I’m also confused as to why folks seem to think this is an issue to be addressed in the sort() functions — those are way too general and low level to be expected to solve this. And it would be a much heavier lift to make a change that central to Python anyway.
-CHB
The way I see it, median doesn't handle NaNs in a reasonable way because sorted doesn't handle them, and sorted doesn't handle them because it is easy and quick not to. To handle them you need to define an official meaning for them, and there are multiple reasonable meanings.

The reason to push most solutions to sorted is that, except for "ignore" (which can easily be implemented as a data filter on the input of the function), the exact same problem occurs in multiple functions (in the statistics module, that would include quantile). So by the principle of DRY, that is the logical place to implement the solution, if we don't implement the solution as an input filter.

At its beginning, the statistics module documentation disclaims being a complete, all-encompassing statistics package, and suggests using one if you need more advanced features, which I would consider most processing of NaN to be.

One big reason NOT to fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median.

I suspect that the simplest solution, and one that doesn't impact other uses, would be simple filter functions (and perhaps median could be defined with an argument for which function to use, with a None option that would be fairly quick). One filter would remove NaNs (or None), one would throw an exception if there is a NaN, and another would just return the sequence [nan] if there are any NaNs in the input sequence (so the median would be nan). The same options could be added to other operations like quantile, which has a similar issue, and made available to the program for other use.
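To make the filter idea concrete, here is a minimal sketch. The names nan_ignore, nan_raise, nan_propagate and the nan_policy argument are hypothetical, made up for illustration; nothing like them exists in the statistics module. The sketch also assumes NaNs only show up as Python floats (Decimal NaNs are not handled).

```python
import math
import statistics


def _is_nan(x):
    # Only float NaNs are detected here; this is a simplifying assumption.
    return isinstance(x, float) and math.isnan(x)


def nan_ignore(data):
    """Hypothetical filter: drop NaN (and None) entries."""
    return [x for x in data if x is not None and not _is_nan(x)]


def nan_raise(data):
    """Hypothetical filter: raise if any NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN in input data")
    return data


def nan_propagate(data):
    """Hypothetical filter: collapse the input to [nan] if any NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        return [math.nan]
    return data


def median(data, nan_policy=None):
    """Wrapper around statistics.median with an optional input filter.

    With nan_policy=None the data is passed through untouched, so the
    existing (fast) behavior is kept.
    """
    if nan_policy is not None:
        data = nan_policy(data)
    return statistics.median(data)
```

With this, median([1.0, float("nan"), 3.0], nan_policy=nan_ignore) computes the median of the two remaining values, and the same filters could be applied by hand before calling any other function.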
There is one other option that might make it possible to fix sorted: the IEEE spec defines another comparison function that sorted could use to sort the numbers, called something like total_order(a, b), which returns true if a has a total order less than b. The total order is defined so that it acts like < for normal numbers, but also provides a total order for values that < doesn't work as well for (including equal values that have different representations: -0 < +0 in the total order, but they are == in the normal order). total_order defines positive NaNs to be greater than infinity (and negative NaNs less than negative infinity), with NaNs of differing representations ordered by their representation, which puts the quiet NaNs on the extremes, beyond the sNaNs.

To do this, float would need to define a dunder (maybe __ltto__) for the total-order compare, which sorted would use instead of __lt__ if it exists (and either sorted does the fallback, or object just defaults this new dunder to call __lt__ if not overridden). Having object do the fallback would allow classes like set to remove this new dunder, so sorted generates an error if you try to sort them, since in general sets don't provide anything close to a total order with <.

This would mean that sorted would work with NaNs, but for median most NaNs are treated as more positive than infinity, so the median is biased; at least you don't get absurd results.

My expectation is that, if written in C or assembly, the total order comparison of two floats would be fast: maybe not as fast as a simple compare, but just a couple of machine instructions, so small compared to the other code in the loop.

--
Richard Damon
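For illustration, the total-order idea can be approximated today without a new dunder, by passing a key function to sorted. This is only a sketch: total_order_key is a made-up name, and it assumes floats are IEEE 754 binary64 and that struct round-trips the bit pattern unchanged (which may not hold for signaling NaNs on every platform).

```python
import math
import struct

_SIGN_BIT = 1 << 63
_ALL_BITS = (1 << 64) - 1


def total_order_key(x):
    """Map a float to an int that sorts in IEEE 754 total order.

    Hypothetical helper, not part of any standard library API.
    """
    # Reinterpret the 64-bit float as an unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", float(x)))
    if bits & _SIGN_BIT:
        # Negative sign: invert every bit so larger magnitudes sort lower.
        return bits ^ _ALL_BITS
    # Positive sign: set the top bit so positives sort above all negatives.
    return bits | _SIGN_BIT


data = [math.nan, 3.0, -math.inf, 0.0, -0.0, math.inf, 1.0]
ordered = sorted(data, key=total_order_key)
```

Here a positive NaN sorts after +inf and -0.0 sorts before 0.0, so the result no longer depends on where the NaN happened to sit in the input. The key costs a few integer operations per element, which matches the expectation that the comparison itself is cheap.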