
On Dec 30, 2019, at 08:55, David Mertz <mertz@gnosis.cx> wrote:
>> Presumably the end user (unlike the statistics module) knows what data they have.
>
> No, Steven is right here. In Python we might very sensibly mix numeric datatypes.
The statistics module explicitly doesn’t support doing so. Which means anyone who’s doing it anyway is in “experienced user” territory, and ought to know what they’re doing.

At any rate, I wasn’t arguing that we don’t need a NaN test function in statistics. My point—lost by snipping off all the context—was nearly the opposite. The fact that you can NaN-filter things yourself (more easily than the statistics module can) doesn’t mean the module shouldn’t offer an ignore option—and, by the same token, the fact that you can DSU things yourself (less easily than using a key function) doesn’t mean the module shouldn’t offer a key parameter.

(There may be other good arguments against a key parameter. The fact that all three of the alternate orders anyone’s asked for or suggested turned out to be spurious, and that nobody can think of a good use for a different one, is a pretty good argument for YAGNI. But that doesn’t make the bogus argument from “theoretically you could do it yourself, so we don’t need to offer it no matter how useful” any less bogus.)
> But this means we need an `is_nan()` function like some discussed in these threads, not rely on a method (and not the same behavior as math.isnan()).
Wait, what’s wrong with the behavior of math.isnan for floats? If you want a NaN test that differs from the one defined by IEEE, I think we’re off into uncharted waters.

Let’s get concrete: say we have a function that tries the is_nan() method, and, on exception, tries math.isnan for floats, returns False for other Numbers, and finally raises a TypeError if all of the above failed. (If this were a general thing rather than a statistics thing, add trying cmath too.) What values of what types does that not serve?

People keep trying to come up with “better” NaN tests than the obvious one, but better for what? If you don’t have an actual problem to solve, what use is a solution, no matter how clever?
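In case it helps to pin that down, here’s a rough sketch of the function just described. The name is_nan and the exact fallback order are my assumptions for illustration, not settled API:

```python
import math
import numbers
from decimal import Decimal

def is_nan(x):
    """Sketch of the NaN test described above: try the type's own
    is_nan() method first, fall back to math.isnan for floats, treat
    any other Number as never-NaN, and reject everything else."""
    try:
        return x.is_nan()          # Decimal, or anything duck-typed like it
    except AttributeError:
        pass
    if isinstance(x, float):
        return math.isnan(x)       # the IEEE test for float NaN
    if isinstance(x, numbers.Number):
        return False               # int, Fraction, etc. have no NaN
    raise TypeError(f"can't test for NaN: {x!r}")
```

(A general-purpose version would also try cmath for complex; statistics doesn’t support complex, so this sketch treats complex as just another Number.)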
> E.g.:
>
>     my_data = {
>         'observation1': 10**400,   # really big amount
>         'observation2': 1,         # ordinary size
>         'observation3': 2.0,       # ordinary size
>         'observation4': math.nan,  # missing data
>     }
>
>     median = statistics.median_high(x for x in my_data.values() if not is_nan(x))
>
> The answer '2.0' is plainly right here, and there's no reason we shouldn't provide it.
Wait, are you arguing that we should just offer a generic is_nan function (as a builtin?), instead of adding an on_nan handler parameter to median and friends? If so, apologies; I guess I was disagreeing with someone else’s very different position above, not yours.

This helps users who are sophisticated enough to intentionally use NaNs for missing data, and to know they want to filter them out of a median, and to know how to do that with a genexpr, and to know when you can and can’t safely ignore the docs on which inputs are supported by statistics, but not sophisticated enough to write an isnan test for their mix of two types. But do any such users exist?

Writing a NaN test that works for your values even though you intentionally mixed two types isn’t the hard part. It’s knowing what to do with that NaN test. Which still isn’t all that hard, but it’s something a lot of novices haven’t learned yet. I think there are a lot more users of the statistics module who would be helped by raise and ignore options on median than by just giving them the simple tools to build that behavior themselves and hoping they figure out that they need to.
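For what it’s worth, here’s roughly what those options might look like, as a wrapper rather than a real patch. The parameter name on_nan, its option strings, and the default are all made up for this sketch; it isn’t a settled proposal:

```python
import math
import statistics

def median_high(data, on_nan='raise'):
    """Sketch of a median_high with NaN handling: on_nan='raise'
    rejects NaN inputs, on_nan='ignore' filters them out first.
    (Name, options, and default are illustrative only.)"""
    def _nan(x):
        # Simplified NaN test for this sketch: floats and Decimals.
        try:
            return x.is_nan()
        except AttributeError:
            return isinstance(x, float) and math.isnan(x)
    data = list(data)
    if any(_nan(x) for x in data):
        if on_nan == 'raise':
            raise statistics.StatisticsError('NaN in data')
        data = [x for x in data if not _nan(x)]
    return statistics.median_high(data)
```

With your example data, median_high(values, on_nan='ignore') gives 2.0, and the default refuses to silently compute a median over missing data.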