[Python-ideas] NAN handling in the statistics module

Mon Jan 7 11:34:24 EST 2019

On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:
> On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano <steve at pearwood.info> wrote:
> 
> > > I'll provide a suggested batch on the bug.  It will simply be a wholly
> > > different implementation of median and friends.
> >
> > I ask for a documentation patch and you start talking about a whole new
> > implementation. Huh.
> > A new implementation with precisely the same behaviour is a waste of
> > time, so I presume you're planning to change the behaviour. How about if
> > you start off by explaining what the new semantics are?
> >
> 
> I think it would be counter-productive to document the bug (as something
> other than a bug).

Its not a bug in median(), because median requires the data implement a 
total order. Although that isn't explicitly documented, it is common 
sense: if the data cannot be sorted into smallest-to-largest order, how 
can you decide which value is in the middle?

What is explicitly documented is that median requires numeric data, and 
NANs aren't numbers. So the only bug here is the caller's failure to 
filter out NANs. If you pass it garbage data, you get garbage results.

Nevertheless, it is a perfectly reasonable thing to want to use data 
which may or may not contain NANs, and I want to enhance the statistics 
module to make it easier for the caller to handle NANs in whichever way 
they see fit. This is a new feature, not a bug fix.

> Picking what is a completely arbitrary element in face
> of a non-total order can never be "correct" behavior, and is never worth
> preserving for compatibility.

If you truly believe that, then you should also believe that both 
list.sort() and the bisect module are buggy, for precisely the same 
reason.

Perhaps you ought to raise a couple of bug reports, and see if you can 
get Tim and Raymond to agree that sorting and bisect should do something 
other than what they already do in the face of data that doesn't define 
a total order.

> I think the use of statistics.median against
> partially ordered elements is simply rare enough that no one tripped
> against it, or at least no one reported it before.

I'm sure it is rare. Nevertheless, I still want to make it easier for 
people to deal with this case.

> Notice that the code itself pretty much recognizes the bug in this comment:
> 
> # FIXME: investigate ways to calculate medians without sorting? Quickselect?

I doubt Quickselect will be immune to the problem of NANs. It too relies 
on comparisons, and while I don't know for sure that it requires a total 
order, I'd be surprised if it doesn't. Quickselect is basically a 
variant of Quicksort that only partially sorts the data.

> So it seems like the original author knew the implementation was wrong.

That's not why I put that comment in. Sorting is O(N log N) on average, 
and Quickselect can be O(N) on average. In principle, Quickselect or a 
similar selection algorithm could be faster than sorting.

[...]
>  It's not hard to manually check for NaNs and
> generate those in your own code.

That is correct, but by that logic, we don't need to support *any* form 
of NAN handling at all. It is easy (if inefficent) for the caller to 
pre-filter their data. I want to make it easier and more convenient and 
avoid having to iterate over the data twice if it isn't necessary.

-- 
Steve