[Python-ideas] NAN handling in the statistics module
steve at pearwood.info
Mon Jan 7 11:34:24 EST 2019
On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:
> On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano <steve at pearwood.info> wrote:
> > > I'll provide a suggested batch on the bug. It will simply be a wholly
> > > different implementation of median and friends.
> > I ask for a documentation patch and you start talking about a whole new
> > implementation. Huh.
> > A new implementation with precisely the same behaviour is a waste of
> > time, so I presume you're planning to change the behaviour. How about if
> > you start off by explaining what the new semantics are?
> I think it would be counter-productive to document the bug (as something
> other than a bug).
Its not a bug in median(), because median requires the data implement a
total order. Although that isn't explicitly documented, it is common
sense: if the data cannot be sorted into smallest-to-largest order, how
can you decide which value is in the middle?
What is explicitly documented is that median requires numeric data, and
NANs aren't numbers. So the only bug here is the caller's failure to
filter out NANs. If you pass it garbage data, you get garbage results.
Nevertheless, it is a perfectly reasonable thing to want to use data
which may or may not contain NANs, and I want to enhance the statistics
module to make it easier for the caller to handle NANs in whichever way
they see fit. This is a new feature, not a bug fix.
> Picking what is a completely arbitrary element in face
> of a non-total order can never be "correct" behavior, and is never worth
> preserving for compatibility.
If you truly believe that, then you should also believe that both
list.sort() and the bisect module are buggy, for precisely the same
Perhaps you ought to raise a couple of bug reports, and see if you can
get Tim and Raymond to agree that sorting and bisect should do something
other than what they already do in the face of data that doesn't define
a total order.
> I think the use of statistics.median against
> partially ordered elements is simply rare enough that no one tripped
> against it, or at least no one reported it before.
I'm sure it is rare. Nevertheless, I still want to make it easier for
people to deal with this case.
> Notice that the code itself pretty much recognizes the bug in this comment:
> # FIXME: investigate ways to calculate medians without sorting? Quickselect?
I doubt Quickselect will be immune to the problem of NANs. It too relies
on comparisons, and while I don't know for sure that it requires a total
order, I'd be surprised if it doesn't. Quickselect is basically a
variant of Quicksort that only partially sorts the data.
> So it seems like the original author knew the implementation was wrong.
That's not why I put that comment in. Sorting is O(N log N) on average,
and Quickselect can be O(N) on average. In principle, Quickselect or a
similar selection algorithm could be faster than sorting.
> It's not hard to manually check for NaNs and
> generate those in your own code.
That is correct, but by that logic, we don't need to support *any* form
of NAN handling at all. It is easy (if inefficent) for the caller to
pre-filter their data. I want to make it easier and more convenient and
avoid having to iterate over the data twice if it isn't necessary.
More information about the Python-ideas