[Python-ideas] NAN handling in the statistics module

Mon Jan 7 12:19:33 EST 2019

On Mon, Jan 7, 2019, 11:38 AM Steven D'Aprano <steve at pearwood.info wrote:

> Its not a bug in median(), because median requires the data implement a
> total order. Although that isn't explicitly documented, it is common sense:
> if the data cannot be sorted into smallest-to-largest order, how can you
> decide which value is in the middle?
>

I can see reason that median per-se requires a total order.  Yes, the
implementation chosen (and many reasonable and obvious implementations)
make that assumption.  But here is a perfectly reasonable definition of
median:

* A median is an element of a collection such that 1/2 of all elements of
the collection are less than it.

Depending on how you interpret median, this element might also not be in
the original collection, but be some newly generated value that has that
property.  E.g. statistics.median([1,2,3,4]) == 2.5.

Under a partial ordering, a median may not be unique.  Even under a total
ordering this is true if some subset of elements form an equivalence
class.  But under partial ordering, the non-uniqueness can get much weirder.

What is explicitly documented is that median requires numeric data, and
> NANs aren't numbers. So the only bug here is the caller's failure to
> filter out NANs. If you pass it garbage data, you get garbage results.
>

OK, then we should either raise an exception or propagate the NaN if that
is the intended meaning of the function.  And obviously document that such
is the assumption.  NaN's *are* explicitly in the floating-point domain, so
it's fuzzy whether they are numeric or not, notwithstanding the name.

I'm very happy to push NaN-filtering to users (as NumPy does, although it
provides alternate functions for many reductions that incorporate this...
the basic ones always propagate NaNs though).

> Nevertheless, it is a perfectly reasonable thing to want to use data
> which may or may not contain NANs, and I want to enhance the statistics
> module to make it easier for the caller to handle NANs in whichever way
> they see fit. This is a new feature, not a bug fix.
>

I disagree about bug vs. feature.  The old behavior is simply and
unambiguously wrong, but was not previously noticed.  Obviously, the bug
does not affect most uses, which is why it was not noticed.

> If you truly believe that, then you should also believe that both
> list.sort() and the bisect module are buggy, for precisely the same
> reason.
>

I cannot perceive any close connection between the correct behavior of
statistics.mean() and that of list.sort() or bisect.  I know the concrete
implementation of the former uses the latter, but the answers for what is
RIGHT feel completely independent to me.

I doubt Quickselect will be immune to the problem of NANs. It too relies
> on comparisons, and while I don't know for sure that it requires a total
> order, I'd be surprised if it doesn't. Quickselect is basically a
> variant of Quicksort that only partially sorts the data.
>

Yes, I was thinking of trying to tweak Quickselect to handle NaNs during
the process.  I.e. probably terminate and propagate the NaN early, as soon
as one is encountered.  That might save much of the work if a NaN is
encountered early and most comparisons and moves can be avoided.  Of
course, I'm sure there is a worst case where almost all the work is done
before a NaN check is performed in some constructed example.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20190107/fef6e0b0/attachment.html>