[Python-ideas] NAN handling in the statistics module

Wed Jan 9 20:21:56 EST 2019

On Wed, 9 Jan 2019 at 05:20, Steven D'Aprano <steve at pearwood.info> wrote:
>
> On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote:
>
> [...]
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
>
>
> I asked some heavy users of statistics software (not just Python users)
> what behaviour they would find useful, and as I feared, I got no
> conclusive answer. So far, the answers seem to be almost evenly split
> into four camps:
>
> - don't do anything, it is the caller's responsibility to filter NANs;
>
> - raise an immediate error;
>
> - return a NAN;
>
> - treat them as missing data.

I would prefer to raise an exception on nan. It's much easier to
debug an exception than a nan.
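
To illustrate the difference, here is a minimal sketch of a checking
wrapper. The name raise_on_nan is hypothetical, not part of the
statistics module, and it only recognises float nans:

```python
import math
from statistics import mean

def raise_on_nan(data):
    """Yield items unchanged; raise ValueError at the first float nan.

    Hypothetical helper for illustration, not part of the statistics module.
    """
    for i, x in enumerate(data):
        if isinstance(x, float) and math.isnan(x):
            # The position in the error message is what makes this
            # easier to debug than a nan that silently propagates.
            raise ValueError(f"nan found at position {i}")
        yield x

clean = mean(raise_on_nan([1.0, 2.0, 3.0]))

try:
    mean(raise_on_nan([1.0, float("nan"), 3.0]))
except ValueError as exc:
    message = str(exc)
```

With clean data the wrapper is transparent; with a nan the caller gets
an error pointing at the offending item instead of a nan result.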

Take a look at the Julia docs for their statistics module:
https://docs.julialang.org/en/v1/stdlib/Statistics/index.html

In Julia they have defined an explicit "missing" value. With that you
can explicitly distinguish between a calculation error and missing
data. The obvious Python equivalent would be None.

> On consideration of all the views expressed, thank you to everyone who
> commented, I'm now inclined to default to returning a NAN (which happens
> to be the current behaviour of mean etc, but not median except by
> accident) even if it impacts performance.

Whichever way you go with this, it might make sense to provide helper
functions for users to deal with nans, e.g.:

xbar = mean(without_nans(data))
xbar = mode(replace_nans_with_None(data))
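
As a rough sketch of what such helpers could look like (these are not
existing functions in the statistics module, and the nan test here is
an assumption: it relies on math.isnan and treats anything math.isnan
rejects with TypeError as "not a nan"):

```python
import math
from statistics import mean

def _is_nan(x):
    """Return True if x is a nan according to math.isnan, else False."""
    try:
        return math.isnan(x)
    except TypeError:
        # Non-numeric values such as None or strings are not nans.
        return False

def without_nans(data):
    """Yield the items of data, skipping nans (hypothetical helper)."""
    return (x for x in data if not _is_nan(x))

def replace_nans_with_None(data):
    """Yield the items of data with each nan replaced by None (hypothetical helper)."""
    return (None if _is_nan(x) else x for x in data)

data = [1.0, float("nan"), 3.0]
xbar = mean(without_nans(data))
marked = list(replace_nans_with_None(data))
```

Both helpers return lazy iterators, so they can be passed straight to
functions like mean that accept any iterable.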

--
Oscar