[Python-ideas] NAN handling in the statistics module
oscar.j.benjamin at gmail.com
Wed Jan 9 20:21:56 EST 2019
On Wed, 9 Jan 2019 at 05:20, Steven D'Aprano <steve at pearwood.info> wrote:
> On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote:
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
> I asked some heavy users of statistics software (not just Python users)
> what behaviour they would find useful, and as I feared, I got no
> conclusive answer. So far, the answers seem to be almost evenly split
> into four camps:
> - don't do anything, it is the caller's responsibility to filter NANs;
> - raise an immediate error;
> - return a NAN;
> - treat them as missing data.
I would prefer to raise an exception on nan. It's much easier to
debug an exception than a nan.
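As a rough sketch of what I mean (mean_strict is a made-up name, not
anything in the statistics module, and it only checks floats), the
error can point at the offending position instead of a nan silently
propagating into the result:

```python
import math
from statistics import mean


def mean_strict(data):
    """Hypothetical wrapper: raise on the first NaN instead of returning one."""
    values = list(data)
    for i, x in enumerate(values):
        # Only float NaNs are detected in this sketch (not Decimal NaNs).
        if isinstance(x, float) and math.isnan(x):
            raise ValueError(f"NaN found at index {i}")
    return mean(values)
```

The traceback then tells you where the bad value entered the data,
rather than leaving you to work backwards from a nan result.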
Take a look at the Julia docs for their statistics module:
In Julia they have defined an explicit "missing" value. With that you
can explicitly distinguish between a calculation error and missing
data. The obvious Python equivalent would be None.
> On consideration of all the views expressed, thank you to everyone who
> commented, I'm now inclined to default to returning a NAN (which happens
> to be the current behaviour of mean etc, but not median except by
> accident) even if it impacts performance.
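For anyone who wants to check the current behaviour, a small script
like the one below should do. The sample values are arbitrary. I have
left out the expected output for median because it sorts the data
first, and comparisons against nan are always false, so the answer can
depend on where the nan sits in the input:

```python
import math
from statistics import mean, median

nan = float("nan")

# mean: the nan propagates through the arithmetic.
mean_result = mean([1.0, nan, 3.0])
print(mean_result)

# median: same values, different input order; compare the two results.
print(median([1.0, 2.0, nan, 4.0, 5.0]))
print(median([nan, 1.0, 2.0, 4.0, 5.0]))
```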
Whichever way you go with this it might make sense to provide helper
functions for users to deal with nans e.g.:
xbar = mean(without_nans(data))
xbar = mode(replace_nans_with_None(data))
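Neither helper exists today. A minimal sketch, assuming only float
nans need handling (Decimal nans and other numeric types are ignored
here), might be:

```python
import math
from statistics import mean, mode


def _is_nan(x):
    # Assumption: only float NaNs are considered in this sketch.
    return isinstance(x, float) and math.isnan(x)


def without_nans(data):
    """Hypothetical helper: lazily drop NaN values."""
    return (x for x in data if not _is_nan(x))


def replace_nans_with_None(data):
    """Hypothetical helper: lazily replace NaN values with None."""
    return (None if _is_nan(x) else x for x in data)


data = [1.0, float("nan"), 2.0, 2.0]
xbar = mean(without_nans(data))
most_common = mode(replace_nans_with_None(data))
```

Note that None only works with functions like mode that do no
arithmetic; mean would raise a TypeError on it, so for mean you would
filter instead.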