On Wed, 9 Jan 2019 at 05:20, Steven D'Aprano email@example.com wrote:
On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote:
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
I asked some heavy users of statistics software (not just Python users) what behaviour they would find useful, and as I feared, I got no conclusive answer. So far, the answers seem to be almost evenly split into four camps:
don't do anything, it is the caller's responsibility to filter NANs;
raise an immediate error;
return a NAN;
treat them as missing data.
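
For illustration, the four camps above could map onto a keyword-only parameter roughly as follows. This is only a sketch under assumptions: the wrapper name mean_with_policy and the policy strings ("pass", "raise", "return", "omit") are made up here and are not part of the statistics module, and only float NANs are detected.

```python
import math
import statistics


def _is_nan(x):
    # Assumption: only float NaNs are detected; Decimal NaNs are not handled.
    return isinstance(x, float) and math.isnan(x)


def mean_with_policy(data, *, nan_policy="return"):
    """Hypothetical wrapper around statistics.mean; policy names are illustrative."""
    values = list(data)
    if nan_policy == "pass":
        # Do nothing: filtering NaNs is the caller's responsibility.
        return statistics.mean(values)
    if nan_policy == "raise":
        # Fail immediately if any NaN is present.
        if any(_is_nan(x) for x in values):
            raise ValueError("NaN found in data")
        return statistics.mean(values)
    if nan_policy == "return":
        # Any NaN in the input makes the result NaN.
        if any(_is_nan(x) for x in values):
            return math.nan
        return statistics.mean(values)
    if nan_policy == "omit":
        # Treat NaNs as missing data and drop them before averaging.
        return statistics.mean([x for x in values if not _is_nan(x)])
    raise ValueError(f"unknown nan_policy: {nan_policy!r}")
```

Note that the "raise" and "return" policies both need an extra pass over the data to look for NANs, which is where the performance cost comes from.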
I would prefer to raise an exception on nan. It's much easier to debug an exception than a nan.
Take a look at the Julia docs for their statistics module: https://docs.julialang.org/en/v1/stdlib/Statistics/index.html
In Julia they have defined an explicit "missing" value. With that you can explicitly distinguish between a calculation error and missing data. The obvious Python equivalent would be None.
Having considered all the views expressed (thank you to everyone who commented), I'm now inclined to default to returning a NAN (which happens to be the current behaviour of mean etc, but not median except by accident), even if it impacts performance.
Whichever way you go with this it might make sense to provide helper functions for users to deal with nans e.g.:
    xbar = mean(without_nans(data))
    xbar = mode(replace_nans_with_None(data))
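
A minimal sketch of what such helpers might look like, assuming they are plain generator functions that only recognise float nans. Neither helper exists in the statistics module; the names are just the ones suggested above.

```python
import math
from statistics import mean, mode


def _is_nan(x):
    # Assumption: only float NaNs count; other types pass through untouched.
    return isinstance(x, float) and math.isnan(x)


def without_nans(data):
    """Yield the items of data, skipping float NaNs."""
    return (x for x in data if not _is_nan(x))


def replace_nans_with_None(data):
    """Yield the items of data, with each float NaN replaced by None."""
    return (None if _is_nan(x) else x for x in data)


data = [1.0, float("nan"), 3.0, float("nan")]
xbar = mean(without_nans(data))               # averages only 1.0 and 3.0
common = mode(replace_nans_with_None(data))   # the two nans are counted together as None
```

Replacing nans with None matters for mode because distinct nan objects do not compare equal to each other, so they may otherwise be counted as separate values.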