[Python-ideas] Re: NAN handling in statistics functions

Aug. 27, 2021

On Sat, 2021-08-28 at 11:49 +1000, Steven D'Aprano wrote:
...
On Tue, Aug 24, 2021 at 01:53:51PM +1000, Steven D'Aprano wrote:
...
I've spoken to users of other statistics packages and languages, such as
R, and I cannot find any consensus on what the "right" behaviour should
be for NANs except "not that!".

So I propose that statistics functions gain a keyword only parameter to
specify the desired behaviour when a NAN is found:

Thanks everyone for the feedback, does anyone have a strong opinion on
what to name this parameter?
In R, the usual parameter name is "na.rm" to remove them:
https://stat.ethz.ch/R-manual/R-patched/library/base/html/mean.html
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/sd.html
Matlab optionally takes one of two strings:
https://au.mathworks.com/help/matlab/ref/mean.html?#d123e832786
It doesn't seem to have named parameters.
I'm leaning towards "nans=..." with an enum.
SciPy should probably also be a data point; it uses:

    nan_policy : {'propagate', 'raise', 'omit'}, optional

statsmodels seems to use:

   missing : str
       Available options are ‘none’, ‘drop’, and ‘raise’

pandas has skipna=bool.  Since pandas and statsmodels hint at "missing
values", there is likely a good reason to not worry about them.

I guess it was already noted that both statsmodels and SciPy default to
propagating. [1]
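
To make the three SciPy-style options concrete, here is a rough sketch of
how a keyword-only parameter with an enum could behave.  The names
`NanPolicy`, `mean` and `nans` are placeholders for illustration only;
nothing here is an existing or agreed-upon `statistics` API.

```python
import math
from enum import Enum
from statistics import fmean


class NanPolicy(Enum):
    """Hypothetical enum; the member names mirror SciPy's nan_policy strings."""
    PROPAGATE = "propagate"
    RAISE = "raise"
    OMIT = "omit"


def mean(data, *, nans=NanPolicy.PROPAGATE):
    """Illustrative wrapper only; the `nans` keyword is an assumption."""
    values = [float(x) for x in data]
    if any(math.isnan(x) for x in values):
        if nans is NanPolicy.RAISE:
            raise ValueError("NaN found in data")
        if nans is NanPolicy.OMIT:
            # Drop the NaNs and average whatever is left.
            values = [x for x in values if not math.isnan(x)]
        else:
            # PROPAGATE: a NaN in the input gives a NaN result.
            return math.nan
    return fmean(values)
```

With this sketch, `mean([1.0, math.nan, 3.0])` returns NaN, while passing
`nans=NanPolicy.OMIT` averages only the remaining values.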

Cheers,

Sebastian

[1] In general Python is more careful since it sometimes raises errors.
But this is almost only(?) when creating a non-finite value from finite
values, not when propagating non-finite values (which normally do not
give IEEE warnings, although creating NaN from inf with `inf - inf` does).
In that sense it is different, but probably not much.
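
A small stdlib-only check of that distinction, as I understand CPython's
float and `math` behaviour.  The helper `outcome` is just a name made up
for this example.

```python
import math


def outcome(func):
    """Return func()'s result, or the name of the exception it raises."""
    try:
        return func()
    except (ArithmeticError, ValueError) as exc:
        return type(exc).__name__


# Creating a non-finite value from finite inputs: these raise.
print(outcome(lambda: 1.0 / 0.0))          # ZeroDivisionError
print(outcome(lambda: math.exp(1000.0)))   # OverflowError
print(outcome(lambda: math.sqrt(-1.0)))    # ValueError

# ...but only sometimes: float multiplication overflows quietly to inf.
print(outcome(lambda: 1e308 * 10.0))       # inf

# Propagating non-finite values: no error is raised.
print(outcome(lambda: math.nan + 1.0))         # nan
print(outcome(lambda: math.sqrt(math.nan)))    # nan
print(outcome(lambda: math.exp(math.inf)))     # inf

# Creating NaN from inf is an IEEE invalid operation, yet Python stays quiet.
print(outcome(lambda: math.inf - math.inf))    # nan
```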

Sebastian Berg