[Python-ideas] NAN handling in the statistics module

David Mertz mertz at gnosis.cx
Sun Jan 6 19:46:03 EST 2019


Would these policies be named as strings or with an enum? Following Pandas,
we'd probably support both. I won't bikeshed the names, but they seem to
cover desired behaviors.
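
To make the strings-or-enum question concrete, here is a minimal sketch of one way to accept both. The names `NanPolicy` and `coerce_policy` are hypothetical and appear nowhere in the proposal; the member names are taken from the quoted message below.

```python
from enum import Enum


class NanPolicy(Enum):
    """Hypothetical enum; member names follow the quoted proposal."""
    IGNORE = "ignore"
    FAIL = "fail"
    PASS = "pass"
    RETURN = "return"
    WARN = "warn"


def coerce_policy(policy):
    """Accept a NanPolicy member or its string name (case-insensitive)."""
    if isinstance(policy, NanPolicy):
        return policy
    if isinstance(policy, str):
        try:
            # Enum lookup by value: NanPolicy("ignore") is NanPolicy.IGNORE
            return NanPolicy(policy.lower())
        except ValueError:
            pass
    raise ValueError(f"unknown nan_policy: {policy!r}")
```

A function taking `nan_policy` would call `coerce_policy` once on entry, so `"ignore"`, `"IGNORE"` and `NanPolicy.IGNORE` would all select the same behavior.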

On Sun, Jan 6, 2019, 7:28 PM Steven D'Aprano <steve at pearwood.info> wrote:

> Bug #33084 reports that the statistics library calculates median and
> other stats wrongly if the data contains NANs. Worse, the result depends
> on the initial placement of the NAN:
>
> py> from statistics import median
> py> NAN = float('nan')
> py> median([NAN, 1, 2, 3, 4])
> 2
> py> median([1, 2, 3, 4, NAN])
> 3
>
> See the bug report for more detail:
>
> https://bugs.python.org/issue33084
>
>
> The caller can always filter NANs out of their own data, but following
> the lead of some other stats packages, I propose a standard way for the
> statistics module to do so. I hope this will be uncontroversial (he
> says, optimistically...) but just in case, here is some prior art:
>
> (1) Nearly all R stats functions take a "na.rm" argument which defaults
> to False; if True, NA and NAN values will be stripped.
>
> (2) The scipy.stats.ttest_ind function takes a "nan_policy" argument
> which specifies what to do if a NAN is seen in the data.
>
>
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
>
> (3) At least some Matlab functions, such as mean(), take an optional
> flag that determines whether to ignore NANs or include them.
>
> https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
>
>
> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
>     IGNORE:  quietly ignore all NANs
>     FAIL:  raise an exception if any NAN is seen in the data
>     PASS:  pass NANs through unchanged (the default)
>     RETURN:  return a NAN if any NAN is seen in the data
>     WARN:  ignore all NANs but raise a warning if one is seen
>
> PASS is equivalent to saying that you, the caller, have taken full
> responsibility for filtering out NANs and there's no need for the
> function to slow down processing by doing so again. Either that, or you
> want the current implementation-dependent behaviour.
>
> FAIL is equivalent to treating all NANs as "signalling NANs". The
> presence of a NAN is an error.
>
> RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a
> calculation causes it to return a NAN, allowing NANs to propagate
> through multiple calculations.
>
> IGNORE and WARN are the same, except IGNORE is silent and WARN raises a
> warning.
>
> Questions:
>
> - does anyone have any serious objections to this?
>
> - what do you think of the names for the policies?
>
> - are there any additional policies that you would like to see?
>   (if so, please give use-cases)
>
> - are you happy with the default?
>
>
> Bike-shed away!
>
>
>
> --
> Steve
>
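
For illustration only, the five policies described in the quoted proposal could behave roughly as in the sketch below. This is not the statistics module's API: the wrapper name `median_with_policy`, the lowercase string spellings, and the choice of `ValueError` and `RuntimeWarning` are all assumptions made for the example.

```python
import math
import warnings
from statistics import median

NAN = float("nan")
_POLICIES = {"ignore", "fail", "pass", "return", "warn"}


def median_with_policy(data, *, nan_policy="pass"):
    """Hypothetical wrapper showing the proposed nan_policy semantics."""
    if nan_policy not in _POLICIES:
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    data = list(data)
    if nan_policy == "pass":
        # Caller takes responsibility; no scan for NANs is done.
        return median(data)
    clean = [x for x in data if not math.isnan(x)]
    if len(clean) != len(data):  # at least one NAN was present
        if nan_policy == "fail":
            raise ValueError("NAN found in data")
        if nan_policy == "return":
            return NAN
        if nan_policy == "warn":
            warnings.warn("NAN found in data; ignoring it", RuntimeWarning)
    # "ignore" and "warn" both compute the result from the filtered data.
    return median(clean)


# With NANs stripped, the result no longer depends on where the NAN sits.
first = median_with_policy([NAN, 1, 2, 3, 4], nan_policy="ignore")   # 2.5
last = median_with_policy([1, 2, 3, 4, NAN], nan_policy="ignore")    # 2.5
```

Only "pass" skips the scan for NANs, which matches the stated rationale for making it the default: callers who have already cleaned their data pay no extra cost.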