Would these policies be named as strings or with an enum? Following Pandas, we'd probably support both. I won't bikeshed the names, but they seem to cover desired behaviors.

On Sun, Jan 6, 2019, 7:28 PM Steven D'Aprano <steve@pearwood.info> wrote:

Bug #33084 reports that the statistics library calculates median and
other stats wrongly if the data contains NANs. Worse, the result depends
on the initial placement of the NAN:

py> from statistics import median
py> NAN = float('nan')
py> median([NAN, 1, 2, 3, 4])
2
py> median([1, 2, 3, 4, NAN])
3

See the bug report for more detail:

https://bugs.python.org/issue33084
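
For anyone wondering why the placement matters, here is a small sketch of what seems to be going on. It assumes that median() sorts the data and takes the middle element for odd-length input, which is how the CPython implementation behaves. The variable names are only for illustration.

```python
import math

NAN = float("nan")

# Every ordering comparison involving a NaN is False, and a NaN is not
# even equal to itself.
print(NAN < 1, NAN > 1, NAN == NAN)

# As a result, sorted() treats both lists as already ordered and leaves
# the NaN wherever it started.
front = sorted([NAN, 1, 2, 3, 4])
back = sorted([1, 2, 3, 4, NAN])

# The "middle" element therefore depends on where the NaN was placed.
print(front[len(front) // 2])
print(back[len(back) // 2])
```

The two middle elements differ only because the NaN shifts the other values by one position.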

The caller can always filter NANs out of their own data, but following
the lead of some other stats packages, I propose a standard way for the
statistics module to do so. I hope this will be uncontroversial (he
says, optimistically...) but just in case, here is some prior art:

(1) Nearly all R stats functions take a "na.rm" argument which defaults
to False; if True, NA and NAN values will be stripped.

(2) The scipy.stats.ttest_ind function takes a "nan_policy" argument
which specifies what to do if a NAN is seen in the data.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

(3) At least some Matlab functions, such as mean(), take an optional
flag that determines whether to ignore NANs or include them.

https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag

I propose adding a "nan_policy" keyword-only parameter to the relevant
statistics functions (mean, median, variance etc), and defining the
following policies:

IGNORE: quietly ignore all NANs
FAIL: raise an exception if any NAN is seen in the data
PASS: pass NANs through unchanged (the default)
RETURN: return a NAN if any NAN is seen in the data
WARN: ignore all NANs but raise a warning if one is seen

PASS is equivalent to saying that you, the caller, have taken full
responsibility for filtering out NANs and there's no need for the
function to slow down processing by doing so again. Either that, or you
want the current implementation-dependent behaviour.

FAIL is equivalent to treating all NANs as "signalling NANs". The
presence of a NAN is an error.

RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a
calculation causes it to return a NAN, allowing NANs to propagate
through multiple calculations.

IGNORE and WARN are the same, except IGNORE is silent and WARN raises a
warning.
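
To make the five policies concrete, here is a rough sketch of how they might behave for median. This is not the statistics module's API: NanPolicy, median_with_policy and the ValueError/RuntimeWarning choices are hypothetical names and assumptions made up for illustration. The enum carries string values so that either spelling is accepted.

```python
import math
import warnings
from enum import Enum
from statistics import median


class NanPolicy(Enum):
    # Hypothetical enum; member names follow the proposal above.
    IGNORE = "ignore"
    FAIL = "fail"
    PASS = "pass"
    RETURN = "return"
    WARN = "warn"


def _is_nan(x):
    # Only float NaNs are detected here; Decimal NaNs are left out
    # to keep the sketch short.
    return isinstance(x, float) and math.isnan(x)


def median_with_policy(data, *, nan_policy=NanPolicy.PASS):
    """Hypothetical wrapper around statistics.median (illustration only)."""
    policy = NanPolicy(nan_policy)  # accepts a member or its string value
    data = list(data)
    if policy is NanPolicy.PASS:
        # Caller takes responsibility: no scan for NaNs at all.
        return median(data)
    clean = [x for x in data if not _is_nan(x)]
    if len(clean) != len(data):
        if policy is NanPolicy.FAIL:
            raise ValueError("NaN found in data")
        if policy is NanPolicy.RETURN:
            return math.nan
        if policy is NanPolicy.WARN:
            warnings.warn("NaN values ignored", RuntimeWarning)
    # IGNORE and WARN both fall through to the NaN-free data.
    return median(clean)
```

With this sketch, median_with_policy([NAN, 1, 2, 3, 4], nan_policy="ignore") and the same call with the NAN moved to the end give the same answer, which is the point of the proposal. Note that PASS is the only branch that skips the scan for NANs, matching the "no need to slow down processing" rationale.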

Questions:

- does anyone have any serious objections to this?

- what do you think of the names for the policies?

- are there any additional policies that you would like to see?
  (if so, please give use-cases)

- are you happy with the default?

Bike-shed away!

--

Steve

_______________________________________________

Python-ideas mailing list

Python-ideas@python.org

https://mail.python.org/mailman/listinfo/python-ideas

Code of Conduct: http://python.org/psf/codeofconduct/