On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano <steve@pearwood.info> wrote:
> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
>     IGNORE:  quietly ignore all NANs
>     FAIL:  raise an exception if any NAN is seen in the data
>     PASS:  pass NANs through unchanged (the default)
>     RETURN:  return a NAN if any NAN is seen in the data
>     WARN:  ignore all NANs but raise a warning if one is seen

I don't think PASS should be the default behavior, and I'm not sure it would be productive to actually implement all of these options.

For reference, NumPy and pandas (the two most popular packages for data analytics in Python) support two of these modes:
- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)

RETURN is the default behavior for NumPy; IGNORE is the default for pandas.

I'm pretty sure RETURN is the right default behavior for Python's standard library and anything else should be considered a bug. It safely propagates NaNs, along the lines of IEEE float behavior.
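
To make RETURN concrete, here is a minimal sketch using only the standard library. The name `median_return_nan` is hypothetical (nothing like it exists in the statistics module today), and it only checks for float NaNs; a real implementation would also have to think about Decimal and other numeric types.

```python
import math
import statistics


def median_return_nan(data):
    """Hypothetical RETURN-policy median: propagate NaN instead of sorting it."""
    values = list(data)
    # Assumption: only float NaNs are detected in this sketch.
    if any(isinstance(x, float) and math.isnan(x) for x in values):
        return math.nan
    return statistics.median(values)


nan = float("nan")

# Without the check, the answer can depend on where the NaN sits in the
# input, because sorting relies on comparisons that are always False for NaN.
print(statistics.median([nan, 1.0, 2.0]), statistics.median([1.0, 2.0, nan]))

# With the check, the NaN is propagated regardless of its position.
print(median_return_nan([nan, 1.0, 2.0]), median_return_nan([1.0, 2.0, nan]))
```

The point is just that the result no longer depends on input order once NaN is propagated explicitly.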

I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which are supported by NumPy or pandas:
- PASS is a license to return silently incorrect results, in return for very marginal performance benefits. This seems at odds with the intended focus of the statistics module on correctness over speed. Returning incorrect statistics should not be considered a feature that needs to be maintained.
- FAIL would make sense if statistics functions could introduce *new* NaN values. But as far as I can tell, statistics functions already raise StatisticsError in these cases (e.g., if no data points are provided). If users are concerned about accidentally propagating NaNs, they should be encouraged to check for NaNs at the entry points of their code.
- WARN is even less useful than FAIL. Seriously, who likes warnings? NumPy uses this approach for array operations that produce NaNs (e.g., when dividing by zero), because *some* but not all results may be valid. But statistics functions return scalars.
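
By "check for NaNs at the entry points" I mean something roughly like the following sketch. `require_no_nans` is a made-up helper name, and again it only looks at floats:

```python
import math


def require_no_nans(data):
    """Hypothetical entry-point check: raise if any float NaN is present."""
    values = list(data)
    for i, x in enumerate(values):
        if isinstance(x, float) and math.isnan(x):
            raise ValueError(f"NaN found at position {i}")
    return values


# Validate once where the data enters the program, then call the
# statistics functions on the cleaned list without any special policy.
clean = require_no_nans([1.0, 2.0, 3.0])
```

This gives users the FAIL behavior without the statistics functions needing to know about it.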

I'm not even entirely sure it makes sense to add the IGNORE option, or at least to add it only for NaN. None is also a reasonable sentinel for a missing value in Python, and user-defined types (e.g., pandas.NaT) also fall in this category. It seems a little strange to single NaN out in particular.
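
As a rough illustration of why NaN-only IGNORE feels too narrow, users can already filter whatever they consider "missing" before calling into statistics. Both `is_missing` and `mean_ignoring_missing` below are hypothetical names for the sake of the example, and the choice of None plus float NaN as sentinels is an assumption:

```python
import math
import statistics


def is_missing(x):
    """Hypothetical predicate: treat None and float NaN as missing values."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def mean_ignoring_missing(data, missing=is_missing):
    """Mean of the values for which the `missing` predicate is false."""
    # statistics.mean raises StatisticsError if nothing is left after filtering.
    return statistics.mean(x for x in data if not missing(x))


print(mean_ignoring_missing([1.0, None, float("nan"), 3.0]))
```

A different sentinel just means passing a different predicate, which is why baking in NaN specifically seems arbitrary.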