[Python-ideas] NAN handling in the statistics module

Sun Jan 6 22:40:32 EST 2019

On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano <steve at pearwood.info> wrote:

> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
>     IGNORE:  quietly ignore all NANs
>     FAIL:  raise an exception if any NAN is seen in the data
>     PASS:  pass NANs through unchanged (the default)
>     RETURN:  return a NAN if any NAN is seen in the data
>     WARN:  ignore all NANs but raise a warning if one is seen
>

I don't think PASS should be the default behavior, and I'm not sure it
would be productive to actually implement all of these options.

For reference, NumPy and pandas (the two most popular packages for data
analytics in Python) support two of these modes:
- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)

RETURN is the default behavior for NumPy; IGNORE is the default for pandas.

I'm pretty sure RETURN is the right default behavior for Python's standard
library and anything else should be considered a bug. It safely propagates
NaNs, along the lines of IEEE float behavior.
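
To make the two modes concrete, here is a minimal stdlib-only sketch of
RETURN and IGNORE for the mean. The helper names mean_return and
mean_ignore are hypothetical (they are not part of the statistics module
or of NumPy); they only mirror the semantics of numpy.mean() and
numpy.nanmean() mentioned above.

```python
import math
import statistics


def _is_nan(x):
    """True only for float NaN; other types are never treated as NaN here."""
    return isinstance(x, float) and math.isnan(x)


def mean_return(data):
    """RETURN: the result is NaN if any NaN is present in the data."""
    values = list(data)
    if any(_is_nan(x) for x in values):
        return math.nan
    return statistics.mean(values)


def mean_ignore(data):
    """IGNORE: drop NaNs, then average whatever is left."""
    values = [x for x in data if not _is_nan(x)]
    # statistics.mean raises StatisticsError if nothing is left.
    return statistics.mean(values)


sample = [1.0, math.nan, 3.0]
print(mean_return(sample))  # nan
print(mean_ignore(sample))  # 2.0
```

With RETURN the caller can never mistake a contaminated result for a
clean one, which is the property I care about for a default.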

I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which
are supported by NumPy or pandas:
- PASS is a license to return silently incorrect results, in return for
very marginal performance benefits. This seems at odds with the intended
focus of the statistics module on correctness over speed. Returning
incorrect statistics should not be considered a feature that needs to be
maintained.
- FAIL would make sense if statistics functions could introduce *new* NaN
values. But as far as I can tell, statistics functions already raise
StatisticsError in these cases (e.g., if zero data points are provided). If
users are concerned about accidentally propagating NaNs, they should be
encouraged to check for NaNs at the entry points of their code.
- WARN is even less useful than FAIL. Seriously, who likes warnings? NumPy
uses this approach for array operations that produce NaNs (e.g., when
dividing by zero), because *some* but not all results may be valid. But
statistics functions return scalars.
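
The PASS concern and the entry-point advice can both be illustrated with
the standard library. Every ordering comparison against a NaN is false, so
a statistic that relies on sorting (median, for example) can depend on
where the NaN happens to sit in the input. The helper require_no_nans
below is a hypothetical name, shown only as one way a caller might reject
NaNs up front; it is not part of the statistics module.

```python
import math

nan = math.nan

# All ordering comparisons involving NaN are false, so NaN has no
# well-defined position in a sorted sequence.
print(nan < 1.0, nan > 1.0, nan == nan)  # False False False


def require_no_nans(data):
    """Hypothetical entry-point check: reject NaNs before any statistics run."""
    values = list(data)
    for index, x in enumerate(values):
        if isinstance(x, float) and math.isnan(x):
            raise ValueError(f"NaN found at position {index}")
    return values


clean = require_no_nans([1.0, 2.0, 4.0])  # returns the data as a list
try:
    require_no_nans([1.0, nan, 4.0])
except ValueError as exc:
    print(exc)  # NaN found at position 1
```

A check like this lives in user code, at the boundary where data enters,
rather than as a policy flag on every statistics function.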

I'm not even entirely sure it makes sense to add the IGNORE option, or at
least to add it only for NaN. None is also a reasonable sentinel for a
missing value in Python, and user defined types (e.g., pandas.NaT) also
fall in this category. It seems a little strange to single NaN out in
particular.
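
If IGNORE were generalized beyond NaN, one option would be to let the
caller say what counts as missing. The sketch below is purely
illustrative: mean_skipping and is_missing are hypothetical names, not a
proposed API, and the default predicate is assumed to treat only None and
float NaN as missing.

```python
import math
import statistics


def is_missing(x):
    """Default predicate (an assumption): None and float NaN are missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def mean_skipping(data, missing=is_missing):
    """Average the values for which missing(value) is false."""
    return statistics.mean([x for x in data if not missing(x)])


print(mean_skipping([1.0, None, math.nan, 3.0]))  # 2.0

# A caller with its own sentinel can supply a different predicate.
SENTINEL = object()
print(mean_skipping([2.0, SENTINEL, 4.0], missing=lambda x: x is SENTINEL))  # 3.0
```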