[Python-ideas] NAN handling in the statistics module
steve at pearwood.info
Mon Jan 7 02:05:26 EST 2019
On Sun, Jan 06, 2019 at 07:40:32PM -0800, Stephan Hoyer wrote:
> On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano <steve at pearwood.info> wrote:
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
> > IGNORE: quietly ignore all NANs
> > FAIL: raise an exception if any NAN is seen in the data
> > PASS: pass NANs through unchanged (the default)
> > RETURN: return a NAN if any NAN is seen in the data
> > WARN: ignore all NANs but raise a warning if one is seen
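
For illustration only, here is a rough sketch of how the five policies quoted above could behave when wrapped around mean(). The names NanPolicy and mean_with_policy are hypothetical, the choice of ValueError and RuntimeWarning is an assumption, and only float NANs are detected. It is not the proposed implementation.

```python
import enum
import math
import statistics
import warnings


class NanPolicy(enum.Enum):
    # Hypothetical names mirroring the policies listed above.
    IGNORE = "ignore"
    FAIL = "fail"
    PASS = "pass"
    RETURN = "return"
    WARN = "warn"


def _is_nan(x):
    # Only float NANs are detected in this sketch; Decimal NANs are not handled.
    return isinstance(x, float) and math.isnan(x)


def mean_with_policy(data, *, nan_policy=NanPolicy.PASS):
    """Hypothetical wrapper around statistics.mean(); not a real API."""
    data = list(data)
    if nan_policy is NanPolicy.PASS:
        # No checking at all: whatever mean() does with NANs happens.
        return statistics.mean(data)
    if any(_is_nan(x) for x in data):
        if nan_policy is NanPolicy.FAIL:
            raise ValueError("NAN found in data")  # exception type is an assumption
        if nan_policy is NanPolicy.RETURN:
            return math.nan
        if nan_policy is NanPolicy.WARN:
            warnings.warn("NAN found in data; ignoring it", RuntimeWarning)
        # IGNORE and WARN both drop the NANs before computing.
        data = [x for x in data if not _is_nan(x)]
    return statistics.mean(data)
```

Note that every policy except PASS pays for a full scan of the data, which is where the performance question comes from.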
> I don't think PASS should be the default behavior, and I'm not sure it
> would be productive to actually implement all of these options.
I'm not wedded to the idea that the default ought to be the current
behaviour. If there is a strong argument for one of the others, I'm
> For reference, NumPy and pandas (the two most popular packages for data
> analytics in Python) support two of these modes:
> - RETURN (numpy.mean() and skipna=False for pandas)
> - IGNORE (numpy.nanmean() and skipna=True for pandas)
> RETURN is the default behavior for NumPy; IGNORE is the default for pandas.
> I'm pretty sure RETURN is the right default behavior for Python's standard
> library and anything else should be considered a bug. It safely propagates
> NaNs, along the lines of IEEE float behavior.
How would you answer those who say that the right behaviour is not to
propagate unwanted NANs, but to fail fast and raise an exception?
> I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which
> are supported by NumPy or pandas:
> - PASS is a license to return silently incorrect results, in return for
> very marginal performance benefits.
By my (very rough) preliminary testing, the cost of checking for NANs
doubles the cost of calculating the median, and increases the cost of
calculating the mean() by 25%.
I'm not trying to compete with statistics libraries written in C for
speed, but that doesn't mean I don't care about performance at all. The
statistics library is already slower than I like and I don't want to
slow it down further for the common case (numeric data with no NANs) for
the sake of the uncommon case (data with NANs).
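
A measurement along these lines can be sketched with timeit. This is not the test that produced the figures above: median_checked and time_both are hypothetical helpers, the data set is arbitrary, and the ratio will vary with the Python version, the machine, and the data.

```python
import math
import statistics
import timeit


def median_checked(data):
    """Hypothetical variant that scans for float NANs before delegating."""
    data = list(data)
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        raise ValueError("NAN found in data")
    return statistics.median(data)


def time_both(n=1000, number=200):
    """Return (plain, checked) timings in seconds for NAN-free data."""
    data = [float(i) for i in range(n)]
    plain = timeit.timeit(lambda: statistics.median(data), number=number)
    checked = timeit.timeit(lambda: median_checked(data), number=number)
    return plain, checked
```

Calling time_both() and dividing the second figure by the first gives a rough idea of the overhead for the common case of data with no NANs.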
But I hear you about the "return silently incorrect results" part.
Fortunately, I think that only applies to sort-based functions like
median(). mean() etc ought to propagate NANs with any reasonable
implementation, but I'm reluctant to make that a guarantee in case I
come up with some unreasonable implementation :-)
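
The difference can be seen in a small experiment. Every comparison against a NAN is false, so sorting cannot place it consistently, and the median ends up depending on where the NAN sat in the input. The behaviour shown is what I would expect from current CPython, not something the documentation guarantees.

```python
import math
import statistics

nan = math.nan

# The same three values in two different orders.
a = statistics.median([nan, 1.0, 2.0])
b = statistics.median([1.0, 2.0, nan])

# mean() adds the values up, so the NAN reaches the result.
m = statistics.mean([1.0, nan, 2.0])

print("median, NAN first:", a)
print("median, NAN last: ", b)
print("mean:             ", m)
```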
> This seems at odds with the intended
> focus of the statistics module on correctness over speed. Returning
> incorrect statistics should not be considered a feature that needs to be
It is only incorrect because the data violates the documented
requirement that it be *numeric data*, and the undocumented requirement
that the numbers have a total order. (So complex numbers are out.) I
admit that the docs could be improved, but there are no guarantees made
This doesn't mean I don't want to improve the situation! Far from it,
hence this discussion.
> - FAIL would make sense if statistics functions could introduce *new* NaN
> values. But as far as I can tell, statistics functions already raise
> StatisticsError in these cases (e.g., if zero data points are provided). If
> users are concerned about accidentally propagating NaNs, they should be
> encouraged to check for NaNs at the entry points of their code.
As far as I can tell, there are two kinds of people when it comes to
NANs: those who think that signalling NANs are a waste of time and NANs
should always propagate, and those who hate NANs and wish that they
would always signal (raise an exception).
I'm not going to get into an argument about who is right or who is
> - WARN is even less useful than FAIL. Seriously, who likes warnings?
> NumPy uses this approach for array operations that produce NaNs (e.g., when
> dividing by zero), because *some* but not all results may be valid. But
> statistics functions return scalars.
> I'm not even entirely sure it makes sense to add the IGNORE option, or at
> least to add it only for NaN. None is also a reasonable sentinel for a
> missing value in Python, and user defined types (e.g., pandas.NaT) also
> fall in this category. It seems a little strange to single NaN out in
I am considering adding support for a dedicated "missing" value, whether
it is None or a special sentinel. But one thing at a time. Ignoring NANs
is moderately common in other statistics libraries, and although I
personally feel that NANs shouldn't be used for missing values, I know
many people do so.
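
As a sketch of what treating both NANs and a dedicated sentinel as "missing" might look like, the hypothetical helper drop_missing below filters the data before it reaches the statistics functions. The name, the default sentinel of None, and the float-only NAN check are all assumptions made for illustration.

```python
import math
import statistics


def drop_missing(data, missing=(None,)):
    """Hypothetical helper: yield values that are neither a sentinel nor a NAN."""
    for x in data:
        if any(x is sentinel for sentinel in missing):
            continue  # explicit "missing" marker such as None
        if isinstance(x, float) and math.isnan(x):
            continue  # float NAN used as a missing value
        yield x


readings = [2.5, None, math.nan, 3.5]
print(statistics.mean(list(drop_missing(readings))))
```

Sentinels are compared by identity rather than equality, so a value that merely compares equal to a sentinel is not dropped by accident.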