On Sun, Jan 06, 2019 at 07:40:32PM -0800, Stephan Hoyer wrote:
On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano firstname.lastname@example.org wrote:
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
- IGNORE: quietly ignore all NANs
- FAIL: raise an exception if any NAN is seen in the data
- PASS: pass NANs through unchanged (the default)
- RETURN: return a NAN if any NAN is seen in the data
- WARN: ignore all NANs but raise a warning if one is seen
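To make the five proposed policies concrete, here is a minimal sketch of how they might behave for mean(). Nothing in it exists in the statistics module today: the NanPolicy enum, the mean_with_policy() wrapper, and the choice of ValueError and RuntimeWarning are all assumptions made for illustration, and only float NANs are detected.

```python
import enum
import math
import statistics
import warnings


class NanPolicy(enum.Enum):
    """Hypothetical policy names, taken from the proposal above."""
    IGNORE = "ignore"
    FAIL = "fail"
    PASS = "pass"
    RETURN = "return"
    WARN = "warn"


def _is_nan(x):
    # Only float NANs are recognised in this sketch (no Decimal support).
    return isinstance(x, float) and math.isnan(x)


def mean_with_policy(data, *, nan_policy=NanPolicy.PASS):
    """Hypothetical wrapper around statistics.mean(), not a real API."""
    data = list(data)
    if nan_policy is NanPolicy.PASS:
        # No checking at all: whatever mean() does with NANs is the result.
        return statistics.mean(data)
    if any(_is_nan(x) for x in data):
        if nan_policy is NanPolicy.FAIL:
            raise ValueError("NAN found in data")  # exception type is an assumption
        if nan_policy is NanPolicy.RETURN:
            return math.nan
        if nan_policy is NanPolicy.WARN:
            warnings.warn("NANs in data were ignored", RuntimeWarning, stacklevel=2)
        # IGNORE and WARN both drop the NANs before computing the mean.
        data = [x for x in data if not _is_nan(x)]
    # If every value was a NAN, mean() raises StatisticsError on the empty list.
    return statistics.mean(data)
```

With data such as [1.0, nan, 3.0], IGNORE and WARN would both give 2.0, RETURN would give nan, and FAIL would raise.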
I don't think PASS should be the default behavior, and I'm not sure it would be productive to actually implement all of these options.
I'm not wedded to the idea that the default ought to be the current behaviour. If there is a strong argument for one of the others, I'm listening.
For reference, NumPy and pandas (the two most popular packages for data analytics in Python) support two of these modes:
- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)
RETURN is the default behavior for NumPy; IGNORE is the default for pandas.
I'm pretty sure RETURN is the right default behavior for Python's standard library and anything else should be considered a bug. It safely propagates NaNs, along the lines of IEEE float behavior.
How would you answer those who say that the right behaviour is not to propagate unwanted NANs, but to fail fast and raise an exception?
I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which are supported by NumPy or pandas:
- PASS is a license to return silently incorrect results, in return for very marginal performance benefits.
By my (very rough) preliminary testing, checking for NANs doubles the cost of calculating median() and increases the cost of mean() by about 25%.
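For anyone who wants to reproduce that kind of measurement, a rough sketch follows. It is not the test I ran: median_checked() and relative_cost() are hypothetical helpers, the NAN scan only looks at floats, and the ratio you get will depend on data size, machine, and Python version.

```python
import math
import random
import statistics
import timeit


def median_checked(data):
    # Hypothetical wrapper: scan for float NANs first, then defer to median().
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        return math.nan
    return statistics.median(data)


def relative_cost(n=10_000, number=20):
    """Return time(median_checked) / time(statistics.median) on NAN-free data."""
    rng = random.Random(0)  # fixed seed so repeated runs use the same data
    data = [rng.random() for _ in range(n)]
    plain = min(timeit.repeat(lambda: statistics.median(data),
                              number=number, repeat=3))
    checked = min(timeit.repeat(lambda: median_checked(data),
                                number=number, repeat=3))
    return checked / plain


if __name__ == "__main__":
    print(f"median with NAN check costs {relative_cost():.2f}x the plain call")
```

A ratio near 2.0 would match the doubling mentioned above, but treat any single run as indicative only.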
I'm not trying to compete with statistics libraries written in C for speed, but that doesn't mean I don't care about performance at all. The statistics library is already slower than I like and I don't want to slow it down further for the common case (numeric data with no NANs) for the sake of the uncommon case (data with NANs).
But I hear you about the "return silently incorrect results" part.
Fortunately, I think that only applies to sort-based functions like median(). mean() etc ought to propagate NANs with any reasonable implementation, but I'm reluctant to make that a guarantee in case I come up with some unreasonable implementation :-)
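A short illustration of the difference, using only the standard library. The results noted in the comments describe current CPython behaviour rather than anything documented or guaranteed: every "<" comparison involving a NAN is False, so sorted() treats each of these lists as already in order, and median() then just picks the middle element.

```python
import math
import statistics

nan = math.nan

# Same three values, different positions for the NAN.
middle = statistics.median([1.0, nan, 3.0])
front = statistics.median([nan, 1.0, 3.0])
back = statistics.median([1.0, 3.0, nan])

# On CPython this prints: nan 1.0 3.0
# The "median" depends on where the NAN happens to sit in the input.
print(middle, front, back)

# mean() sums the data, so the NAN propagates wherever it appears.
print(statistics.mean([1.0, nan, 3.0]))  # nan
print(statistics.mean([nan, 1.0, 3.0]))  # nan
```

So median() can return an ordinary-looking number for data containing a NAN, which is the silently incorrect case, while mean() at least makes the problem visible.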
This seems at odds with the intended focus of the statistics module on correctness over speed. Returning incorrect statistics should not be considered a feature that needs to be maintained.
It is only incorrect because the data violates the documented requirement that it be *numeric data*, and the undocumented requirement that the numbers have a total order. (So complex numbers are out.) I admit that the docs could be improved, but there are no guarantees made about NANs.
This doesn't mean I don't want to improve the situation! Far from it, hence this discussion.
- FAIL would make sense if statistics functions could introduce *new* NaN values. But as far as I can tell, statistics functions already raise StatisticsError in these cases (e.g., if zero data points are provided). If users are concerned about accidentally propagating NaNs, they should be encouraged to check for NaNs at the entry points of their code.
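As a sketch of what checking at the entry point could look like: require_no_nans() below is a hypothetical helper, not part of any library, and it only recognises float NaNs. The last few lines show the existing StatisticsError for empty data.

```python
import math
import statistics


def require_no_nans(data):
    """Hypothetical entry-point check: return data as a list, rejecting float NaNs."""
    data = list(data)
    for i, x in enumerate(data):
        if isinstance(x, float) and math.isnan(x):
            raise ValueError(f"NaN at position {i}")
    return data


# Validate once where the data enters the program...
clean = require_no_nans([2.5, 3.5, 4.0])
# ...then call the statistics functions without further checks.
print(statistics.median(clean))  # 3.5

# Empty input is already rejected by the module itself.
try:
    statistics.mean([])
except statistics.StatisticsError as exc:
    print("empty data:", exc)
```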
As far as I can tell, there are two kinds of people when it comes to NANs: those who think that signalling NANs are a waste of time and NANs should always propagate, and those who hate NANs and wish that they would always signal (raise an exception).
I'm not going to get into an argument about who is right or who is wrong.
- WARN is even less useful than FAIL. Seriously, who likes warnings?
NumPy uses this approach for array operations that produce NaNs (e.g., when dividing by zero), because *some* but not all results may be valid. But statistics functions return scalars.
I'm not even entirely sure it makes sense to add the IGNORE option, or at least to add it only for NaN. None is also a reasonable sentinel for a missing value in Python, and user defined types (e.g., pandas.NaT) also fall in this category. It seems a little strange to single NaN out in particular.
I am considering adding support for a dedicated "missing" value, whether it is None or a special sentinel. But one thing at a time. Ignoring NANs is moderately common in other statistics libraries, and although I personally feel that NANs shouldn't be used for missing values, I know many people do so.