Treatment of NANs in the statistics module
Terry Reedy
tjreedy at udel.edu
Fri Mar 16 22:08:42 EDT 2018
On 3/16/2018 7:16 PM, Steven D'Aprano wrote:
> The bug tracker currently has a discussion of a bug in the median(),
> median_low() and median_high() functions that they wrongly compute the
> medians in the face of NANs in the data:
>
> https://bugs.python.org/issue33084
>
> I would like to ask people how they would prefer to handle this issue:
>
> (1) Put the responsibility on the caller to strip NANs from their data.
Options 1 to 3 all put the responsibility on the caller to strip NANs to get
a sane answer. The question is what to do if the caller does not.
(1)
> If there is a NAN in your data, the result of calling median() is
> implementation-defined. This is the current behaviour, and is likely to
> be the fastest.
I hate implementation-defined behavior.
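A small sketch of why the result is implementation-defined (an illustration added here, not something from the bug report): every ordering comparison against a NAN is false, so sorting data that contains one is not well defined, and the median can change with the order of the input. The sample values below are made up.

```python
import statistics

nan = float("nan")

# Every ordering comparison involving a NAN is False, and a NAN is not
# even equal to itself, so sorted() has no consistent place to put it.
comparisons = (nan < 1.0, nan > 1.0, nan == nan)
print(comparisons)

# The same values in three different orders. The medians printed here may
# differ from one another, and between Python implementations or versions.
for data in ([nan, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, nan], [1.0, nan, 2.0, 3.0]):
    print(data, "->", statistics.median(data))
```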
> (2) Return a NAN.
I don't like NANs as implemented and used, or unused.
> (3) Raise an exception.
That leaves this.
> (4) median() should strip out NANs.
and then proceed in a deterministic fashion to give an answer.
> (5) All of the above, selected by the caller. (In which case, which would
> you prefer as the default?)
I would frame this as a choice between two alternatives: ignore_nan=False (3)
or ignore_nan=True (4). Or nan='ignore' versus nan='raise' (or 'strict'). These
are like the error-handling choices for encoding.
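A minimal sketch of what such a keyword could look like, written as a wrapper around statistics.median(). The name median_nan and the nan= parameter are hypothetical, used only to illustrate the idea; they are not part of the statistics module.

```python
import statistics


def median_nan(data, nan="raise"):
    """Hypothetical wrapper: median with an explicit policy for NANs.

    nan="raise"  -> raise ValueError if any NAN is present (option 3)
    nan="ignore" -> strip NANs, then compute the median (option 4)
    """
    if nan not in ("raise", "ignore"):
        raise ValueError("nan must be 'raise' or 'ignore'")
    values = list(data)
    # A NAN is the only value that is not equal to itself.
    clean = [x for x in values if x == x]
    if nan == "raise" and len(clean) != len(values):
        raise ValueError("NAN in data")
    # statistics.median() raises StatisticsError if nothing is left.
    return statistics.median(clean)
```

With this sketch, median_nan([1.0, float("nan"), 3.0], nan="ignore") strips the NAN and takes the median of the remaining two values, while the default nan="raise" rejects the same data with a ValueError.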
What do statistics.mean() and other functions do? The proposed
quantile() will have the same issue.
BMDP and other packages had and have general options for dealing with
missing values, and that is what NAN is.
--
Terry Jan Reedy