[Python-ideas] NAN handling in the statistics module

Sun Jan 6 21:46:51 EST 2019

I have to say though that the existing behavior of
`statistics.median[_low|_high|]` is SURPRISING if not outright wrong.  It
is the behavior in existing Python, but it is very strange.

The implementation simply does whatever `sorted()` does, which is an
implementation detail.  In particular, NaNs, being neither less than nor
greater than any floating point number, just stay where they are during
sorting.  But that's a particular feature of TimSort.  Yes, we are
guaranteed that sorts are stable; and we have rules about which things can
and cannot be compared for inequality at all.  But beyond that, I do not
think Python ever promised that NaNs would remain in the same positions
after sorting if a different stable sorting algorithm would have put them
somewhere else.
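
To make the point about comparisons concrete, here is a small sketch.  The
name `comparisons` is just mine for illustration.  I deliberately do not
state what `sorted()` prints for each input, since that is exactly the part
that depends on the sorting algorithm:

```python
from statistics import median

NAN = float('nan')

# Every ordering comparison involving a NaN is False, and a NaN is not
# even equal to itself, so a sort learns nothing about where it belongs.
comparisons = (NAN < 1.0, NAN > 1.0, NAN == NAN)
print(comparisons)  # (False, False, False)

# The same values in two different input orders.  Whatever sorted() and
# median() produce here is a consequence of the sort implementation,
# not of any documented rule about NaNs.
for data in ([NAN, 1, 2, 3, 4], [1, 2, 3, 4, NAN]):
    print(data, '->', sorted(data), 'median:', median(data))
```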

So in the incredibly unlikely event that I invent a DavidSort that behaves
better than TimSort, is stable, and compares only the same Python objects as
current CPython, a future version could use this algorithm without breaking
promises... even if NaNs sometimes sorted differently than in TimSort.
For that matter, some new implementation could use my not-nearly-as-good
DavidSort, and while being slower, would still be compliant.

Relying on that for the result of `median()` feels strange to me.  It feels
strange as the default behavior, but that's the status quo.  But it feels
even stranger that there are not at least options to deal with NaNs in more
of the signaling or poisoning ways that every other numeric library does.
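
For illustration only, here is roughly how I imagine the policies Steven
lists in the quoted message below could be layered on top of the existing
function.  This is a sketch under my own assumptions, not a proposed
implementation: `median_with_policy` and `_is_nan` are hypothetical names,
the policies are plain strings, only float NaNs are detected, and the
exception and warning types are my guesses.

```python
import math
import statistics
import warnings

POLICIES = {'IGNORE', 'FAIL', 'PASS', 'RETURN', 'WARN'}

def _is_nan(x):
    # Assumption: only float NaNs are detected; Decimal NaNs are not.
    return isinstance(x, float) and math.isnan(x)

def median_with_policy(data, *, nan_policy='PASS'):
    """Hypothetical wrapper; not part of the statistics module."""
    if nan_policy not in POLICIES:
        raise ValueError(f'unknown nan_policy: {nan_policy!r}')
    data = list(data)
    if nan_policy == 'PASS':
        # Caller takes responsibility; no scan for NaNs at all.
        return statistics.median(data)
    clean = [x for x in data if not _is_nan(x)]
    if len(clean) == len(data):
        # No NaN present, so every policy agrees.
        return statistics.median(data)
    if nan_policy == 'FAIL':
        raise ValueError('NaN found in data')
    if nan_policy == 'RETURN':
        return math.nan
    if nan_policy == 'WARN':
        warnings.warn('NaN found in data; ignoring it', RuntimeWarning)
    # IGNORE and WARN both compute the result from the non-NaN values.
    return statistics.median(clean)
```

With this sketch, `median_with_policy([float('nan'), 1, 2, 3, 4],
nan_policy='IGNORE')` gives 2.5 regardless of where the NaN sits in the
input, which is the order-independence the current default lacks.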

On Sun, Jan 6, 2019 at 7:28 PM Steven D'Aprano <steve at pearwood.info> wrote:

> Bug #33084 reports that the statistics library calculates median and
> other stats wrongly if the data contains NANs. Worse, the result depends
> on the initial placement of the NAN:
>
> py> from statistics import median
> py> NAN = float('nan')
> py> median([NAN, 1, 2, 3, 4])
> 2
> py> median([1, 2, 3, 4, NAN])
> 3
>
> See the bug report for more detail:
>
> https://bugs.python.org/issue33084
>
>
> The caller can always filter NANs out of their own data, but following
> the lead of some other stats packages, I propose a standard way for the
> statistics module to do so. I hope this will be uncontroversial (he
> says, optimistically...) but just in case, here is some prior art:
>
> (1) Nearly all R stats functions take a "na.rm" argument which defaults
> to False; if True, NA and NAN values will be stripped.
>
> (2) The scipy.stats.ttest_ind function takes a "nan_policy" argument
> which specifies what to do if a NAN is seen in the data.
>
>
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
>
> (3) At least some Matlab functions, such as mean(), take an optional
> flag that determines whether to ignore NANs or include them.
>
> https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
>
>
> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
>     IGNORE:  quietly ignore all NANs
>     FAIL:  raise an exception if any NAN is seen in the data
>     PASS:  pass NANs through unchanged (the default)
>     RETURN:  return a NAN if any NAN is seen in the data
>     WARN:  ignore all NANs but raise a warning if one is seen
>
> PASS is equivalent to saying that you, the caller, have taken full
> responsibility for filtering out NANs and there's no need for the
> function to slow down processing by doing so again. Either that, or you
> want the current implementation-dependent behaviour.
>
> FAIL is equivalent to treating all NANs as "signalling NANs". The
> presence of a NAN is an error.
>
> RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a
> calculation causes it to return a NAN, allowing NANs to propagate
> through multiple calculations.
>
> IGNORE and WARN are the same, except IGNORE is silent and WARN raises a
> warning.
>
> Questions:
>
> - does anyone have any serious objections to this?
>
> - what do you think of the names for the policies?
>
> - are there any additional policies that you would like to see?
>   (if so, please give use-cases)
>
> - are you happy with the default?
>
>
> Bike-shed away!
>
>
>
> --
> Steve

-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.