[Python-ideas] NAN handling in the statistics module
Steven D'Aprano
steve at pearwood.info
Mon Jan 7 01:26:30 EST 2019
On Sun, Jan 06, 2019 at 10:52:47PM -0500, David Mertz wrote:
> Playing with Tim's examples, this suggests that statistics.median() is
> simply outright WRONG. I can think of absolutely no way to characterize
> these as reasonable results:
>
> Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
> In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> Out[4]: 1
> In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> Out[5]: nan
The second is possibly correct if one thinks that the median of a list
containing NAN should return NAN -- but it's only correct by accident,
not design.
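A minimal sketch of why the result depends on where the NAN happens to
sit: every ordering comparison involving a float NAN is false, so
sorted() has no consistent place to put it. The variable names here are
illustrative only, and the exact order printed may differ between Python
versions.

```python
nan = float("nan")

# Every ordering comparison with a NAN is False, in both directions,
# so a NAN has no well-defined position in a sorted list.
comparisons = [nan < 1, nan > 1, 1 < nan, 1 > nan, nan == nan]
print(comparisons)

# sorted() relies on "<", so where the NAN ends up depends on the
# input order and the sorting algorithm, not on the NAN's "value".
print(sorted([9, 9, 9, nan, 1, 2, 3, 4, 5]))
print(sorted([1, 2, 3, 4, 5, nan, 9, 9, 9]))
```

Since median() picks the middle of the sorted data, whatever lands there
is what gets returned.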
As I wrote on the bug tracker:
"I agree that the current implementation-dependent behaviour when there
are NANs in the data is troublesome."
The only reason why I don't call it a bug is that median() makes no
promises about NANs at all, any more than it makes promises about the
median of a list of sets or any other values which don't define a total
order. help(median) says:
    Return the median (middle value) of numeric data.
By definition, data containing Not A Number values isn't numeric :-)
I'm not opposed to documenting this better. Patches welcome :-)
There are at least three correct behaviours in the face of data
containing NANs: propagate a NAN result, fail fast with an exception, or
treat NANs as missing data that can be ignored. Only the caller can
decide which is the right policy for their data set.
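As a rough illustration of those three policies, here is a sketch of a
wrapper the caller could write around median(). The function name
median_with_nan_policy and its nan_policy parameter are hypothetical,
not part of the statistics module, and the sketch only looks for float
NANs.

```python
import math
from statistics import median


def _is_float_nan(x):
    # Only float NANs are detected in this sketch; Decimal NANs are not.
    return isinstance(x, float) and math.isnan(x)


def median_with_nan_policy(data, nan_policy="raise"):
    """Hypothetical wrapper: choose how float NANs are treated."""
    data = list(data)
    has_nan = any(_is_float_nan(x) for x in data)
    if nan_policy == "propagate":
        # Any NAN in the data makes the result NAN.
        return math.nan if has_nan else median(data)
    if nan_policy == "raise":
        # Fail fast instead of returning an order-dependent answer.
        if has_nan:
            raise ValueError("data contains NAN")
        return median(data)
    if nan_policy == "omit":
        # Treat NANs as missing data and ignore them.
        return median([x for x in data if not _is_float_nan(x)])
    raise ValueError("unknown nan_policy: %r" % (nan_policy,))
```

With the data from the quoted example, "omit" takes the median of the
eight remaining values, "propagate" returns a NAN, and "raise" raises
ValueError.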
Aside: the IEEE-754 standard provides both signalling and quiet NANs. It
is hard and unreliable to generate signalling float NANs in Python, but
we can do it with Decimal:
py> from statistics import median
py> from decimal import Decimal
py> median([1, 3, 4, Decimal("sNAN"), 2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/statistics.py", line 349, in median
    data = sorted(data)
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]
In principle, one ought to be able to construct float signalling NANs
too, but unfortunately that's platform dependent:
https://mail.python.org/pipermail/python-dev/2018-November/155713.html
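For the curious, a sketch of one way to attempt it with the struct
module. The bit pattern below is an assumption: it follows the common
IEEE-754 convention (exponent all ones, quiet bit clear, non-zero
payload), which not every platform shares, and the hardware may quietly
convert the value on the way through.

```python
import math
import struct

# Assumed signalling-NAN bit pattern: exponent all ones, quiet bit
# (bit 51) clear, non-zero payload. Some platforms use other conventions.
SNAN_BITS = 0x7FF0000000000001

# Reinterpret the 64-bit integer as a double.
snan = struct.unpack("<d", struct.pack("<Q", SNAN_BITS))[0]

# Round-trip back to bits to see whether the quiet bit got set along
# the way. The answer is platform dependent.
round_trip = struct.unpack("<Q", struct.pack("<d", snan))[0]
quiet_bit_set = bool(round_trip & (1 << 51))

print(math.isnan(snan), hex(round_trip), quiet_bit_set)
```

Either way the value is a NAN; whether it is still signalling afterwards
is exactly the part that cannot be relied on.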
Back to the topic at hand: I agree that median() does "the wrong thing"
when NANs are involved, but there is no one "right thing" that we can do
in its place. People disagree as to whether NANs should propagate, or
raise, or be treated as missing data, and I see good arguments for all
three.
--
Steve