[Python-ideas] NAN handling in the statistics module

Steven D'Aprano steve at pearwood.info
Mon Jan 7 01:26:30 EST 2019


On Sun, Jan 06, 2019 at 10:52:47PM -0500, David Mertz wrote:

> Playing with Tim's examples, this suggests that statistics.median() is
> simply outright WRONG.  I can think of absolutely no way to characterize
> these as reasonable results:
> 
> Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
> In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> Out[4]: 1
> In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> Out[5]: nan

The second is possibly correct if one thinks that the median of a list 
containing NAN should return NAN -- but it's only correct by accident, 
not design.
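
To illustrate why the result is accidental: a NAN compares false against 
everything, so the sort that median() performs has no consistent order to 
work with. The following is only a minimal sketch, assuming ordinary float 
NANs. Where sorted() leaves the NAN is an implementation detail, so no 
particular output is promised for the last line:

```python
import math

nan = math.nan

# Every comparison involving a NAN is False: it is neither smaller nor
# larger than any other value, and it is not even equal to itself.
comparisons = [nan < 1, nan > 1, nan == 1, nan == nan]
print(comparisons)

# sorted() relies on "<" giving a consistent ordering. With a NAN in the
# data that assumption fails, so the resulting order is unspecified.
data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
print(sorted(data))
```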

As I wrote on the bug tracker:

"I agree that the current implementation-dependent behaviour when there 
are NANs in the data is troublesome."

The only reason why I don't call it a bug is that median() makes no 
promises about NANs at all, any more than it makes promises about the 
median of a list of sets or any other values which don't define a total 
order. help(median) says:

    Return the median (middle value) of numeric data.


By definition, data containing Not A Number values isn't numeric :-)

I'm not opposed to documenting this better. Patches welcome :-)

There are at least three correct behaviours in the face of data 
containing NANs: propagate a NAN result, fail fast with an exception, or 
treat NANs as missing data that can be ignored. Only the caller can 
decide which is the right policy for their data set.
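
The three policies could be sketched as a wrapper around median(). This 
is an illustration only, not part of the statistics module: the function 
name median_nan_policy and its policy argument are hypothetical, and it 
assumes that NANs appear only as floats (Decimal NANs are not detected).

```python
import math
from statistics import median


def _is_float_nan(x):
    # Assumption: only float NANs are recognised here.
    return isinstance(x, float) and math.isnan(x)


def median_nan_policy(data, policy="raise"):
    """Hypothetical helper: median with an explicit NAN policy.

    policy is one of "propagate", "raise" or "ignore".
    """
    if policy not in ("propagate", "raise", "ignore"):
        raise ValueError("unknown policy: %r" % (policy,))
    data = list(data)
    if not any(_is_float_nan(x) for x in data):
        # No NANs present, so the ordinary median is well defined.
        return median(data)
    if policy == "propagate":
        return math.nan
    if policy == "raise":
        raise ValueError("data contains a NAN")
    # policy == "ignore": treat NANs as missing values and drop them.
    return median([x for x in data if not _is_float_nan(x)])
```

With data = [9, 9, 9, float("nan"), 1, 2, 3, 4, 5], the "ignore" policy 
computes the median of the eight remaining values, "propagate" returns a 
NAN, and "raise" raises ValueError. The caller still has to choose.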

Aside: the IEEE-754 standard provides both signalling and quiet NANs. It 
is hard and unreliable to generate signalling float NANs in Python, but 
we can do it with Decimal:

py> from statistics import median
py> from decimal import Decimal
py> median([1, 3, 4, Decimal("sNAN"), 2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/statistics.py", line 349, in median
    data = sorted(data)
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]


In principle, one ought to be able to construct float signalling NANs 
too, but unfortunately that's platform dependent:

https://mail.python.org/pipermail/python-dev/2018-November/155713.html

Back to the topic at hand: I agree that median() does "the wrong thing" 
when NANs are involved, but there is no one "right thing" that we can do 
in its place. People disagree as to whether NANs should propagate, or 
raise, or be treated as missing data, and I see good arguments for all 
three.


-- 
Steve

