On Sun, Jan 06, 2019 at 10:52:47PM -0500, David Mertz wrote:
Playing with Tim's examples, this suggests that statistics.median() is simply outright WRONG. I can think of absolutely no way to characterize these as reasonable results:
    Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
    In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
    Out[4]: 1
    In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
    Out[5]: nan
The second is possibly correct if one thinks that the median of a list containing NAN should return NAN -- but it's only correct by accident, not design. As I wrote on the bug tracker:

    "I agree that the current implementation-dependent behaviour when there are NANs in the data is troublesome."

The only reason why I don't call it a bug is that median() makes no promises about NANs at all, any more than it makes promises about the median of a list of sets or any other values which don't define a total order. help(median) says:

    Return the median (middle value) of numeric data.

By definition, data containing Not A Number values isn't numeric :-)

I'm not opposed to documenting this better. Patches welcome :-)

There are at least three correct behaviours in the face of data containing NANs: propagate a NAN result, fail fast with an exception, or treat NANs as missing data that can be ignored. Only the caller can decide which is the right policy for their data set.

Aside: the IEEE-754 standard provides both signalling and quiet NANs. It is hard and unreliable to generate signalling float NANs in Python, but we can do it with Decimal:

    py> from statistics import median
    py> from decimal import Decimal
    py> median([1, 3, 4, Decimal("sNAN"), 2])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/statistics.py", line 349, in median
        data = sorted(data)
    decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]

In principle, one ought to be able to construct float signalling NANs too, but unfortunately that's platform dependent:

https://mail.python.org/pipermail/python-dev/2018-November/155713.html

Back to the topic at hand: I agree that median() does "the wrong thing" when NANs are involved, but there is no one "right thing" that we can do in its place. People disagree as to whether NANs should propagate, or raise, or be treated as missing data, and I see good arguments for all three.

-- 
Steve
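
For what it's worth, the three policies mentioned above could be sketched as a thin wrapper around statistics.median(). This is only an illustration under stated assumptions: the names median_with_policy and nan_policy are made up for this example and are not part of the statistics module, and the helper only recognises float and Decimal NANs.

```python
import math
import statistics
from decimal import Decimal


def _is_nan(x):
    # Decimal.is_nan() is true for both quiet and signalling NANs
    # and does not trigger the InvalidOperation signal.
    if isinstance(x, Decimal):
        return x.is_nan()
    return isinstance(x, float) and math.isnan(x)


def median_with_policy(data, nan_policy="raise"):
    """Hypothetical wrapper: the caller chooses how NANs are handled.

    nan_policy is one of:
      "raise"     -- fail fast with ValueError if any NAN is present
      "propagate" -- return a float NAN if any NAN is present
      "omit"      -- treat NANs as missing data and ignore them
    """
    if nan_policy not in ("raise", "propagate", "omit"):
        raise ValueError("unknown nan_policy: %r" % (nan_policy,))
    data = list(data)
    if any(_is_nan(x) for x in data):
        if nan_policy == "raise":
            raise ValueError("data contains NAN")
        if nan_policy == "propagate":
            return math.nan
        # "omit": drop the NANs before sorting ever sees them
        data = [x for x in data if not _is_nan(x)]
    return statistics.median(data)
```

Because the NANs are dealt with before median() sorts the data, the result no longer depends on where the NAN happens to sit in the list. Which of the three policies is the right default is exactly the open question; the sketch merely makes the choice explicit.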