> In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> Out[4]: 1
> In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> Out[5]: nan
The second is possibly correct if one thinks that the median of a list
containing NAN should return NAN -- but it's only correct by accident,
not design.
Exactly... in the second example, the nan just happens to wind up "in the middle" of the sorted() list. The fact that it is the return value has nothing to do with propagating the nan (if it did, I think it would be a reasonable answer). I contrived the examples to get these results... the first answer, which is the "most wrong number," is selected for the same reason: a nan is "near the middle."
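For anyone following along, here is a minimal sketch of why this happens: every ordering comparison involving a nan is False, so sorted() has no consistent place to put it. The variable names are only illustrative, and the order sorted() ends up producing is an implementation detail, so no expected output is shown for it.

```python
import math

nan = float("nan")

# Every ordering comparison against a nan is False, and a nan is not
# even equal to itself.
comparisons = (nan < 1, nan > 1, nan == 1, nan == nan)
print(comparisons)

data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
ordered = sorted(data)  # no exception, but the result need not be ordered

# Check whether the non-nan values actually came out in ascending order.
non_nan = [x for x in ordered if not math.isnan(x)]
print(ordered)
print("non-nan values in order:", non_nan == sorted(non_nan))
```

Whatever ends up in the middle position of `ordered` is what a sort-based median will report, whether or not that value is anywhere near the true middle of the data.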
I'm not opposed to documenting this better. Patches welcome :-)
I'll provide a suggested patch on the bug. It will simply be a wholly different implementation of median and friends.
There are at least three correct behaviours in the face of data
containing NANs: propagate a NAN result, fail fast with an exception, or
treat NANs as missing data that can be ignored. Only the caller can
decide which is the right policy for their data set.
I'm not sure that raising right away is necessary as an option. That feels like something a user could catch at the end when they get a NaN result. But those seem reasonable as three options.
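To make the three options concrete, here is a rough sketch of a wrapper that lets the caller pick the policy. The function name `median_with_nan_policy`, the `nan_policy` parameter, and its values "propagate", "raise", and "omit" are all assumptions made for illustration; none of them is part of the statistics module, and this is not the patch mentioned above.

```python
import math
import statistics


def median_with_nan_policy(data, nan_policy="propagate"):
    """Median with an explicit, caller-chosen NaN policy (hypothetical helper)."""
    if nan_policy not in ("propagate", "raise", "omit"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")

    values = list(data)
    # Separate out the NaNs before anything gets sorted.
    clean = [x for x in values if not math.isnan(x)]

    if len(clean) != len(values):  # at least one NaN was present
        if nan_policy == "propagate":
            return math.nan
        if nan_policy == "raise":
            raise ValueError("data contains NaN")
        # nan_policy == "omit": fall through and use only the clean values

    # statistics.median raises StatisticsError if nothing is left.
    return statistics.median(clean)


nan = float("nan")
data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
print(median_with_nan_policy(data, "propagate"))
print(median_with_nan_policy(data, "omit"))
```

With "propagate" the result is a nan by design rather than by accident, and with "omit" the median is taken over the eight remaining values. Note that "omit" on data consisting only of NaNs still fails, because statistics.median rejects empty input.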