On Mon, Jan 7, 2019 at 1:27 AM Steven D'Aprano <steve@pearwood.info> wrote:
In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
Out[4]: 1
In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
Out[5]: nan
The second is possibly correct if one thinks that the median of a list containing NAN should return NAN -- but it's only correct by accident, not design.
Exactly... in the second example, the nan just happens to wind up "in the middle" of the sorted() list. The fact that it is the return value has nothing to do with propagating the nan (if it did, I think it would be a reasonable answer). I contrived the examples to get these results... the first answer, which is the "most wrong number", is also selected for the same reason: a nan is "near the middle."
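A minimal sketch of why this happens, assuming only the standard float comparison rules: every ordering comparison against a nan is False, so sorted() has no consistent place to put it. The variable names here are just for illustration, and exactly where the nan lands depends on the sort implementation and on the input order.

```python
import math

nan = math.nan

# Every comparison involving nan is False, including equality with itself.
comparisons = [nan < 1, nan > 1, nan == 1, nan == nan]
print(comparisons)

# The same values in a different input order: sorted() is only well defined
# for a consistent ordering, so the two results need not agree.
a = sorted([9, 9, 9, nan, 1, 2, 3, 4, 5])
b = sorted([1, 2, 3, 4, 5, 9, 9, 9, nan])
print(a)
print(b)
```

Whatever element ends up at the middle index of that list is what median() hands back.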
I'm not opposed to documenting this better. Patches welcome :-)
I'll provide a suggested patch on the bug. It will simply be a wholly different implementation of median and friends.
There are at least three correct behaviours in the face of data containing NANs: propagate a NAN result, fail fast with an exception, or treat NANs as missing data that can be ignored. Only the caller can decide which is the right policy for their data set.
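The three policies could be sketched roughly as below. This is only an illustration: median_with_nan_policy and its nan_policy parameter are hypothetical names, not part of the statistics module, and the sketch only recognises float nans.

```python
import math
import statistics


def _is_nan(x):
    # Only float NaNs are handled in this sketch (not Decimal('NaN'), etc.).
    return isinstance(x, float) and math.isnan(x)


def median_with_nan_policy(data, nan_policy="propagate"):
    """Hypothetical wrapper around statistics.median with explicit NaN handling."""
    if nan_policy not in ("propagate", "raise", "omit"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    values = list(data)
    if any(_is_nan(x) for x in values):
        if nan_policy == "propagate":
            return math.nan  # a NaN in the data yields a NaN result
        if nan_policy == "raise":
            raise ValueError("data contains NaN")  # fail fast
        # "omit": treat NaNs as missing data and drop them
        values = [x for x in values if not _is_nan(x)]
    return statistics.median(values)


data = [9, 9, 9, math.nan, 1, 2, 3, 4, 5]
print(median_with_nan_policy(data))          # nan
print(median_with_nan_policy(data, "omit"))  # 4.5
```

The caller picks the policy explicitly, so the result no longer depends on where the nan happens to land after sorting.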
I'm not sure that raising right away is necessary as an option. That feels like something a user could catch at the end when they get a NaN result. But those seem reasonable as three options.

-- 
Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.