On Mon, Jan 7, 2019 at 1:27 AM Steven D'Aprano <steve@pearwood.info> wrote:

> In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> Out[4]: 1
> In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> Out[5]: nan

> The second is possibly correct if one thinks that the median of a list
> containing NAN should return NAN -- but it's only correct by accident,
> not design.

Exactly... in the second example, the nan just happens to wind up "in the middle" of the sorted() list. The fact that it is the return value has nothing to do with propagating the nan (if it did, I think it would be a reasonable answer). I contrived the examples to get these... the first answer, which is the "most wrong number," is selected for the same reason: a nan is "near the middle."
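
A minimal sketch of the underlying effect, assuming only that every ordering comparison against a NaN is False. The variable names are just for illustration, and the exact order sorted() produces is an implementation detail, so no output is claimed for it here:

```python
nan = float("nan")

# Every ordering comparison involving NaN is False (and NaN != NaN),
# so sorted() has no consistent place to put it among the other values.
comparisons = [nan < 1, nan > 1, nan == nan, 1 < nan, 1 > nan]
print(comparisons)

data = [9, 9, 9, nan, 1, 2, 3, 4, 5]
ordered = sorted(data)

# A median built on sorted() simply picks whatever lands in the middle,
# which depends on where the NaN happened to sit in the input.
middle = ordered[len(ordered) // 2]
print(ordered, middle)
```

Moving the nan to a different position in the input can change which element ends up in the middle, which is why the two results above look so different.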

> I'm not opposed to documenting this better. Patches welcome :-)

I'll provide a suggested patch on the bug. It will simply be a wholly different implementation of median and friends.

> There are at least three correct behaviours in the face of data
> containing NANs: propagate a NAN result, fail fast with an exception, or
> treat NANs as missing data that can be ignored. Only the caller can
> decide which is the right policy for their data set.

I'm not sure that raising right away is necessary as an option. That feels like something a user could catch at the end when they get a NaN result. But those seem reasonable as three options.
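
To make the three policies concrete, here is a rough sketch of what a caller-selected policy could look like. The function name median_with_policy and the nan_policy parameter are hypothetical, not part of the statistics module, and the sketch assumes the data are real numbers that math.isnan() accepts:

```python
import math
import statistics


def median_with_policy(data, nan_policy="propagate"):
    """Hypothetical wrapper around statistics.median with a NaN policy.

    nan_policy is one of "propagate", "raise", or "omit".
    """
    if nan_policy not in ("propagate", "raise", "omit"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")

    values = list(data)
    clean = [x for x in values if not math.isnan(x)]

    if len(clean) != len(values):  # at least one NaN was present
        if nan_policy == "propagate":
            return math.nan
        if nan_policy == "raise":
            raise ValueError("data contains NaN")
        # "omit": fall through and use only the non-NaN values

    return statistics.median(clean)
```

With the example data from earlier, "propagate" returns nan, "raise" raises ValueError, and "omit" computes the median of the eight remaining values.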

Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons. Intellectual property is
to the 21st century what the slave trade was to the 16th.