[Python-ideas] NAN handling in the statistics module

Mon Jan 7 01:34:47 EST 2019

On Mon, Jan 7, 2019 at 1:27 AM Steven D'Aprano <steve at pearwood.info> wrote:

> > In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> > Out[4]: 1
> > In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> > Out[5]: nan
>
> The second is possibly correct if one thinks that the median of a list
> containing NAN should return NAN -- but it's only correct by accident,
> not design.
>

Exactly... in the second example, the nan just happens to wind up "in the
middle" of the sorted() list.  The fact that it is the return value has
nothing to do with propagating the nan (if it did, I think it would be a
reasonable answer).  I contrived the examples to get these results... the
first answer, which is the "most wrong" number, is selected for the same
reason: a nan is "near the middle."

> I'm not opposed to documenting this better. Patches welcome :-)
>

I'll provide a suggested patch on the bug.  It will simply be a wholly
different implementation of median and friends.

> There are at least three correct behaviours in the face of data
> containing NANs: propagate a NAN result, fail fast with an exception, or
> treat NANs as missing data that can be ignored. Only the caller can
> decide which is the right policy for their data set.

I'm not sure that raising right away is necessary as an option.  That feels
like something a user could catch at the end when they get a NaN result.
But those seem reasonable as three options.
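
As a rough sketch of what those three options could look like (this is
not the patch itself; the function name median_with_nan_policy and the
nan_policy parameter with its values "propagate", "raise" and "omit"
are assumptions of mine, not anything in the statistics module):

```python
import statistics


def median_with_nan_policy(data, nan_policy="propagate"):
    """Median with an explicit NaN policy (hypothetical wrapper).

    nan_policy is an assumed parameter, not part of the statistics module:
      "propagate" -> return NaN if any value is NaN
      "raise"     -> raise ValueError if any value is NaN
      "omit"      -> drop NaNs and take the median of the rest
    """
    values = list(data)
    # x != x is True only for NaN values.
    clean = [x for x in values if x == x]
    has_nan = len(clean) != len(values)

    if has_nan:
        if nan_policy == "propagate":
            return float("nan")
        if nan_policy == "raise":
            raise ValueError("data contains NaN")
        if nan_policy != "omit":
            raise ValueError(f"unknown nan_policy: {nan_policy!r}")

    # Safe to sort now: no NaNs remain in clean.
    return statistics.median(clean)


nan = float("nan")
data = [9, 9, 9, nan, 1, 2, 3, 4, 5]

print(median_with_nan_policy(data, "omit"))       # 4.5
print(median_with_nan_policy(data, "propagate"))  # nan
```

With "omit" the remaining eight values sort to [1, 2, 3, 4, 5, 9, 9, 9],
so the median is the mean of 4 and 5.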

-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.