[Python-ideas] NAN handling in the statistics module

Tim Peters tim.peters at gmail.com
Wed Jan 9 01:11:28 EST 2019


[David Mertz <mertz at gnosis.cx>]
> I think consistent NaN-poisoning would be excellent behavior.  It will
> always make sense for median (and its variants).
>
>> >>> statistics.mode([2, 2, nan, nan, nan])
>> nan
>> >>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
>> 2
>
>
> But in the mode case, I'm not sure we should ALWAYS treat a NaN as
> poisoning the result.

I am:  I thought about the following but didn't write about it because
it's too strained to be of actual sane use ;-)

>  If NaN means "missing value" then sometimes it could change things,
> and we shouldn't guess.  But what if it cannot?
>
>     >>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3])
>
> No matter what missing value we take those nans to maybe-possibly represent, 9
> is still the most common element.  This is only true when the most common thing
> occurs at least as often as the 2nd most common thing PLUS the number
> of all NaNs.  But in that case, 9 really is the mode.

See "too strained" above.
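(For concreteness, the check being described amounts to something
like the following; a rough sketch only, with a made-up name, and
looking only at float NaNs:)

    from collections import Counter
    from math import isnan

    def mode_is_nan_proof(data):
        # Split the data into float NaNs and everything else.
        nans = [x for x in data if isinstance(x, float) and isnan(x)]
        rest = [x for x in data if not (isinstance(x, float) and isnan(x))]
        counts = Counter(rest).most_common()
        if not counts:
            return False
        top = counts[0][1]
        runner_up = counts[1][1] if len(counts) > 1 else 0
        # Per the condition stated above: the leader stays the most
        # common element no matter what the NaNs "really are" only if
        # it beats the runner-up even after handing every NaN to the
        # runner-up.
        return top >= runner_up + len(nans)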

It's equally true that, e.g., the _median_ of your list above:

    [9, 9, 9, 9, nan1, nan2, nan3]

is also 9 regardless of what values are plugged in for the nans.  That
may be easier to realize at first with a simpler list, like

    [5, 5, nan]

It sounds essentially useless to me, just a theoretical possibility
that would make a mess of implementations to cater to.
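(Brute force over a few stand-in values makes the point, purely as an
illustration:)

    import statistics

    # Whatever value the nan is imagined to stand for, two of the
    # three elements are 5, so 5 is always the middle of the sorted
    # list and hence the median.
    for stand_in in (-1e300, 0.0, 5, 7, 1e300):
        assert statistics.median([5, 5, stand_in]) == 5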

"The right" (obvious, unsurprising, useful, easy to implement, easy to
understand) non-exceptional behavior in the presence of NaNs is to
pretend they weren't in the list to begin with.  But I'd rather
people ask for that _if_ that's what they want.
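(If someone does want that, it's already easy to spell out by hand; a
minimal sketch, with a made-up name, handling only float NaNs:)

    import statistics
    from math import isnan

    def without_nans(data):
        # Drop float NaNs; other NaN-like values (e.g. Decimal NaN)
        # would need their own test.
        return [x for x in data if not (isinstance(x, float) and isnan(x))]

    statistics.median(without_nans([5, 5, float("nan")]))  # 5.0
    statistics.mode(without_nans([2, 2, float("nan")]))    # 2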

