[Python-ideas] NAN handling in the statistics module
Tim Peters
tim.peters at gmail.com
Wed Jan 9 01:11:28 EST 2019
[David Mertz <mertz at gnosis.cx>]
> I think consistent NaN-poisoning would be excellent behavior. It will
> always make sense for median (and its variants).
>
>> >>> statistics.mode([2, 2, nan, nan, nan])
>> nan
>> >>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
>> 2
>
>
> But in the mode case, I'm not sure we should ALWAYS treat a NaN as
> poisoning the result.
I am: I thought about the following but didn't write about it because
it's too strained to be of actual sane use ;-)
> If NaN means "missing value" then sometimes it could change things,
> and we shouldn't guess. But what if it cannot?
>
> >>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3])
>
> No matter what missing value we take those nans to maybe-possibly represent, 9
> is still the most common element. This is only true when the most common thing
> occurs at least as often as the 2nd most common thing PLUS the number
> of all NaNs. But in that case, 9 really is the mode.
See "too strained" above.
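For concreteness, the count test David is describing is easy to write down.
The helper below is a throwaway sketch (the name is made up, and nothing like
it is being proposed for the module); it just checks that the most common
non-NaN value's count is at least the runner-up's count plus the number of
NaNs, exactly as stated above:

    from collections import Counter
    from math import isnan, nan

    def mode_unaffected_by_nans(data):
        # True iff the most common non-NaN value remains most common no
        # matter what real values the NaNs are imagined to stand for.
        nan_count = sum(1 for x in data if isnan(x))
        counts = Counter(x for x in data if not isnan(x)).most_common()
        if not counts:
            return False
        top = counts[0][1]
        runner_up = counts[1][1] if len(counts) > 1 else 0
        return top >= runner_up + nan_count

    print(mode_unaffected_by_nans([9, 9, 9, 9, nan, nan, nan]))  # True
    print(mode_unaffected_by_nans([9, 9, nan, nan, nan]))        # False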
It's equally true that, e.g., the _median_ of your list above:
[9, 9, 9, 9, nan1, nan2, nan3]
is also 9 regardless of what values are plugged in for the nans. That
may be easier to realize at first with a simpler list, like
[5, 5, nan]
It sounds essentially useless to me: just theoretically possible, and
catering to it would make a mess of implementations.
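That's easy to sanity-check by brute force. The stand-in values below are
arbitrary (anything from -inf to +inf would do); the point is only that the
middle of the sorted values can't move off the repeated value:

    from itertools import product
    from math import inf
    from statistics import median

    stand_ins = (-inf, 0, 5, 9, 100, inf)   # arbitrary guesses for the NaNs

    # [5, 5, x]: whatever x is, the middle of the sorted triple is 5.
    assert all(median([5, 5, x]) == 5 for x in stand_ins)

    # [9, 9, 9, 9, n1, n2, n3]: with four 9s and three unknowns, the
    # 4th-smallest of the seven values is always one of the 9s.
    assert all(median([9, 9, 9, 9, *ns]) == 9
               for ns in product(stand_ins, repeat=3))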
"The right" (obvious, unsurprising, useful, easy to implement, easy to
understand) non-exceptional behavior in the presence of NaNs is to
pretend they weren't in the list to begin with. But I'd rather
people ask for that _if_ that's what they want.
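Asking for it explicitly is already cheap today; something along these lines,
where the helper name is just for illustration, not anything proposed:

    from math import isnan, nan
    from statistics import median, mode

    def without_nans(data):
        # Drop NaNs up front, so the statistics functions never see them.
        return [x for x in data if not isnan(x)]

    print(median(without_nans([5, 5, nan])))          # 5
    print(mode(without_nans([2, 2, nan, nan, nan])))  # 2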