[Python-ideas] NAN handling in the statistics module

David Mertz mertz at gnosis.cx
Mon Jan 7 10:05:19 EST 2019

On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano <steve at pearwood.info> wrote:

> > I'll provide a suggested patch on the bug.  It will simply be a wholly
> > different implementation of median and friends.
> I ask for a documentation patch and you start talking about a whole new
> implementation. Huh.
> A new implementation with precisely the same behaviour is a waste of
> time, so I presume you're planning to change the behaviour. How about if
> you start off by explaining what the new semantics are?

I think it would be counter-productive to document the bug (as something
other than a bug).  Picking a completely arbitrary element in the face of
a non-total order can never be "correct" behavior, and is never worth
preserving for compatibility.  I think the use of statistics.median against
partially ordered elements is simply rare enough that no one tripped
over it, or at least no one reported it before.

Notice that the code itself pretty much recognizes the bug in this comment:

# FIXME: investigate ways to calculate medians without sorting? Quickselect?

So it seems like the original author knew the implementation was wrong.
But you're right, the new behavior needs to be decided.  Propagating NaNs
is reasonable.  Filtering out NaNs is reasonable.  Those are the default
behaviors of NumPy and Pandas, respectively:

import numpy as np
import pandas as pd
np.median([1, 2, 3, np.nan])  # -> nan
pd.Series([1, 2, 3, np.nan]).median()  # -> 2.0

(Yes, of course there are ways in each to get the other behavior).  Other
non-Python tools similarly suggest one of those behaviors, but really
nothing else.
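
For reference, the "other behavior" in each library can be had roughly as
follows.  This is just a sketch using np.nanmedian and the skipna parameter
of Series.median; the variable names are mine:

```python
import numpy as np
import pandas as pd

data = [1, 2, 3, np.nan]

# NumPy: propagate by default, ignore NaNs with nanmedian.
propagate_np = np.median(data)
ignore_np = np.nanmedian(data)

# Pandas: ignore NaNs by default, propagate with skipna=False.
ignore_pd = pd.Series(data).median()
propagate_pd = pd.Series(data).median(skipna=False)

print(propagate_np, ignore_np, ignore_pd, propagate_pd)
```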

So yeah, what I was suggesting as a patch was an implementation that had
PROPAGATE and IGNORE semantics.  I don't have a real opinion about which
should be the default, but the current behavior should simply not exist at
all.  As I think about it, warnings and exceptions are really too complex
an API for this module.  It's not hard to manually check for NaNs and
generate those in your own code.
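
As a rough sketch of what I mean (not the actual patch): the function name
median_with_policy, the nan_policy parameter, and the helper require_no_nans
below are all hypothetical, invented only for illustration.  The first
function shows PROPAGATE and IGNORE semantics; the second shows the kind of
manual check a caller could write to get an exception instead:

```python
import math
import statistics


def _is_nan(x):
    # A NaN is the only kind of value that compares unequal to itself.
    return x != x


def median_with_policy(data, nan_policy="propagate"):
    # Hypothetical API: "propagate" returns NaN if any NaN is present,
    # "ignore" drops NaNs before computing the median.
    values = list(data)
    if nan_policy == "propagate":
        if any(_is_nan(x) for x in values):
            return math.nan
    elif nan_policy == "ignore":
        values = [x for x in values if not _is_nan(x)]
    else:
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    # With no NaNs left, the ordering is total and sorting is safe.
    return statistics.median(values)


def require_no_nans(data):
    # Manual check a caller can do when an exception is what they want.
    values = list(data)
    if any(_is_nan(x) for x in values):
        raise ValueError("data contains NaN")
    return values


sample = [1, math.nan, 2, 3]
propagated = median_with_policy(sample)
ignored = median_with_policy(sample, nan_policy="ignore")
```

Here propagated is NaN and ignored is 2, wherever the NaN sits in the input.
With "ignore", data that is all NaNs ends up empty and statistics.median
raises StatisticsError, the same as for any empty input.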

Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.