[Numpy-discussion] min() of array containing NaN

robert.kern at gmail.com robert.kern at gmail.com
Thu Aug 14 14:10:09 EDT 2008


On 2008-08-14, Joe Harrington <jh at physics.ucf.edu> wrote:
>> I'm doing nothing. Someone else must volunteer.
>
> Fair enough.  Would the code be accepted if contributed?

Like I said, I would be amenable to such a change. The other
developers haven't weighed in on this particular proposal, but I
suspect they will agree with me.

>> There is a
>> reasonable design rule that if you have a boolean argument which you
>> expect to only be passed literal Trues and Falses, you should instead
>> just have two different functions.
>
> Robert, can you list some reasons to favor this design rule?

nanmin(x) vs. min(x, nan=True)

A boolean argument that will almost always take literal Trues and
Falses basically is just a switch between different functionality. The
usual mechanism for the programmer to pick between different
functionality is to use the appropriate function.

The =True is extraneous, and puts important semantic information last
rather than at the front.

> Here are some reasons to favor richly functional routines:
>
> User's code is more readable because subtle differences affect args,
>    not functions

This isn't subtle.

> Easier learning for new users

You have no evidence of this.

> Much briefer and more readable docs

Briefer is possible. More readable is debatable. "Much" is overstating the case.

> Similar behavior across languages

This is not, has never been, and never will be a goal. Similar
behavior happens because of convergent design constraints and
occasionally laziness, never for it's own sake.

> Smaller number of functions in the core package (a recent list topic)

In general, this is a reasonable concern that must be traded off with
the other concerns. In this particular case, it has no weight.
nanmin() and nanmax() already exist.

> Many fewer routines to maintain, particularly if multiple switches exist

Again, in this case, neither of these are relevant. Yes, if there are
multiple boolean switches, it might make sense to keep them all into
the same function. Typically, these switches will also be affecting
the semantics only in minor details, too.

> Availability of the NaN functionality in a method of ndarray

Point, but see below.

> The last point is key.  The NaN behavior is central to analyzing real
> data containing unavoidable bad values, which is the bread and butter
> of a substantial fraction of the user base.  In the languages they're
> switching from, handling NaNs is just part of doing business, and is
> an option of every relevant routine; there's no need for redundant
> sets of routines.  In contrast, numpy appears to consider data
> analysis to be secondary, somehow, to pure math, and takes the NaN
> functionality out of routines like min() and std().  This means it's
> not possible to use many ndarray methods.  If we're ready to handle a
> NaN by returning it, why not enable the more useful behavior of
> ignoring it, at user discretion?

Let's get something straight. numpy has no opinion on the primacy of
data analysis tasks versus "pure math", however you want to define
those. Now, the numpy developers *do* tend to have an opinion on how
NaNs are used. NaNs were invented to handle invalid results of
*computations*. They were not invented as place markers for missing
data. They can frequently be used as such because the IEEE-754
semantics of NaNs sometimes works for missing data (e.g. in z=x+y, z
will have a NaN wherever either x or y have NaNs). But at least as
frequently, they don't, and other semantics need to be specifically
placed on top of it (e.g. nanmin()).

numpy is a general purpose computational tool that needs to apply to
many different fields and use cases. Consequently, when presented with
a choice like this, we tend to go for the path that makes the minimum
of assumptions and overlaid semantics.

Now to address the idea that all of the relevant ndarray methods
should take nan=True arguments. I am sympathetic to the idea that we
should have the functionality somewhere. I do doubt that the users you
are thinking about will be happy adding nan=True to a substantial
fraction of their calls. My experience with such APIs is that it gets
tedious real fast. Instead, I would suggest that if you want a wide
range of nan-skipping versions of functions that we have, let's put
them all as functions into a module. This gives the programmer the
possibility of using relatively clean calls.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco



More information about the NumPy-Discussion mailing list