[Numpy-discussion] What should be the result in some statistics corner cases?

Charles R Harris charlesr.harris at gmail.com
Mon Jul 15 13:29:43 EDT 2013


On Mon, Jul 15, 2013 at 9:55 AM, Sebastian Berg
<sebastian at sipsolutions.net> wrote:

> On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote:
> >
> >
> > On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg
> > <sebastian at sipsolutions.net> wrote:
> >         On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
> >         >
> >         >
> >         > On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
> >         > <charlesr.harris at gmail.com> wrote:
> >         >
> >
> >
> >         <snip>
> >
> >         >
> >         >                 For nansum, I would expect 0 even in the case of
> >         >                 all nans.  The point of these functions is to
> >         >                 simply ignore nans, correct?  So I would aim for
> >         >                 this behaviour:  nanfunc(x) behaves the same as
> >         >                 func(x[~isnan(x)])
> >         >
> >         >
> >         >         Agreed, although that changes current behavior. What
> >         >         about the other cases?
> >         >
> >         >
> >         >
> >         > Looks like there isn't much interest in the topic, so I'll just
> >         > go ahead with the following choices:
> >         >
> >         > Non-NaN case
> >         >
> >         > 1) Empty array -> ValueError
> >         >
> >         > The current behavior with stats is an accident, i.e., the nan
> >         > arises from 0/0. I like to think that in this case the result
> >         > is any number, rather than not a number, so *the* value is
> >         > simply not defined. So in this case raise a ValueError for
> >         > empty array.
> >         >
> >
> >         To be honest, I don't mind the current behaviour much: sum([]) = 0,
> >         len([]) = 0, so it is in a way well defined. At least I am not sure
> >         if I would always prefer an error. I am a bit worried that just
> >         changing it might break code out there, such as plotting code where
> >         it makes perfect sense to plot a NaN (i.e. nothing), but if that is
> >         the case it would probably be visible fast.
> >
> > I'm talking about mean, var, and std as statistics, sum isn't part of
> > that. If there is agreement that nansum of empty arrays/columns should
> > be zero I will do that. Note the sums of empty arrays may or may not
> > be empty.
> >
> > In [1]: ones((0, 3)).sum(axis=0)
> > Out[1]: array([ 0.,  0.,  0.])
> >
> > In [2]: ones((3, 0)).sum(axis=0)
> > Out[2]: array([], dtype=float64)
> >
> > Which, sort of, makes sense.
> >
> >
> I think we can agree that the behaviour for reductions with an identity
> should default to returning the identity, including for the nanfuncs,
> i.e. sum([]) is 0, product([]) is 1...
>
> Since mean = sum/length is a sensible definition, having 0/0 as a result
> doesn't seem too bad to me to be honest, it might be accidental but it is
> not a special case in the code ;). Though I don't mind an error as long
> as it doesn't break matplotlib or so.
>
> I agree that the nanfuncs raising an error would probably be more of a
> problem than for a usual ufunc, but I am still a bit hesitant about saying
> that it is ok too. I could imagine adding a very general "identity"
> argument (though I would not call it identity, because it is not the
> same as `np.add.identity`, just used in a place where that would be used
> otherwise):
>
> np.add.reduce([], identity=123) -> [123]
> np.add.reduce([1], identity=123) -> [1]
> np.nanmean([np.nan], identity=None) -> Error
> np.nanmean([np.nan], identity=np.nan) -> np.nan
>
> It doesn't really make sense, but:
> np.subtract.reduce([]) -> Error, since np.subtract.identity is None
> np.subtract.reduce([], identity=0) -> 0, suppressing the error.
>
> I am not sure if I am convinced myself, but especially for the nanfuncs
> it could maybe provide a way to circumvent the problem somewhat.
> Including functions such as np.nanargmin, whose result type does not
> even support NaN. Plus it gives an argument allowing for warnings about
> changing behaviour.
>
>
Let me try to summarize. To begin with, the environment of the nan
functions is rather special.

1) If the array is not of inexact type, they punt to the non-nan
versions.
2) If the array is of inexact type, then out and dtype must be inexact if
specified.

The second assumption guarantees that NaN can be used in the return values.
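
A minimal sketch of that dispatch rule, for illustration only. The helper
name nansum_sketch is hypothetical and this is not the actual NumPy
implementation; it only assumes that "inexact" means the dtype is a subtype
of np.inexact (floating or complex).

```python
import numpy as np

def nansum_sketch(a):
    """Hypothetical helper illustrating the dispatch rule, not NumPy's code."""
    a = np.asarray(a)
    # Integer, boolean, etc. arrays cannot hold NaN: punt to the plain sum.
    if not np.issubdtype(a.dtype, np.inexact):
        return a.sum()
    # Inexact arrays: replace NaNs by 0 before summing.
    return np.where(np.isnan(a), 0, a).sum()

int_result = nansum_sketch([1, 2, 3])              # integer path
float_result = nansum_sketch([1.0, np.nan, 2.0])   # inexact path
print(int_result, float_result)
```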

*sum and nansum*

These should be consistent so that empty sums are 0. This covers the
empty array case, but it changes the behaviour of nansum, which currently
returns NaN when the array isn't empty but the slice is empty after NaN
removal.
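
As a rough illustration of the proposed rule, here is a sketch based on the
definition suggested earlier in the thread, nanfunc(x) == func(x[~isnan(x)]).
The helper name nansum_by_masking is hypothetical and only shows the intended
result, not how nansum is implemented.

```python
import numpy as np

def nansum_by_masking(x):
    """Hypothetical helper: sum of x after dropping NaN entries."""
    x = np.asarray(x, dtype=float)
    return x[~np.isnan(x)].sum()

empty = np.array([], dtype=float)
all_nan = np.array([np.nan, np.nan])

empty_sum = np.sum(empty)                   # empty sum is the identity, 0
all_nan_sum = nansum_by_masking(all_nan)    # slice is empty after NaN removal
print(empty_sum, all_nan_sum)
```

Under this definition both cases agree: an empty slice sums to 0 whether it
was empty to begin with or became empty once the NaNs were removed.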

*mean and nanmean*

In the case of empty arrays, or an empty slice, this leads to 0/0. For
Python this is always a zero division error; for Numpy this raises a warning
and returns NaN for floats, 0 for integers.

Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
the special case where dtype=int, the NaN is cast to integer.
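
The difference between the two behaviours can be checked with a short
sketch. The variable names are made up for the example, and it assumes
NumPy's default floating-point error handling.

```python
import warnings
import numpy as np

# Plain Python: 0/0 is always an error.
try:
    0 / 0
    python_outcome = "no error"
except ZeroDivisionError:
    python_outcome = "ZeroDivisionError"

# NumPy: the mean of an empty float array is computed as 0/0.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_mean = np.mean(np.array([], dtype=float))

# True if at least one RuntimeWarning was recorded during the call.
warned = any(issubclass(w.category, RuntimeWarning) for w in caught)
print(python_outcome, empty_mean, warned)
```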

Option1
1) mean raise error on 0/0
2) nanmean no warning, return NaN

Option2
1) mean raise warning, return NaN (current behavior)
2) nanmean no warning, return NaN

Option3
1) mean raise warning, return NaN (current behavior)
2) nanmean raise warning, return NaN

*var, std, nanvar, nanstd*

1) If ddof > axis (axes) size, raise an error; this is probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good and some bad, so

Option1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
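
The two options for nanvar can be sketched as follows. The helper
nanvar_columns and its warn flag are hypothetical and exist only for this
illustration: warn=True corresponds to the first option, warn=False to the
second. It handles 2-D input column by column for simplicity.

```python
import warnings
import numpy as np

def nanvar_columns(a, ddof=0, warn=True):
    """Hypothetical helper: per-column variance ignoring NaNs."""
    a = np.asarray(a, dtype=float)
    out = np.empty(a.shape[1])
    for j in range(a.shape[1]):
        col = a[:, j]
        good = col[~np.isnan(col)]   # drop NaNs from this slice
        dof = good.size - ddof       # n - ddof for this slice
        if dof <= 0:
            if warn:
                warnings.warn("degrees of freedom <= 0 for slice",
                              RuntimeWarning)
            out[j] = np.nan          # bad slice: result is NaN
        else:
            out[j] = ((good - good.mean()) ** 2).sum() / dof
    return out

data = np.array([[1.0, np.nan],
                 [3.0, np.nan]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = nanvar_columns(data, ddof=1)             # warns for column 1
    n_warned = len(caught)
    quiet = nanvar_columns(data, ddof=1, warn=False)  # same values, no warning
    n_quiet = len(caught) - n_warned
print(result, n_warned, n_quiet)
```

In both cases the good column gets its variance and the all-NaN column gets
NaN; the options differ only in whether the caller is told about it.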

<snip>

Chuck