[Numpy-discussion] What should be the result in some statistics corner cases?

Sebastian Berg sebastian at sipsolutions.net
Mon Jul 15 11:55:44 EDT 2013


On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote:
> 
> 
> On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
>         On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
>         >
>         >
>         > On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
>         > <charlesr.harris at gmail.com> wrote:
>         >
>         
>         
>         <snip>
>         
>         >
>         >                 For nansum, I would expect 0 even in the
>         >                 case of all nans. The point of these
>         >                 functions is to simply ignore nans, correct?
>         >                 So I would aim for this behaviour:
>         >                 nanfunc(x) behaves the same as
>         >                 func(x[~isnan(x)])
>         >
>         >
>         >         Agreed, although that changes current behavior.
>         >         What about the other cases?
>         >
>         >
>         >
>         > Looks like there isn't much interest in the topic, so I'll
>         > just go ahead with the following choices:
>         >
>         > Non-NaN case
>         >
>         > 1) Empty array -> ValueError
>         >
>         > The current behavior with stats is an accident, i.e., the
>         > nan arises from 0/0. I like to think that in this case the
>         > result is any number, rather than not a number, so *the*
>         > value is simply not defined. So in this case raise a
>         > ValueError for an empty array.
>         >
>         
>         To be honest, I don't mind the current behaviour much:
>         sum([]) = 0 and len([]) = 0, so it is in a way well defined.
>         At least I am not sure I would always prefer an error. I am
>         a bit worried that just changing it might break code out
>         there, such as plotting code where it makes perfect sense to
>         plot a NaN (i.e. nothing), but if that is the case it would
>         probably become visible fast.
> 
> I'm talking about mean, var, and std as statistics; sum isn't part
> of that. If there is agreement that nansum of empty arrays/columns
> should be zero, I will do that. Note that the sums of empty arrays
> may or may not be empty.
> 
> In [1]: ones((0, 3)).sum(axis=0)
> Out[1]: array([ 0.,  0.,  0.])
> 
> In [2]: ones((3, 0)).sum(axis=0)
> Out[2]: array([], dtype=float64)
> 
> Which, sort of, makes sense.
>  
> 
I think we can agree that the behaviour for reductions with an identity
should default to returning the identity, including for the nanfuncs,
i.e. sum([]) is 0, product([]) is 1...
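
For the nanfuncs specifically, the proposal quoted above is that
nanfunc(x) behaves the same as func(x[~isnan(x)]). A quick sketch of
what the identity-default rule then implies (the nansum line shows the
proposed behaviour, not necessarily what released versions do):

import numpy as np

x = np.array([np.nan, np.nan])

np.sum(np.array([]))     # -> 0.0, the identity of addition
np.prod(np.array([]))    # -> 1.0, the identity of multiplication
np.sum(x[~np.isnan(x)])  # -> 0.0, all elements are ignored
np.nansum(x)             # proposed: 0.0, same as the line above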

Since mean = sum/length is a sensible definition, having 0/0 as a
result doesn't seem too bad to me, to be honest; it might be
accidental, but it is not a special case in the code ;). Though I
don't mind an error as long as it doesn't break matplotlib or the
like.
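
Concretely (a sketch of the current behaviour; the exact warning text
differs between versions):

import numpy as np

# mean([]) is effectively sum([]) / len([]) = 0.0 / 0:
np.mean(np.array([]))            # -> nan, plus a RuntimeWarning

# the same 0/0 shows up in the ddof cases discussed in this thread,
# e.g. n = 1 with ddof = 1 divides by n - ddof = 0:
np.var(np.array([1.0]), ddof=1)  # -> nan, plus a RuntimeWarning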

I agree that for the nanfuncs, raising an error would probably be more
of a problem than for a usual ufunc, but I am still a bit hesitant to
say that it is OK too. I could imagine adding a very general
"identity" argument (though I would not call it identity, because it
is not the same as `np.add.identity`; it would just be used in the
place where that would otherwise be used):

np.add.reduce([], identity=123) -> [123]
np.add.reduce([1], identity=123) -> [1]
np.nanmean([np.nan], identity=None) -> Error
np.nanmean([np.nan], identity=np.nan) -> np.nan

It doesn't really make sense, but:
np.subtract.reduce([]) -> Error, since np.subtract.identity is None
np.subtract.reduce([], identity=0) -> 0, suppressing the error.
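
A minimal sketch of how such an argument could behave, written as a
Python-level wrapper (the `identity` keyword and the
`reduce_with_identity` helper are hypothetical, not existing NumPy
API):

import numpy as np

_unspecified = object()  # sentinel meaning "no identity argument passed"

def reduce_with_identity(ufunc, arr, identity=_unspecified):
    # Fall back to the ufunc's built-in identity when none is given
    # explicitly (np.add.identity is 0, np.subtract.identity is None).
    if identity is _unspecified:
        identity = ufunc.identity
    arr = np.asarray(arr, dtype=float)
    if arr.size == 0:
        if identity is None:
            raise ValueError("zero-size reduction with no identity")
        return identity
    return ufunc.reduce(arr)

reduce_with_identity(np.add, [], identity=123)     # -> 123
reduce_with_identity(np.add, [1], identity=123)    # -> 1.0, identity unused
reduce_with_identity(np.subtract, [], identity=0)  # -> 0, suppressing the error
# reduce_with_identity(np.subtract, [])            # -> ValueError, identity is None

The nanfuncs could then route their empty-slice and all-NaN handling
through the same argument, which is where the identity=None -> Error
and identity=np.nan -> np.nan cases above would come from.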

I am not sure I am convinced myself, but especially for the nanfuncs
it could provide a way to circumvent the problem somewhat, including
for functions such as np.nanargmin, whose result type does not even
support NaN. Plus, it gives us an argument that allows warning about
a change in behaviour.
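
For np.nanargmin, the difficulty is that no valid index exists for an
all-NaN input and the integer result type cannot encode NaN, so the
only options are an error or some arbitrary fill value:

import numpy as np

a = np.array([np.nan, np.nan])
np.nanargmin(a)  # all-NaN input: no valid index exists and the integer
                 # result cannot hold NaN, so this raises ValueError
                 # (exact behaviour may differ between versions)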

- Sebastian

>         
>         > 2) ddof >= n -> ValueError
>         >
>         > If the number of elements, n, is not zero and ddof >= n,
>         > raise a ValueError for the ddof value.
>         >
>         
>         Makes sense to me, especially for ddof > n. Just returning
>         nan in all cases for backward compatibility would be fine
>         with me too.
>         
>         > NaN case
>         >
>         > 1) Empty array -> ValueError
>         > 2) Empty slice -> NaN
>         > 3) For slice ddof >= n -> NaN
>         >
>         
>         Personally, I would somewhat prefer it if 1) and 2) at least
>         defaulted to the same thing. But I don't use the nanfuncs
>         anyway. I was wondering about adding an option for the user
>         to pick what the fill is (e.g. if it is None (maybe the
>         default) -> ValueError). We could also allow this for normal
>         reductions without an identity, but I am not sure if it is
>         useful there.
>         
> 
> Chuck 