[Numpy-discussion] What should be the result in some statistics corner cases?

Mon Jul 15 16:24:52 EDT 2013

On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>> Let me try to summarize. To begin with, the environment of the nan functions
>> is rather special.
>>
>> 1) if the array is not of inexact type, they punt to the non-nan
>> versions.
>> 2) if the array is of inexact type, then out and dtype must be inexact if
>> specified
>>
>> The second assumption guarantees that NaN can be used in the return values.
>
> The requirement on the 'out' dtype only exists because currently the
> nan functions like to return nan for things like empty arrays, right?
> If not for that, it could be relaxed? (it's a rather weird
> requirement, since the whole point of these functions is that they
> ignore nans, yet they don't always...)
>
>> sum and nansum
>>
>> These should be consistent so that empty sums are 0. This should cover the
>> empty array case, but will change the behaviour of nansum which currently
>> returns NaN if the array isn't empty but the slice is after NaN removal.
>
> I agree that returning 0 is the right behaviour, but we might need a
> FutureWarning period.
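
A minimal sketch of the "empty sums are 0" rule discussed above, assuming
NaNs are simply replaced by 0 before summing. The helper name nansum_zero
is hypothetical and is not the NumPy implementation.

```python
import numpy as np

def nansum_zero(a, axis=None):
    """Hypothetical helper: treat NaN as 0, so an all-NaN slice sums to 0."""
    a = np.asarray(a, dtype=float)
    # Replace NaN entries by 0.0, then do an ordinary sum.
    return np.where(np.isnan(a), 0.0, a).sum(axis=axis)

data = np.array([[1.0, np.nan], [np.nan, np.nan]])
row_sums = nansum_zero(data, axis=1)  # second row is empty after NaN removal
empty_sum = nansum_zero([])           # empty input array
```

With this rule the all-NaN row and the empty array both sum to 0, the same
as an empty sum.
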
>
>> mean and nanmean
>>
>> In the case of empty arrays, an empty slice, this leads to 0/0. For Python
>> this is always a zero division error, for Numpy this raises a warning and
>> returns NaN for floats, 0 for integers.
>>
>> Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
>> the special case where dtype=int, the NaN is cast to integer.
>>
>> Option1
>> 1) mean raise error on 0/0
>> 2) nanmean no warning, return NaN
>>
>> Option2
>> 1) mean raise warning, return NaN (current behavior)
>> 2) nanmean no warning, return NaN
>>
>> Option3
>> 1) mean raise warning, return NaN (current behavior)
>> 2) nanmean raise warning, return NaN
>
> I have mixed feelings about the whole np.seterr apparatus, but since
> it exists, shouldn't we use it for consistency? I.e., just do whatever
> numpy is set up to do with 0/0? (Which I think means, warn and return
> NaN by default, but this can be changed.)
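
As a rough illustration of what "whatever numpy is set up to do with 0/0"
means, the sketch below uses np.errstate (the context-manager form of
np.seterr) on a plain float array division. It only shows the 0/0 handling
itself and makes no claim about how mean or nanmean are wired to it.

```python
import numpy as np

zeros = np.array([0.0])

# Silence the "invalid value" floating-point error: 0/0 quietly gives NaN.
with np.errstate(invalid="ignore"):
    quiet = zeros / zeros

# Ask NumPy to raise instead: 0/0 becomes a FloatingPointError.
try:
    with np.errstate(invalid="raise"):
        zeros / zeros
    raised = False
except FloatingPointError:
    raised = True
```
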
>
>> var, std, nanvar, nanstd
>>
>> 1) if ddof > axis(axes) size, raise error, probably a program bug.
>> 2) If ddof=0, then whatever is the case for mean, nanmean
>>
>> For nanvar, nanstd it is possible that some slices are good, some bad, so
>>
>> option1
>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
>>
>> option2
>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
>
> I don't really have any intuition for these ddof cases. Just raising
> an error on negative effective dof is pretty defensible and might be
> the safest -- it's easy to turn an error into something sensible
> later if people come up with use cases...
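
To make the per-slice options above concrete, here is a sketch of a
row-wise NaN-ignoring variance that returns NaN for every row where
n - ddof <= 0 and does not warn (option2). The function nanvar_rows is a
hypothetical illustration for 2-D input, not the proposed NumPy code.

```python
import numpy as np

def nanvar_rows(a, ddof=0):
    """Hypothetical sketch: variance of each row ignoring NaN,
    NaN where the row has n - ddof <= 0 valid entries."""
    a = np.asarray(a, dtype=float)
    mask = ~np.isnan(a)
    n = mask.sum(axis=1)                 # valid entries per row
    filled = np.where(mask, a, 0.0)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = filled.sum(axis=1) / n
        dev = np.where(mask, a - mean[:, None], 0.0)
        var = (dev ** 2).sum(axis=1) / (n - ddof)
    var[n - ddof <= 0] = np.nan          # bad slices become NaN
    return var

x = np.array([[1.0, 2.0, 3.0],
              [np.nan, np.nan, 4.0],
              [np.nan, np.nan, np.nan]])
result = nanvar_rows(x, ddof=1)          # only the first row is a good slice
```
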

Related: why does reduceat not have empty slices?

>>> import numpy as np
>>> np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
array([ 6,  4, 11,  7,  7])
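
For comparison, a sketch of what the result would look like if the
repeated index 7 were treated as an empty slice with sum 0. reduceat is
documented to return a[indices[i]] when indices[i] >= indices[i+1], which
is where the first 7 in the output above comes from. The helper
segment_sums is hypothetical and just uses plain Python slicing.

```python
import numpy as np

def segment_sums(a, starts):
    """Hypothetical helper: sum a[start:next_start], empty slices give 0."""
    a = np.asarray(a)
    stops = list(starts[1:]) + [len(a)]
    return np.array([a[i:j].sum() for i, j in zip(starts, stops)])

a = np.arange(8)
starts = [0, 4, 5, 7, 7]
with_reduceat = np.add.reduceat(a, starts)  # a[7:7] is replaced by a[7]
with_slices = segment_sums(a, starts)       # a[7:7] is an empty sum
```
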

I'm in favor of returning nans instead of raising exceptions, except
if the return type is int and we cannot cast nan to int.

If we get functions into numpy that know how to handle nans, then it
would be useful to get the nans, so we can work with them.

Some cases where this might come in handy are when we iterate over
slices of an array that define groups or category levels with possible
empty groups *)

>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>> x = np.arange(9)
>>> [x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of
>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
[1.5, 5.0, 7.5]

same for var, I wouldn't have to check that the size is larger than
the ddof (whatever that is in the specific case)

*) groups could be empty because they were defined for a larger
dataset or as a union of different datasets

PS: I used mean() above and not var() because

>>> np.__version__
'1.5.1'
>>> np.mean([])
nan
>>> np.var([])
0.0

Josef

>
> -n