[Numpy-discussion] What should be the result in some statistics corner cases?

josef.pktd at gmail.com
Mon Jul 15 16:44:18 EDT 2013


On Mon, Jul 15, 2013 at 4:24 PM,  <josef.pktd at gmail.com> wrote:
> On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>>> Let me try to summarize. To begin with, the environment of the nan functions
>>> is rather special.
>>>
>>> 1) if the array is not of inexact type, they punt to the non-nan
>>> versions.
>>> 2) if the array is of inexact type, then out and dtype must be inexact if
>>> specified
>>>
>>> The second assumption guarantees that NaN can be used in the return values.
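>>>
>>> A rough sketch of that rule, not the actual implementation (the helper
>>> name here is made up):
>>>
>>> import numpy as np
>>>
>>> def _nan_reduction_sketch(a, func, out=None, dtype=None):
>>>     a = np.asanyarray(a)
>>>     if not np.issubdtype(a.dtype, np.inexact):
>>>         # 1) integer/bool input cannot hold NaN, punt to the plain version
>>>         return func(a, out=out, dtype=dtype)
>>>     # 2) inexact input: out/dtype must be inexact so NaN can be returned
>>>     for d in (dtype, None if out is None else out.dtype):
>>>         if d is not None and not np.issubdtype(d, np.inexact):
>>>             raise TypeError("out/dtype must be inexact for inexact input")
>>>     ...  # the actual NaN-ignoring reduction would follow here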
>>
>> The requirement on the 'out' dtype only exists because currently the
>> nan functions like to return nan for things like empty arrays, right?
>> If not for that, it could be relaxed? (it's a rather weird
>> requirement, since the whole point of these functions is that they
>> ignore nans, yet they don't always...)
>>
>>> sum and nansum
>>>
>>> These should be consistent so that empty sums are 0. This should cover the
>>> empty array case, but will change the behaviour of nansum which currently
>>> returns NaN if the array isn't empty but the slice is empty after NaN removal.
>>
>> I agree that returning 0 is the right behaviour, but we might need a
>> FutureWarning period.
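>>
>> A small interpreter sketch of the proposal (the all-NaN output below is the
>> current behaviour described above, not something to rely on across versions):
>>
>> >>> np.sum(np.array([]))                 # empty sum is 0
>> 0.0
>> >>> np.nansum(np.array([1.0, np.nan]))   # NaNs are ignored
>> 1.0
>> >>> np.nansum(np.array([np.nan]))        # all-NaN slice: NaN now, 0 proposed
>> nan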
>>
>>> mean and nanmean
>>>
>>> In the case of empty arrays or an empty slice, this leads to 0/0. For Python
>>> this is always a ZeroDivisionError; for Numpy this raises a warning and
>>> returns NaN for floats, 0 for integers.
>>>
>>> Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
>>> the special case where dtype=int, the NaN is cast to integer.
>>>
>>> Option1
>>> 1) mean raise error on 0/0
>>> 2) nanmean no warning, return NaN
>>>
>>> Option2
>>> 1) mean raise warning, return NaN (current behavior)
>>> 2) nanmean no warning, return NaN
>>>
>>> Option3
>>> 1) mean raise warning, return NaN (current behavior)
>>> 2) nanmean raise warning, return NaN
>>
>> I have mixed feelings about the whole np.seterr apparatus, but since
>> it exists, shouldn't we use it for consistency? I.e., just do whatever
>> numpy is set up to do with 0/0? (Which I think means, warn and return
>> NaN by default, but this can be changed.)
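>>
>> Concretely, something like the following (just a sketch; the warning and
>> exception text differ between versions):
>>
>> >>> np.array(0.) / np.array(0.)         # default errstate: warn, return nan
>> nan
>> >>> old = np.seterr(invalid='raise')    # ask numpy to raise on 0/0 instead
>> >>> np.array(0.) / np.array(0.)         # now raises FloatingPointError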
>>
>>> var, std, nanvar, nanstd
>>>
>>> 1) if ddof > axis(axes) size, raise error, probably a program bug.
>>> 2) If ddof=0, then whatever is the case for mean, nanmean
>>>
>>> For nanvar, nanstd it is possible that some slices are good, some bad, so
>>>
>>> option1
>>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
>>>
>>> option2
>>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
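>>>
>>> Roughly what option 1 could look like for a 2-d array reduced over the
>>> last axis (only a sketch to make the per-slice handling concrete, not
>>> proposed code):
>>>
>>> import warnings
>>> import numpy as np
>>>
>>> def nanvar_sketch(a, ddof=0):
>>>     a = np.asarray(a, dtype=float)
>>>     mask = np.isnan(a)
>>>     n = (~mask).sum(axis=1)                      # valid counts per slice
>>>     filled = np.where(mask, 0.0, a)
>>>     mean = filled.sum(axis=1) / np.maximum(n, 1)
>>>     sq = np.where(mask, 0.0, (a - mean[:, None]) ** 2).sum(axis=1)
>>>     with np.errstate(invalid='ignore', divide='ignore'):
>>>         var = sq / (n - ddof)
>>>     bad = n - ddof <= 0
>>>     if bad.any():
>>>         # option 2 would simply skip this warning
>>>         warnings.warn("n - ddof <= 0 for some slices", RuntimeWarning)
>>>         var[bad] = np.nan                        # bad slices -> NaN
>>>     return var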
>>
>> I don't really have any intuition for these ddof cases. Just raising
>> an error on negative effective dof is pretty defensible and might be
>> the safest -- it's easy to turn an error into something sensible
>> later if people come up with use cases...
>
> Related: why does reduceat not have empty slices?
>
>>>> np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
> array([ 6,  4, 11,  7,  7])
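>
> (As far as I can tell from the reduceat docs, when indices[i] >= indices[i+1]
> the i-th result is simply a[indices[i]], so the would-be empty slice [7:7]
> comes back as a[7] = 7 instead of an empty reduction:)
>
>>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])[3] == np.arange(8)[7]
> True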
>
>
> I'm in favor of returning nans instead of raising exceptions, except
> if the return type is int and we cannot cast nan to int.
>
> If we get functions into numpy that know how to handle nans, then it
> would be useful to get the nans, so we can work with them.
>
> Some cases where this might come in handy are when we iterate over
> slices of an array that define groups or category levels with possibly
> empty groups *)
>
>>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>>> x = np.arange(9)
>>>> [x[idx==ii].mean() for ii in range(4)]
> [1.5, 5.0, nan, 7.5]
>
> instead of
>>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
> [1.5, 5.0, 7.5]
>
> Same for var: I wouldn't have to check that the size is larger than
> the ddof (whatever that is in the specific case).
>
> *) groups could be empty because they were defined for a larger
> dataset or as a union of different datasets

background:

I wrote several robust anova versions a few weeks ago that were
essentially list comprehensions like the one above. However, I didn't allow
nans and didn't check for a minimum group size.
Allowing for empty groups to return nan would mainly be a convenience,
since I need to check the group size only once.
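
Roughly the pattern I have in mind (hypothetical helper name, only a sketch):
if empty groups simply came back as nan, one pass over the groups is enough
and the group sizes only need to be checked once at the end, e.g.

import numpy as np

def group_means(x, idx, ngroups):
    # one mean per group; empty groups would simply come back as nan
    means = np.array([x[idx == ii].mean() for ii in range(ngroups)])
    sizes = np.bincount(idx, minlength=ngroups)
    return means, sizes   # caller checks the sizes once, e.g. keep = sizes > 0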

ddof: tests for proportions have ddof=0, the regular t-test has ddof=1,
and tests of correlation have ddof=2, IIRC.
So we would need to check the corresponding minimum size, i.e. that
n - ddof > 0.

"negative effective dof" doesn't exist, that's np.maximum(n - ddof, 0)
which is always non-negative but might result in a zero-division
error. :)
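
Tiny numeric illustration: the smallest usable slice size is ddof + 1, and
clipping with np.maximum only trades a negative denominator for a zero one.

>>> n, ddof = 2, 2
>>> n - ddof                     # denominator would be 0
0
>>> np.maximum(n - ddof, 0)      # clipping doesn't help, still 0 -> 0/0
0
>>> n - ddof > 0                 # the check that is actually needed
False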

I don't think making anything conditional on ddof>0 is useful.

Josef

>
>
> PS: I used mean() above and not var() because
>
>>>> np.__version__
> '1.5.1'
>>>> np.mean([])
> nan
>>>> np.var([])
> 0.0
>
> Josef
>
>>
>> -n


