[Numpy-discussion] What should be the result in some statistics corner cases?

Mon Jul 15 14:55:04 EDT 2013

On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
> Let me try to summarize. To begin with, the environment of the nan functions
> is rather special.
>
> 1) if the array is not of inexact type, they punt to the non-nan
> versions.
> 2) if the array is of inexact type, then out and dtype must be inexact if
> specified
>
> The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the
nan functions like to return NaN for things like empty arrays, right?
If not for that, it could be relaxed? (it's a rather weird
requirement, since the whole point of these functions is that they
ignore nans, yet they don't always...)

> sum and nansum
>
> These should be consistent so that empty sums are 0. This should cover the
> empty array case, but will change the behaviour of nansum which currently
> returns NaN if the array isn't empty but the slice is after NaN removal.

I agree that returning 0 is the right behaviour, but we might need a
FutureWarning period.
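
For concreteness, here is a minimal sketch of the empty-sum convention
under discussion. It only relies on np.sum and np.nansum; the variable
names are made up for illustration. The np.nansum result on an all-NaN
input is printed rather than asserted, since that is exactly the
behaviour that would change.

```python
import numpy as np

# Python's built-in sum of an empty sequence is 0.
print(sum([]))

# np.sum of an empty float array is the additive identity, 0.0.
empty = np.array([], dtype=float)
empty_sum = np.sum(empty)
print(empty_sum)

# A slice that becomes empty once the NaNs are dropped. Under the
# proposal, nansum would return 0.0 here, matching the empty sum above.
all_nan = np.array([np.nan, np.nan])
after_removal = all_nan[~np.isnan(all_nan)]
manual_nansum = np.sum(after_removal)
print(manual_nansum)

# What the installed NumPy actually does (version dependent).
print(np.nansum(all_nan))
```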

> mean and nanmean
>
> In the case of empty arrays, an empty slice, this leads to 0/0. For Python
> this is always a zero division error, for Numpy this raises a warning
> and returns NaN for floats, 0 for integers.
>
> Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
> the special case where dtype=int, the NaN is cast to integer.
>
> Option1
> 1) mean raise error on 0/0
> 2) nanmean no warning, return NaN
>
> Option2
> 1) mean raise warning, return NaN (current behavior)
> 2) nanmean no warning, return NaN
>
> Option3
> 1) mean raise warning, return NaN (current behavior)
> 2) nanmean raise warning, return NaN

I have mixed feelings about the whole np.seterr apparatus, but since
it exists, shouldn't we use it for consistency? I.e., just do whatever
numpy is set up to do with 0/0? (Which I think means, warn and return
NaN by default, but this can be changed.)
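
To illustrate what "whatever numpy is set up to do" means, here is a
small sketch using np.errstate, the context-manager form of np.seterr.
It uses a plain array division as a stand-in for the 0/0 that shows up
inside mean; it makes no claim about how mean itself is wired up.

```python
import numpy as np

zero = np.array([0.0])

# With invalid='ignore', 0/0 quietly yields NaN.
with np.errstate(invalid='ignore'):
    quiet = zero / zero
print(quiet)

# With invalid='raise', the same operation raises FloatingPointError.
raised = False
try:
    with np.errstate(invalid='raise'):
        zero / zero
except FloatingPointError:
    raised = True
print(raised)
```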

> var, std, nanvar, nanstd
>
> 1) if ddof > axis(axes) size, raise error, probably a program bug.
> 2) If ddof=0, then whatever is the case for mean, nanmean
>
> For nanvar, nanstd it is possible that some slices are good, some bad, so
>
> option1
> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
>
> option2
> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising
an error on negative effective dof is pretty defensible and might be
the safest -- it's easy to turn an error into something sensible
later if people come up with use cases...
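
A rough sketch of the "just raise" approach, for the 1-d case only.
strict_var is a hypothetical helper, not an existing NumPy function,
and using the n - ddof <= 0 condition from the quoted options (so zero
effective dof is rejected too, not only negative) is an assumption.

```python
import numpy as np

def strict_var(a, ddof=0):
    """Hypothetical helper: variance that refuses n - ddof <= 0."""
    a = np.asarray(a, dtype=float)
    n = a.size
    if n - ddof <= 0:
        # Assumption: treat non-positive effective dof as a program bug.
        raise ValueError("n - ddof = %d is not positive" % (n - ddof))
    return np.var(a, ddof=ddof)

# Enough data points: behaves like np.var.
print(strict_var([1.0, 2.0, 3.0], ddof=1))

# Too few data points for the requested ddof: raises instead of NaN.
try:
    strict_var([1.0], ddof=1)
except ValueError as exc:
    print("error:", exc)
```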

-n