[Numpy-discussion] What should be the result in some statistics corner cases?

Mon Jul 15 18:03:01 EDT 2013

On Mon, Jul 15, 2013 at 3:57 PM, <josef.pktd at gmail.com> wrote:

> On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Mon, Jul 15, 2013 at 2:44 PM, <josef.pktd at gmail.com> wrote:
> >>
> >> On Mon, Jul 15, 2013 at 4:24 PM,  <josef.pktd at gmail.com> wrote:
> >> > On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> >> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
> >> >> <charlesr.harris at gmail.com> wrote:
> >> >>> Let me try to summarize. To begin with, the environment of the nan
> >> >>> functions is rather special.
> >> >>>
> >> >>> 1) if the array is not of inexact type, they punt to the non-nan
> >> >>> versions.
> >> >>> 2) if the array is of inexact type, then out and dtype must be
> >> >>> inexact if specified
> >> >>>
> >> >>> The second assumption guarantees that NaN can be used in the return
> >> >>> values.
> >> >>
> >> >> The requirement on the 'out' dtype only exists because currently the
> >> >> nan functions like to return nan for things like empty arrays, right?
> >> >> If not for that, it could be relaxed? (it's a rather weird
> >> >> requirement, since the whole point of these functions is that they
> >> >> ignore nans, yet they don't always...)
> >> >>
> >> >>> sum and nansum
> >> >>>
> >> >>> These should be consistent so that empty sums are 0. This should
> >> >>> cover the empty array case, but will change the behaviour of nansum,
> >> >>> which currently returns NaN if the array isn't empty but the slice
> >> >>> is empty after NaN removal.
> >> >>
> >> >> I agree that returning 0 is the right behaviour, but we might need a
> >> >> FutureWarning period.
> >> >>
> >> >>> mean and nanmean
> >> >>>
> >> >>> In the case of empty arrays, an empty slice, this leads to 0/0. For
> >> >>> Python this is always a zero division error, for Numpy this raises a
> >> >>> warning and returns NaN for floats, 0 for integers.
> >> >>>
> >> >>> Currently mean returns NaN and raises a RuntimeWarning when 0/0
> >> >>> occurs. In the special case where dtype=int, the NaN is cast to
> >> >>> integer.
> >> >>>
> >> >>> Option1
> >> >>> 1) mean raise error on 0/0
> >> >>> 2) nanmean no warning, return NaN
> >> >>>
> >> >>> Option2
> >> >>> 1) mean raise warning, return NaN (current behavior)
> >> >>> 2) nanmean no warning, return NaN
> >> >>>
> >> >>> Option3
> >> >>> 1) mean raise warning, return NaN (current behavior)
> >> >>> 2) nanmean raise warning, return NaN
> >> >>
> >> >> I have mixed feelings about the whole np.seterr apparatus, but since
> >> >> it exists, shouldn't we use it for consistency? I.e., just do
> >> >> whatever numpy is set up to do with 0/0? (Which I think means, warn
> >> >> and return NaN by default, but this can be changed.)
> >> >>
> >> >>> var, std, nanvar, nanstd
> >> >>>
> >> >>> 1) if ddof > axis(axes) size, raise error, probably a program bug.
> >> >>> 2) If ddof=0, then whatever is the case for mean, nanmean
> >> >>>
> >> >>> For nanvar, nanstd it is possible that some slices are good, some
> >> >>> bad, so
> >> >>>
> >> >>> option1
> >> >>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
> >> >>>
> >> >>> option2
> >> >>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
> >> >>
> >> >> I don't really have any intuition for these ddof cases. Just raising
> >> >> an error on negative effective dof is pretty defensible and might be
> >> >> the safest -- it's easy to turn an error into something sensible
> >> >> later if people come up with use cases...
> >> >
> >> > Related: why does reduceat not have empty slices?
> >> >
> >> > >>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
> >> > array([ 6,  4, 11,  7,  7])
> >> >
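
As far as I can tell from the ufunc.reduceat documentation, it never reduces
an empty range: when indices[i] >= indices[i+1] the result is simply
array[indices[i]], and the last index runs to the end of the array. A small
sketch that reproduces the output above by hand (the loop and the name
`expected` are only for illustration, not how numpy implements it):

```python
import numpy as np

a = np.arange(8)
indices = [0, 4, 5, 7, 7]

result = np.add.reduceat(a, indices)

# Hand-rolled equivalent, for illustration only.
expected = []
for i, start in enumerate(indices):
    # The last segment runs to the end of the array.
    stop = indices[i + 1] if i + 1 < len(indices) else len(a)
    if start < stop:
        expected.append(int(a[start:stop].sum()))
    else:
        # Non-increasing index pair: no empty reduction, just a[start].
        expected.append(int(a[start]))

print(result.tolist())  # [6, 4, 11, 7, 7]
print(expected)         # [6, 4, 11, 7, 7]
```

So the repeated 7 gives a[7] twice rather than an empty sum of 0.
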
> >> >
> >> > I'm in favor of returning nans instead of raising exceptions, except
> >> > if the return type is int and we cannot cast nan to int.
> >> >
> >> > If we get functions into numpy that know how to handle nans, then it
> >> > would be useful to get the nans, so we can work with them
> >> >
> >> > Some cases where this might come in handy are when we iterate over
> >> > slices of an array that define groups or category levels with possible
> >> > empty groups *)
> >> >
> >> > >>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
> >> > >>> x = np.arange(9)
> >> > >>> [x[idx == ii].mean() for ii in range(4)]
> >> > [1.5, 5.0, nan, 7.5]
> >> >
> >> > instead of
> >> > >>> [x[idx == ii].mean() for ii in range(4) if (idx == ii).sum() > 0]
> >> > [1.5, 5.0, 7.5]
> >> >
> >> > same for var, I wouldn't have to check that the size is larger than
> >> > the ddof (whatever that is in the specific case)
> >> >
> >> > *) groups could be empty because they were defined for a larger
> >> > dataset or as a union of different datasets
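
For what it's worth, the group-mean example can be run as a script along
these lines. This is only a sketch: silencing the warning with the warnings
module and the np.bincount variant are my additions, not part of the proposal,
and the names `loop_means` and `vec_means` are made up.

```python
import warnings
import numpy as np

idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
x = np.arange(9)

# Loop version: silence the RuntimeWarning from the empty group (ii == 2).
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    loop_means = [x[idx == ii].mean() for ii in range(4)]

# Vectorized version: per-group sums and counts, then 0/0 gives nan.
sums = np.bincount(idx, weights=x, minlength=4)
counts = np.bincount(idx, minlength=4)
with np.errstate(invalid="ignore"):
    vec_means = sums / counts

print(loop_means)
print(vec_means)
```

Both versions keep one entry per group, with nan marking the empty group, so
the result stays aligned with the group labels.
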
> >>
> >> background:
> >>
> >> I wrote several robust anova versions a few weeks ago, that were
> >> essentially list comprehension as above. However, I didn't allow nans
> >> and didn't check for minimum size.
> >> Allowing for empty groups to return nan would mainly be a convenience,
> >> since I need to check the group size only once.
> >>
> >> ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
> >> for tests of correlation ddof=2   IIRC
> >> so we would need to check for the corresponding minimum size that
> >> n-ddof>0
> >>
> >> "negative effective dof" doesn't exist, that's np.maximum(n - ddof, 0)
> >> which is always non-negative but might result in a zero-division
> >> error. :)
> >>
> >> I don't think making anything conditional on ddof>0 is useful.
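
To make the n - ddof arithmetic concrete, here is a minimal sketch with
made-up group sizes and sums of squared deviations. It is plain array
arithmetic and not a claim about what var or nanvar do internally.

```python
import numpy as np

ddof = 2
n = np.array([1, 2, 5])            # group sizes (made up)
ss = np.array([0.0, 0.5, 10.0])    # sums of squared deviations (made up)

# Clipped degrees of freedom: never negative, but possibly zero.
dof = np.maximum(n - ddof, 0)

# Dividing by a zero dof gives nan for 0/0 and inf for nonzero/0.
with np.errstate(divide="ignore", invalid="ignore"):
    var = ss / dof

print(dof.tolist())  # [0, 0, 3]
print(var)
```

With the clipping, n - ddof < 0 and n - ddof = 0 both end up as a division by
zero, which is why cases 3) and 4) below collapse into one.
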
> >>
> >
> > So how would you want it?
> >
> > To summarize the problem areas:
> >
> > 1) What is the sum of an empty slice? NaN or 0?
> 0 as it is now for sum, (including 0 for nansum with no valid entries).
>
> > 2) What is mean of empty slice? NaN, NaN and warn, or error?
> > 3) What if n - ddof < 0 for slice? NaN, NaN and warn, or error?
> > 4) What if n - ddof = 0 for slice? NaN, NaN and warn, or error?
> >
> > I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the
> > warning can be turned into an error by the user. The errstate context
> > manager would be good for that.
>
> Yes, that's what I would prefer also, NaN and ZeroDivisionError, for
> 2-4, including mean, var and std, for both nan and non-nan functions.
>
> with the extra argument that 3) and 4) are the same case   (except in
> polyfit :)
>

One extra possibility with the nan functions could be a new keyword, error,
which would turn warnings into errors. But that might be a bit much.
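
Without a new keyword, something along these lines should already work. This
is a minimal sketch, assuming a NumPy version in which the mean of an empty
array emits a RuntimeWarning; the flag names `errstate_raised` and
`warning_raised` are just for illustration.

```python
import warnings
import numpy as np

zero = np.array([0.0])

# 1) np.errstate: make the floating-point "invalid" condition (0/0) raise.
try:
    with np.errstate(invalid="raise"):
        zero / zero
    errstate_raised = False
except FloatingPointError:
    errstate_raised = True

# 2) warnings module: escalate RuntimeWarning to an exception.
try:
    with warnings.catch_warnings():
        warnings.simplefilter("error", RuntimeWarning)
        np.mean(np.array([]))
    warning_raised = False
except RuntimeWarning:
    warning_raised = True

print(errstate_raised, warning_raised)
```

Both context managers restore the previous settings on exit, so the stricter
behaviour stays local to the block.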

Chuck