<br><br><div class="gmail_quote">On Mon, Jul 15, 2013 at 2:44 PM,  <span dir="ltr"><<a href="mailto:josef.pktd@gmail.com" target="_blank">josef.pktd@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Mon, Jul 15, 2013 at 4:24 PM,  <<a href="mailto:josef.pktd@gmail.com">josef.pktd@gmail.com</a>> wrote:<br>

> On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <<a href="mailto:njs@pobox.com">njs@pobox.com</a>> wrote:<br>

>> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris<br>

>> <<a href="mailto:charlesr.harris@gmail.com">charlesr.harris@gmail.com</a>> wrote:<br>

>>> Let me try to summarize. To begin with, the environment of the nan functions<br>

>>> is rather special.<br>

>>><br>

>>> 1) if the array is not of inexact type, they punt to the non-nan<br>

>>> versions.<br>

>>> 2) if the array is of inexact type, then out and dtype must be inexact if<br>

>>> specified<br>

>>><br>

>>> The second assumption guarantees that NaN can be used in the return values.<br>

>><br>

>> The requirement on the 'out' dtype only exists because currently the<br>

>> nan functions like to return nan for things like empty arrays, right?<br>

>> If not for that, it could be relaxed? (it's a rather weird<br>

>> requirement, since the whole point of these functions is that they<br>

>> ignore nans, yet they don't always...)<br>

>><br>

>>> sum and nansum<br>

>>><br>

>>> These should be consistent so that empty sums are 0. This should cover the<br>

>>> empty array case, but will change the behaviour of nansum which currently<br>

>>> returns NaN if the array isn't empty but the slice is after NaN removal.<br>

>><br>

>> I agree that returning 0 is the right behaviour, but we might need a<br>

>> FutureWarning period.<br>

>><br>

>>> mean and nanmean<br>

>>><br>

>>> In the case of empty arrays, an empty slice, this leads to 0/0. For Python<br>

>>> this is always a zero division error, for Numpy this raises a warning and<br>

>>> returns NaN for floats, 0 for integers.<br>

>>><br>

>>> Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In<br>

>>> the special case where dtype=int, the NaN is cast to integer.<br>

>>><br>

>>> Option1<br>

>>> 1) mean raise error on 0/0<br>

>>> 2) nanmean no warning, return NaN<br>

>>><br>

>>> Option2<br>

>>> 1) mean raise warning, return NaN (current behavior)<br>

>>> 2) nanmean no warning, return NaN<br>

>>><br>

>>> Option3<br>

>>> 1) mean raise warning, return NaN (current behavior)<br>

>>> 2) nanmean raise warning, return NaN<br>

>><br>

>> I have mixed feelings about the whole np.seterr apparatus, but since<br>

>> it exists, shouldn't we use it for consistency? I.e., just do whatever<br>

>> numpy is set up to do with 0/0? (Which I think means, warn and return<br>

>> NaN by default, but this can be changed.)<br>

>><br>

>>> var, std, nanvar, nanstd<br>

>>><br>

>>> 1) if ddof > axis(axes) size, raise error, probably a program bug.<br>

>>> 2) If ddof=0, then whatever is the case for mean, nanmean<br>

>>><br>

>>> For nanvar, nanstd it is possible that some slices are good, some bad, so<br>

>>><br>

>>> option1<br>

>>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice<br>

>>><br>

>>> option2<br>

>>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice<br>

>><br>

>> I don't really have any intuition for these ddof cases. Just raising<br>

>> an error on negative effective dof is pretty defensible and might be<br>

>> the safest -- it's easy to turn an error into something sensible<br>

>> later if people come up with use cases...<br>

><br>

> related: why does reduceat not have empty slices?<br>

><br>

>>>> np.add.reduceat(np.arange(8),[0,4, 5, 7,7])<br>

> array([ 6,  4, 11,  7,  7])<br>

><br>

><br>

> I'm in favor of returning nans instead of raising exceptions, except<br>

> if the return type is int and we cannot cast nan to int.<br>

><br>

> If we get functions into numpy that know how to handle nans, then it<br>

> would be useful to get the nans, so we can work with them<br>

><br>

> Some cases where this might come in handy are when we iterate over<br>

> slices of an array that define groups or category levels with possible<br>

> empty groups *)<br>

><br>

>>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])<br>

>>>> x = np.arange(9)<br>

>>>> [x[idx==ii].mean() for ii in range(4)]<br>

> [1.5, 5.0, nan, 7.5]<br>

><br>

> instead of<br>

>>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]<br>

> [1.5, 5.0, 7.5]<br>

><br>

> same for var, I wouldn't have to check that the size is larger than<br>

> the ddof (whatever that is in the specific case)<br>

><br>

> *) groups could be empty because they were defined for a larger<br>

> dataset or as a union of different datasets<br>

<br>

</div></div>background:<br>

<br>

I wrote several robust anova versions a few weeks ago, that were<br>

essentially list comprehensions as above. However, I didn't allow nans<br>

and didn't check for minimum size.<br>

Allowing for empty groups to return nan would mainly be a convenience,<br>

since I need to check the group size only once.<br>
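A minimal sketch of that convenience, reusing the toy idx and x from the quoted example. It relies on mean returning NaN for an empty slice (the current behaviour) and only silences the RuntimeWarning; the variable names are illustrative.<br>

```python
import warnings
import numpy as np

# Same toy data as the quoted example: group 2 has no members.
idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
x = np.arange(9)

with warnings.catch_warnings():
    # mean() of an empty slice warns and returns nan; silence the warning
    # so the empty group simply shows up as nan in the result.
    warnings.simplefilter("ignore", RuntimeWarning)
    means = [x[idx == g].mean() for g in range(4)]

print(means)
```

Empty groups can then be filtered in one place afterwards, for example with np.isnan, rather than being checked inside every comprehension.<br>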

<br>

ddof: tests for proportions have ddof=0, for regular t-test ddof=1,<br>

for tests of correlation ddof=2   IIRC<br>

so we would need to check for the corresponding minimum size that n-ddof>0<br>

<br>

"negative effective dof" doesn't exist, that's np.maximum(n - ddof, 0)<br>

which is always non-negative but might result in a zero-division<br>

error. :)<br>

<br>

I don't think making anything conditional on ddof>0 is useful.<br>
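To make the np.maximum(n - ddof, 0) point concrete, here is a rough sketch with made-up per-slice counts and sums of squares (both hypothetical, chosen only for illustration). The divisor is clipped at zero, so the only failure mode is a division by zero.<br>

```python
import numpy as np

# Hypothetical per-slice sample counts and sums of squared deviations.
n = np.array([5, 2, 1, 0])
ss = np.array([10.0, 2.0, 0.0, 0.0])

variances = {}
for ddof in (0, 1, 2):
    # Effective degrees of freedom are clipped, so they are never negative.
    dof = np.maximum(n - ddof, 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        # A zero divisor gives nan (0/0) or inf (nonzero/0), not an exception.
        variances[ddof] = ss / dof
    print(ddof, dof, variances[ddof])
```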

<span class="HOEnZb"><font color="#888888"><br></font></span></blockquote><div><br>So how would you want it?<br><br>To summarize the problem areas:<br><br>1) What is the sum of an empty slice? NaN or 0?<br>2) What is the mean of an empty slice? NaN, NaN and warn, or error?<br>

3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?<br>4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?<br><br>I'm leaning toward NaN and warn for 2 -- 3, because, as Nathaniel notes, the warning can be turned into an error by the user. The errstate context manager would be good for that.<br>
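As a rough illustration of the errstate idea, the sketch below uses a plain 0/0 array division as a stand-in for the empty-slice case. Whether the nan functions end up routing through this machinery is exactly what is under discussion, so treat it as an assumption.<br>

```python
import numpy as np

zero = np.array([0.0])

# Caller opts out of the warning and just takes the nan.
with np.errstate(invalid="ignore"):
    quiet = zero / zero

# Caller opts in to a hard failure for the same operation.
raised = False
try:
    with np.errstate(invalid="raise"):
        zero / zero
except FloatingPointError:
    raised = True

print(quiet, raised)
```

The default (warn and return NaN) stays the same for everyone else; only the code inside the with block sees the stricter or quieter setting.<br>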

<br>Chuck<br></div><br></div>