What should be the result in some statistics corner cases?

Some corner cases in the mean, var, std.

*Empty arrays*

I think these cases should either raise an error or just return nan. Warnings seem ineffective to me as they are only issued once by default.

In [3]: ones(0).mean()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[3]: nan

In [4]: ones(0).var()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[4]: nan

In [5]: ones(0).std()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[5]: nan

*ddof >= number of elements*

I think these should just raise errors. The results for ddof >= #elements are happenstance, and certainly negative numbers should never be returned.

In [6]: ones(2).var(ddof=2)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[6]: nan

In [7]: ones(2).var(ddof=3)
Out[7]: -0.0

*nansum*

Currently returns nan for empty arrays. I suspect it should return nan for slices that are all nan, but 0 for empty slices. That would make it consistent with sum in the empty case.

Chuck

On 7/14/13, Charles R Harris <charlesr.harris@gmail.com> wrote:
Some corner cases in the mean, var, std.
*Empty arrays*
I think these cases should either raise an error or just return nan. Warnings seem ineffective to me as they are only issued once by default.
In [3]: ones(0).mean()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[3]: nan

In [4]: ones(0).var()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[4]: nan

In [5]: ones(0).std()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[5]: nan
*ddof >= number of elements*
I think these should just raise errors. The results for ddof >= #elements are happenstance, and certainly negative numbers should never be returned.
In [6]: ones(2).var(ddof=2)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[6]: nan

In [7]: ones(2).var(ddof=3)
Out[7]: -0.0

*nansum*
Currently returns nan for empty arrays. I suspect it should return nan for slices that are all nan, but 0 for empty slices. That would make it consistent with sum in the empty case.
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])

Warren
Chuck
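For concreteness, here is a minimal sketch of Warren's proposed rule (the helper name is hypothetical, not a NumPy function):

    import numpy as np

    def nan_equivalent(func, x):
        # Proposed semantics: a nanfunc should act like the plain
        # function applied to the array with the NaNs removed.
        x = np.asarray(x, dtype=float)
        return func(x[~np.isnan(x)])

    x = np.array([1.0, np.nan, 3.0])
    nan_equivalent(np.sum, x)    # 4.0, like nansum
    nan_equivalent(np.mean, x)   # 2.0
    # An all-NaN input reduces to the mean of an empty array, which is
    # exactly the corner case under discussion:
    nan_equivalent(np.mean, np.array([np.nan]))  # nan, with a RuntimeWarning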

On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:
On 7/14/13, Charles R Harris <charlesr.harris@gmail.com> wrote:
Some corner cases in the mean, var, std.
*Empty arrays*
I think these cases should either raise an error or just return nan. Warnings seem ineffective to me as they are only issued once by default.
In [3]: ones(0).mean()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[3]: nan

In [4]: ones(0).var()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[4]: nan

In [5]: ones(0).std()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[5]: nan
*ddof >= number of elements*
I think these should just raise errors. The results for ddof >= #elements are happenstance, and certainly negative numbers should never be returned.
In [6]: ones(2).var(ddof=2)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[6]: nan

In [7]: ones(2).var(ddof=3)
Out[7]: -0.0

*nansum*
Currently returns nan for empty arrays. I suspect it should return nan for slices that are all nan, but 0 for empty slices. That would make it consistent with sum in the empty case.
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
Agreed, although that changes current behavior. What about the other cases?

Chuck

On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:
On 7/14/13, Charles R Harris <charlesr.harris@gmail.com> wrote:
Some corner cases in the mean, var, std.
*Empty arrays*
I think these cases should either raise an error or just return nan. Warnings seem ineffective to me as they are only issued once by default.
In [3]: ones(0).mean()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[3]: nan

In [4]: ones(0).var()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[4]: nan

In [5]: ones(0).std()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[5]: nan
*ddof >= number of elements*
I think these should just raise errors. The results for ddof >= #elements are happenstance, and certainly negative numbers should never be returned.
In [6]: ones(2).var(ddof=2)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[6]: nan

In [7]: ones(2).var(ddof=3)
Out[7]: -0.0

*nansum*
Currently returns nan for empty arrays. I suspect it should return nan for slices that are all nan, but 0 for empty slices. That would make it consistent with sum in the empty case.
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
Agreed, although that changes current behavior. What about the other cases?
Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:

Non-NaN case

1) Empty array -> ValueError

The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for an empty array.

2) ddof >= n -> ValueError

If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.

NaN case

1) Empty array -> ValueError
2) Empty slice -> NaN
3) For slice ddof >= n -> NaN

Chuck
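For concreteness, a rough sketch of the non-NaN rules proposed above (illustrative only; this is not the actual _methods.py change):

    import numpy as np

    def var_with_checks(a, ddof=0):
        # Proposed rules: an empty array and ddof >= n are treated as
        # caller errors rather than warn-and-return-nan.
        a = np.asarray(a)
        n = a.size
        if n == 0:
            raise ValueError("var of an empty array is undefined")
        if ddof >= n:
            raise ValueError("ddof must be less than the number of elements")
        return a.var(ddof=ddof)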

This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero?

Ben Root

On Mon, Jul 15, 2013 at 9:52 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris < charlesr.harris@gmail.com> wrote:
On Sun, Jul 14, 2013 at 2:55 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:
On 7/14/13, Charles R Harris <charlesr.harris@gmail.com> wrote:
Some corner cases in the mean, var, std.
*Empty arrays*
I think these cases should either raise an error or just return nan. Warnings seem ineffective to me as they are only issued once by default.
In [3]: ones(0).mean()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:61: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[3]: nan

In [4]: ones(0).var()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[4]: nan

In [5]: ones(0).std()
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:76: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[5]: nan
*ddof >= number of elements*
I think these should just raise errors. The results for ddof >= #elements are happenstance, and certainly negative numbers should never be returned.
In [6]: ones(2).var(ddof=2)
/home/charris/.local/lib/python2.7/site-packages/numpy/core/_methods.py:100: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
Out[6]: nan

In [7]: ones(2).var(ddof=3)
Out[7]: -0.0

*nansum*
Currently returns nan for empty arrays. I suspect it should return nan for slices that are all nan, but 0 for empty slices. That would make it consistent with sum in the empty case.
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
Agreed, although that changes current behavior. What about the other cases?
Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:
Non-NaN case
1) Empty array -> ValueError
The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.
2) ddof >= n -> ValueError
If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.
NaN case

1) Empty array -> ValueError
2) Empty slice -> NaN
3) For slice ddof >= n -> NaN
Chuck

On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root <ben.root@ou.edu> wrote:
This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero?
I was going to leave nansum as is, as it seems that the result was by choice rather than by accident.

Tests, not doctests. I detest doctests ;) Examples, OTOH...

Chuck

On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote:
On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root <ben.root@ou.edu> wrote:
This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero?
I was going to leave nansum as is, as it seems that the result was by choice rather than by accident.
That makes sense--I like Sebastian's explanation whereby operations that define an identity yield that identity upon empty input.

Stéfan
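For illustration, the identity rule as it already works for reductions (current NumPy behaviour):

    import numpy as np

    np.sum([])            # 0.0, the identity of np.add
    np.prod([])           # 1.0, the identity of np.multiply
    np.add.identity       # 0
    np.multiply.identity  # 1
    # A reduction whose ufunc has no identity refuses empty input instead:
    # np.minimum.reduce([]) raises ValueError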

On Mon, Jul 15, 2013 at 6:22 PM, Stéfan van der Walt <stefan@sun.ac.za>wrote:
On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote:
On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root <ben.root@ou.edu> wrote:
This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero?
I was going to leave nansum as is, as it seems that the result was by choice rather than by accident.
That makes sense--I like Sebastian's explanation whereby operations that define an identity yield that identity upon empty input.
So nansum should return zeros rather than the current NaNs?

Chuck

To add a bit of context to the question of nansum on empty results: we currently differ from MATLAB and R in this respect, they return zero no matter what. Personally, I think it should return zero, but our current behavior of returning nans has existed for a long time. I think we need a deprecation warning and possibly wait to change this until 2.0, with plenty of warning that this will change.

Ben Root

On Jul 15, 2013 8:46 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
On Mon, Jul 15, 2013 at 6:22 PM, Stéfan van der Walt <stefan@sun.ac.za>wrote:
On Mon, 15 Jul 2013 08:33:47 -0600, Charles R Harris wrote:
On Mon, Jul 15, 2013 at 8:25 AM, Benjamin Root <ben.root@ou.edu> wrote:
This is going to need to be heavily documented with doctests. Also, just to clarify, are we talking about a ValueError for doing a nansum on an empty array as well, or will that now return a zero?
I was going to leave nansum as is, as it seems that the result was by choice rather than by accident.
That makes sense--I like Sebastian's explanation whereby operations that define an identity yield that identity upon empty input.
So nansum should return zeros rather than the current NaNs?
Chuck

On Mon, Jul 15, 2013 at 6:58 PM, Benjamin Root <ben.root@ou.edu> wrote:
To add a bit of context to the question of nansum on empty results, we currently differ from MATLAB and R in this respect, they return zero no matter what. Personally, I think it should return zero, but our current behavior of returning nans has existed for a long time.
Personally, I think we need a deprecation warning and possibly wait to change this until 2.0, with plenty of warning that this will change.
Waiting for the mythical 2.0 probably won't work ;) We also need to give folks a way to adjust ahead of time. I think the easiest way to do that is with an extra keyword, say nanok, with True as the starting default; later we can make False the default.

<snip>

Chuck

On Tue, Jul 16, 2013 at 3:50 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jul 15, 2013 at 6:58 PM, Benjamin Root <ben.root@ou.edu> wrote:
To add a bit of context to the question of nansum on empty results, we currently differ from MATLAB and R in this respect, they return zero no matter what. Personally, I think it should return zero, but our current behavior of returning nans has existed for a long time.
Personally, I think we need a deprecation warning and possibly wait to change this until 2.0, with plenty of warning that this will change.
Waiting for the mythical 2.0 probably won't work ;) We also need to give folks a way to adjust ahead of time. I think the easiest way to do that is with an extra keyword, say nanok, with True as the starting default, then later we can make False the default.
No special keywords to work around a behavior change please; it doesn't work well and you end up with a keyword you don't really want. Why not just give a FutureWarning in 1.8 and change to returning zero in 1.9?

Ralf
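A minimal sketch of that transition path, assuming a flat array for simplicity (hypothetical code, not the actual patch):

    import warnings
    import numpy as np

    def nansum_transitional(a):
        # 1.8-style behaviour: still return NaN when no valid entries
        # remain, but warn that this will become 0 in a later release.
        a = np.asarray(a, dtype=float)
        good = a[~np.isnan(a)]
        if good.size == 0:
            warnings.warn("nansum of an empty or all-NaN array will "
                          "return 0 instead of NaN in a future release",
                          FutureWarning)
            return np.nan
        return good.sum()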

On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
Agreed, although that changes current behavior. What about the other cases?
Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:
Non-NaN case
1) Empty array -> ValueError
The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.
To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.
2) ddof >= n -> ValueError
If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.
Makes sense to me, especially for ddof > n. Just returning nan in all cases for backward compatibility would be fine with me too.
NaN case

1) Empty array -> ValueError
2) Empty slice -> NaN
3) For slice ddof >= n -> NaN

Personally I would somewhat prefer it if 1) and 2) at least defaulted to the same thing. But I don't use the nanfuncs anyway. I was wondering about adding an option for the user to pick what the fill is (e.g. if it is None (maybe the default) -> ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there.

- Sebastian
Chuck

On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg <sebastian@sipsolutions.net>wrote:
On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
Agreed, although that changes current behavior. What about the other cases?
Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:
Non-NaN case
1) Empty array -> ValueError
The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.
To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.
I'm talking about mean, var, and std as statistics; sum isn't part of that. If there is agreement that nansum of empty arrays/columns should be zero I will do that. Note that sums of empty arrays may or may not be empty.

In [1]: ones((0, 3)).sum(axis=0)
Out[1]: array([ 0., 0., 0.])

In [2]: ones((3, 0)).sum(axis=0)
Out[2]: array([], dtype=float64)

Which, sort of, makes sense.
2) ddof >= n -> ValueError
If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.
Makes sense to me, especially for ddof > n. Just returning nan in all cases for backward compatibility would be fine with me too.
NaN case

1) Empty array -> ValueError
2) Empty slice -> NaN
3) For slice ddof >= n -> NaN

Personally I would somewhat prefer it if 1) and 2) at least defaulted to the same thing. But I don't use the nanfuncs anyway. I was wondering about adding an option for the user to pick what the fill is (e.g. if it is None (maybe the default) -> ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there.
Chuck

On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote:
On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])

Agreed, although that changes current behavior. What about the other cases?

Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:

Non-NaN case

1) Empty array -> ValueError

The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for an empty array.
To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.
I'm talking about mean, var, and std as statistics; sum isn't part of that. If there is agreement that nansum of empty arrays/columns should be zero I will do that. Note that sums of empty arrays may or may not be empty.

In [1]: ones((0, 3)).sum(axis=0)
Out[1]: array([ 0., 0., 0.])

In [2]: ones((3, 0)).sum(axis=0)
Out[2]: array([], dtype=float64)
Which, sort of, makes sense.
I think we can agree that the behaviour for reductions with an identity should default to returning the identity, including for the nanfuncs, i.e. sum([]) is 0, product([]) is 1...

Since mean = sum/length is a sensible definition, having 0/0 as a result doesn't seem too bad to me, to be honest; it might be accidental but it is not a special case in the code ;). Though I don't mind an error as long as it doesn't break matplotlib or so.

I agree that the nanfuncs raising an error would probably be more of a problem than for a usual ufunc, but I am still a bit hesitant about saying that it is ok too. I could imagine adding a very general "identity" argument (though I would not call it identity, because it is not the same as `np.add.identity`, just used in a place where that would be used otherwise):

np.add.reduce([], identity=123) -> [123]
np.add.reduce([1], identity=123) -> [1]
np.nanmean([np.nan], identity=None) -> Error
np.nanmean([np.nan], identity=np.nan) -> np.nan

It doesn't really make sense, but:

np.subtract.reduce([]) -> Error, since np.subtract.identity is None
np.subtract.reduce([], identity=0) -> 0, suppressing the error.

I am not sure if I am convinced myself, but especially for the nanfuncs it could maybe provide a way to circumvent the problem somewhat, including functions such as np.nanargmin, whose result type does not even support NaN. Plus it gives an argument allowing for warnings about changing behaviour.

- Sebastian
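For concreteness, those semantics as a plain function (the `identity` keyword is purely hypothetical; no such argument exists on ufunc.reduce):

    import numpy as np

    def reduce_with_identity(ufunc, a, identity=None):
        # Hypothetical: the caller picks the value returned for an empty
        # reduction; identity=None keeps the current error behaviour.
        a = np.asarray(a)
        if a.size == 0:
            if identity is None:
                raise ValueError("zero-size reduction with no identity")
            return identity
        return ufunc.reduce(a)

    reduce_with_identity(np.add, [], identity=123)   # 123
    reduce_with_identity(np.add, [1], identity=123)  # 1
    # reduce_with_identity(np.subtract, []) raises ValueError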
2) ddof >= n -> ValueError

If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.
Makes sense to me, especially for ddof > n. Just returning nan in all cases for backward compatibility would be fine with me too.
NaN case

1) Empty array -> ValueError
2) Empty slice -> NaN
3) For slice ddof >= n -> NaN
Personally I would somewhat prefer it if 1) and 2) at least defaulted to the same thing. But I don't use the nanfuncs anyway. I was wondering about adding an option for the user to pick what the fill is (e.g. if it is None (maybe the default) -> ValueError). We could also allow this for normal reductions without an identity, but I am not sure if it is useful there.
Chuck

On Mon, Jul 15, 2013 at 9:55 AM, Sebastian Berg <sebastian@sipsolutions.net>wrote:
On Mon, 2013-07-15 at 08:47 -0600, Charles R Harris wrote:
On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])

Agreed, although that changes current behavior. What about the other cases?

Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:

Non-NaN case

1) Empty array -> ValueError

The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for an empty array.
To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure if I would prefer always an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.
I'm talking about mean, var, and std as statistics; sum isn't part of that. If there is agreement that nansum of empty arrays/columns should be zero I will do that. Note that sums of empty arrays may or may not be empty.

In [1]: ones((0, 3)).sum(axis=0)
Out[1]: array([ 0., 0., 0.])

In [2]: ones((3, 0)).sum(axis=0)
Out[2]: array([], dtype=float64)
Which, sort of, makes sense.
I think we can agree that the behaviour for reductions with an identity should default to returning the identity, including for the nanfuncs, i.e. sum([]) is 0, product([]) is 1...
Since mean = sum/length is a sensible definition, having 0/0 as a result doesn't seem too bad to me, to be honest; it might be accidental but it is not a special case in the code ;). Though I don't mind an error as long as it doesn't break matplotlib or so.
I agree that the nanfuncs raising an error would probably be more of a problem than for a usual ufunc, but I am still a bit hesitant about saying that it is ok too. I could imagine adding a very general "identity" argument (though I would not call it identity, because it is not the same as `np.add.identity`, just used in a place where that would be used otherwise):
np.add.reduce([], identity=123) -> [123]
np.add.reduce([1], identity=123) -> [1]
np.nanmean([np.nan], identity=None) -> Error
np.nanmean([np.nan], identity=np.nan) -> np.nan
It doesn't really make sense, but:

np.subtract.reduce([]) -> Error, since np.subtract.identity is None
np.subtract.reduce([], identity=0) -> 0, suppressing the error.
I am not sure if I am convinced myself, but especially for the nanfuncs it could maybe provide a way to circumvent the problem somewhat, including functions such as np.nanargmin, whose result type does not even support NaN. Plus it gives an argument allowing for warnings about changing behaviour.
Let me try to summarize. To begin with, the environment of the nan functions is rather special.

1) if the array is not of inexact type, they punt to the non-nan versions.
2) if the array is of inexact type, then out and dtype must be inexact if specified

The second assumption guarantees that NaN can be used in the return values.

*sum and nansum*

These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum, which currently returns NaN if the array isn't empty but the slice is after NaN removal.

*mean and nanmean*

In the case of empty arrays (an empty slice), this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers. Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.

Option 1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option 2
1) mean: raise warning, return NaN (current behavior)
2) nanmean: no warning, return NaN

Option 3
1) mean: raise warning, return NaN (current behavior)
2) nanmean: raise warning, return NaN

*var, std, nanvar, nanstd*

1) if ddof > axis(axes) size, raise error, probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good, some bad, so

Option 1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option 2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

<snip>

Chuck

On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Let me try to summarize. To begin with, the environment of the nan functions is rather special.
1) if the array is not of inexact type, they punt to the non-nan versions.
2) if the array is of inexact type, then out and dtype must be inexact if specified

The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the nan functions like to return nan for things like empty arrays, right? If not for that, it could be relaxed? (It's a rather weird requirement, since the whole point of these functions is that they ignore nans, yet they don't always...)
sum and nansum
These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum which currently returns NaN if the array isn't empty but the slice is after NaN removal.
I agree that returning 0 is the right behaviour, but we might need a FutureWarning period.
mean and nanmean
In the case of empty arrays (an empty slice), this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers.
Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.
Option 1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option 2
1) mean: raise warning, return NaN (current behavior)
2) nanmean: no warning, return NaN

Option 3
1) mean: raise warning, return NaN (current behavior)
2) nanmean: raise warning, return NaN
I have mixed feelings about the whole np.seterr apparatus, but since it exists, shouldn't we use it for consistency? I.e., just do whatever numpy is set up to do with 0/0? (Which I think means, warn and return NaN by default, but this can be changed.)
var, std, nanvar, nanstd
1) if ddof > axis(axes) size, raise error, probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good, some bad, so

Option 1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option 2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

-n

On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Let me try to summarize. To begin with, the environment of the nan functions is rather special.
1) if the array is not of inexact type, they punt to the non-nan versions.
2) if the array is of inexact type, then out and dtype must be inexact if specified

The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the nan functions like to return nan for things like empty arrays, right? If not for that, it could be relaxed? (It's a rather weird requirement, since the whole point of these functions is that they ignore nans, yet they don't always...)
sum and nansum
These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum which currently returns NaN if the array isn't empty but the slice is after NaN removal.
I agree that returning 0 is the right behaviour, but we might need a FutureWarning period.
mean and nanmean
In the case of empty arrays (an empty slice), this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers.
Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.
Option 1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option 2
1) mean: raise warning, return NaN (current behavior)
2) nanmean: no warning, return NaN

Option 3
1) mean: raise warning, return NaN (current behavior)
2) nanmean: raise warning, return NaN
I have mixed feelings about the whole np.seterr apparatus, but since it exists, shouldn't we use it for consistency? I.e., just do whatever numpy is set up to do with 0/0? (Which I think means, warn and return NaN by default, but this can be changed.)
var, std, nanvar, nanstd
1) if ddof > axis(axes) size, raise error, probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good, some bad, so

Option 1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option 2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

Related: why does reduceat not have empty slices?

>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
array([ 6,  4, 11,  7,  7])
I'm in favor of returning nans instead of raising exceptions, except if the return type is int and we cannot cast nan to int. If we get functions into numpy that know how to handle nans, then it would be useful to get the nans, so we can work with them.

Some cases where this might come in handy are when we iterate over slices of an array that define groups or category levels with possible empty groups *):

>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>> x = np.arange(9)
>>> [x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of

>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
[1.5, 5.0, 7.5]

Same for var: I wouldn't have to check that the size is larger than the ddof (whatever that is in the specific case).

*) groups could be empty because they were defined for a larger dataset or as a union of different datasets

PS: I used mean() above and not var() because

>>> np.__version__
'1.5.1'
>>> np.mean([])
nan
>>> np.var([])
0.0

Josef
-n

On Mon, Jul 15, 2013 at 4:24 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Let me try to summarize. To begin with, the environment of the nan functions is rather special.
1) if the array is not of inexact type, they punt to the non-nan versions.
2) if the array is of inexact type, then out and dtype must be inexact if specified

The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the nan functions like to return nan for things like empty arrays, right? If not for that, it could be relaxed? (It's a rather weird requirement, since the whole point of these functions is that they ignore nans, yet they don't always...)
sum and nansum
These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum which currently returns NaN if the array isn't empty but the slice is after NaN removal.
I agree that returning 0 is the right behaviour, but we might need a FutureWarning period.
mean and nanmean
In the case of empty arrays (an empty slice), this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers.
Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.
Option 1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option 2
1) mean: raise warning, return NaN (current behavior)
2) nanmean: no warning, return NaN

Option 3
1) mean: raise warning, return NaN (current behavior)
2) nanmean: raise warning, return NaN
I have mixed feelings about the whole np.seterr apparatus, but since it exists, shouldn't we use it for consistency? I.e., just do whatever numpy is set up to do with 0/0? (Which I think means, warn and return NaN by default, but this can be changed.)
var, std, nanvar, nanstd
1) if ddof > axis(axes) size, raise error, probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good, some bad, so

Option 1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option 2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

Related: why does reduceat not have empty slices?

>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
array([ 6,  4, 11,  7,  7])
I'm in favor of returning nans instead of raising exceptions, except if the return type is int and we cannot cast nan to int.
If we get functions into numpy that know how to handle nans, then it would be useful to get the nans, so we can work with them
Some cases where this might come in handy are when we iterate over slices of an array that define groups or category levels with possible empty groups *)
>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>> x = np.arange(9)
>>> [x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of

>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
[1.5, 5.0, 7.5]
same for var, I wouldn't have to check that the size is larger than the ddof (whatever that is in the specific case)
*) groups could be empty because they were defined for a larger dataset or as a union of different datasets
background:

I wrote several robust anova versions a few weeks ago that were essentially list comprehensions as above. However, I didn't allow nans and didn't check for minimum size. Allowing empty groups to return nan would mainly be a convenience, since I would need to check the group size only once.

ddof: tests for proportions have ddof=0, for regular t-tests ddof=1, and for tests of correlation ddof=2 IIRC, so we would need to check for the corresponding minimum size so that n-ddof>0.

"negative effective dof" doesn't exist; that's np.maximum(n - ddof, 0), which is always non-negative but might result in a zero-division error. :)

I don't think making anything conditional on ddof>0 is useful.

Josef
PS: I used mean() above and not var() because
>>> np.__version__
'1.5.1'
>>> np.mean([])
nan
>>> np.var([])
0.0
Josef
-n

On Mon, Jul 15, 2013 at 2:44 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Let me try to summarize. To begin with, the environment of the nan functions is rather special.
1) if the array is not of inexact type, they punt to the non-nan versions.
2) if the array is of inexact type, then out and dtype must be inexact if specified

The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the nan functions like to return nan for things like empty arrays, right? If not for that, it could be relaxed? (It's a rather weird requirement, since the whole point of these functions is that they ignore nans, yet they don't always...)
sum and nansum
These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum, which currently returns NaN if the array isn't empty but the slice is after NaN removal.
I agree that returning 0 is the right behaviour, but we might need a FutureWarning period.
mean and nanmean
In the case of empty arrays (an empty slice), this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers.
Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.
Option 1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option 2
1) mean: raise warning, return NaN (current behavior)
2) nanmean: no warning, return NaN

Option 3
1) mean: raise warning, return NaN (current behavior)
2) nanmean: raise warning, return NaN
I have mixed feelings about the whole np.seterr apparatus, but since it exists, shouldn't we use it for consistency? I.e., just do whatever numpy is set up to do with 0/0? (Which I think means, warn and return NaN by default, but this can be changed.)
var, std, nanvar, nanstd
1) if ddof > axis(axes) size, raise error, probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good, some bad, so

Option 1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option 2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

Related: why does reduceat not have empty slices?

>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
array([ 6,  4, 11,  7,  7])
I'm in favor of returning nans instead of raising exceptions, except if the return type is int and we cannot cast nan to int.
If we get functions into numpy that know how to handle nans, then it would be useful to get the nans, so we can work with them
Some cases where this might come in handy are when we iterate over slices of an array that define groups or category levels with possible empty groups *)
>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>> x = np.arange(9)
>>> [x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of

>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
[1.5, 5.0, 7.5]
same for var, I wouldn't have to check that the size is larger than the ddof (whatever that is in the specific case)
*) groups could be empty because they were defined for a larger dataset or as a union of different datasets
background:
I wrote several robust anova versions a few weeks ago that were essentially list comprehensions as above. However, I didn't allow nans and didn't check for minimum size. Allowing empty groups to return nan would mainly be a convenience, since I would need to check the group size only once.

ddof: tests for proportions have ddof=0, for regular t-tests ddof=1, and for tests of correlation ddof=2 IIRC, so we would need to check for the corresponding minimum size so that n-ddof>0.

"negative effective dof" doesn't exist; that's np.maximum(n - ddof, 0), which is always non-negative but might result in a zero-division error. :)
I don't think making anything conditional on ddof>0 is useful.
So how would you want it?

To summarize the problem areas:

1) What is the sum of an empty slice? NaN or 0?
2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?

I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the warning can be turned into an error by the user. The errstate context manager would be good for that.

Chuck
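For reference, the errstate pattern mentioned above, with current behaviour (the 0/0 in mean counts as an invalid floating point operation):

    import numpy as np

    x = np.ones(0)
    x.mean()  # nan, plus a RuntimeWarning under the default error state
    with np.errstate(invalid='raise'):
        x.mean()  # the same 0/0 now raises FloatingPointError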

On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jul 15, 2013 at 2:44 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jul 15, 2013 at 4:24 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Let me try to summarize. To begin with, the environment of the nan functions is rather special.
1) if the array is not of inexact type, they punt to the non-nan versions.
2) if the array is of inexact type, then out and dtype must be inexact if specified

The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the nan functions like to return nan for things like empty arrays, right? If not for that, it could be relaxed? (It's a rather weird requirement, since the whole point of these functions is that they ignore nans, yet they don't always...)
sum and nansum
These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum which currently returns NaN if the array isn't empty but the slice is after NaN removal.
I agree that returning 0 is the right behaviour, but we might need a FutureWarning period.
mean and nanmean
In the case of empty arrays (an empty slice), this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers.
Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.
Option 1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option 2
1) mean: raise warning, return NaN (current behavior)
2) nanmean: no warning, return NaN

Option 3
1) mean: raise warning, return NaN (current behavior)
2) nanmean: raise warning, return NaN
I have mixed feelings about the whole np.seterr apparatus, but since it exists, shouldn't we use it for consistency? I.e., just do whatever numpy is set up to do with 0/0? (Which I think means, warn and return NaN by default, but this can be changed.)
var, std, nanvar, nanstd
1) if ddof > axis(axes) size, raise error, probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good, some bad, so

Option 1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option 2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

Related: why does reduceat not have empty slices?

>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
array([ 6,  4, 11,  7,  7])
I'm in favor of returning nans instead of raising exceptions, except if the return type is int and we cannot cast nan to int.
If we get functions into numpy that know how to handle nans, then it would be useful to get the nans, so we can work with them
Some cases where this might come in handy are when we iterate over slices of an array that define groups or category levels with possible empty groups *)
>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>> x = np.arange(9)
>>> [x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of

>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
[1.5, 5.0, 7.5]
same for var, I wouldn't have to check that the size is larger than the ddof (whatever that is in the specific case)
*) groups could be empty because they were defined for a larger dataset or as a union of different datasets
background:
I wrote several robust anova versions a few weeks ago that were essentially list comprehensions as above. However, I didn't allow nans and didn't check for minimum size. Allowing empty groups to return nan would mainly be a convenience, since I would need to check the group size only once.

ddof: tests for proportions have ddof=0, for regular t-tests ddof=1, and for tests of correlation ddof=2 IIRC, so we would need to check for the corresponding minimum size so that n-ddof>0.

"negative effective dof" doesn't exist; that's np.maximum(n - ddof, 0), which is always non-negative but might result in a zero-division error. :)
I don't think making anything conditional on ddof>0 is useful.
So how would you want it?
To summarize the problem areas:
1) What is the sum of an empty slice? NaN or 0?
0, as it is now for sum (including 0 for nansum with no valid entries).

2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?
I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the warning can be turned into an error by the user. The errstate context manager would be good for that.
Yes, that's what I would prefer also: NaN and ZeroDivisionError, for 2-4, including mean, var and std, for both nan and non-nan functions.

With the extra argument that 3) and 4) are the same case (except in polyfit :)

Josef
Chuck

On Mon, Jul 15, 2013 at 3:57 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jul 15, 2013 at 2:44 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jul 15, 2013 at 4:24 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Let me try to summarize. To begin with, the environment of the nan functions is rather special.
1) if the array is not of inexact type, they punt to the non-nan versions.
2) if the array is of inexact type, then out and dtype must be inexact if specified

The second assumption guarantees that NaN can be used in the return values.

The requirement on the 'out' dtype only exists because currently the nan functions like to return nan for things like empty arrays, right? If not for that, it could be relaxed? (It's a rather weird requirement, since the whole point of these functions is that they ignore nans, yet they don't always...)
sum and nansum
These should be consistent so that empty sums are 0. This should cover the empty array case, but will change the behaviour of nansum which currently returns NaN if the array isn't empty but the slice is after NaN removal.
I agree that returning 0 is the right behaviour, but we might need a FutureWarning period.
mean and nanmean
In the case of empty arrays (an empty slice), this leads to 0/0. For Python this is always a zero division error; for Numpy this raises a warning and returns NaN for floats, 0 for integers.
Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In the special case where dtype=int, the NaN is cast to integer.
Option 1
1) mean: raise error on 0/0
2) nanmean: no warning, return NaN

Option 2
1) mean: raise warning, return NaN (current behavior)
2) nanmean: no warning, return NaN

Option 3
1) mean: raise warning, return NaN (current behavior)
2) nanmean: raise warning, return NaN
I have mixed feelings about the whole np.seterr apparatus, but since it exists, shouldn't we use it for consistency? I.e., just do whatever numpy is set up to do with 0/0? (Which I think means, warn and return NaN by default, but this can be changed.)
var, std, nanvar, nanstd
1) if ddof > axis(axes) size, raise error, probably a program bug.
2) If ddof=0, then whatever is the case for mean, nanmean

For nanvar, nanstd it is possible that some slices are good, some bad, so

Option 1
1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice

Option 2
1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice

I don't really have any intuition for these ddof cases. Just raising an error on negative effective dof is pretty defensible and might be the safest -- it's easy to turn an error into something sensible later if people come up with use cases...

Related: why does reduceat not have empty slices?

>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
array([ 6,  4, 11,  7,  7])
I'm in favor of returning nans instead of raising exceptions, except if the return type is int and we cannot cast nan to int.
If we get functions into numpy that know how to handle nans, then it would be useful to get the nans, so we can work with them
Some cases where this might come in handy are when we iterate over slices of an array that define groups or category levels with possible empty groups *)
>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>> x = np.arange(9)
>>> [x[idx==ii].mean() for ii in range(4)]
[1.5, 5.0, nan, 7.5]

instead of

>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
[1.5, 5.0, 7.5]
same for var, I wouldn't have to check that the size is larger than the ddof (whatever that is in the specific case)
*) groups could be empty because they were defined for a larger dataset or as a union of different datasets
background:
I wrote several robust anova versions a few weeks ago that were essentially list comprehensions as above. However, I didn't allow nans and didn't check for minimum size. Allowing empty groups to return nan would mainly be a convenience, since I would need to check the group size only once.

ddof: tests for proportions have ddof=0, for regular t-tests ddof=1, and for tests of correlation ddof=2 IIRC, so we would need to check for the corresponding minimum size so that n-ddof>0.

"negative effective dof" doesn't exist; that's np.maximum(n - ddof, 0), which is always non-negative but might result in a zero-division error. :)
I don't think making anything conditional on ddof>0 is useful.
So how would you want it?
To summarize the problem areas:
1) What is the sum of an empty slice? NaN or 0?

0, as it is now for sum (including 0 for nansum with no valid entries).

2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?
I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the warning can be turned into an error by the user. The errstate context manager would be good for that.
Yes, that's what I would prefer also: NaN and ZeroDivisionError for 2) -- 4), including mean, var, and std, for both the nan and non-nan functions,
with the extra argument that 3) and 4) are the same case (except in polyfit :)
One extra possibility with the nan functions could be a new keyword, error, which would turn warnings into errors. But that might be a bit much. Chuck

On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
For nansum, I would expect 0 even in the case of all nans. The point of these functions is to simply ignore nans, correct? So I would aim for this behaviour: nanfunc(x) behaves the same as func(x[~isnan(x)])
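(For concreteness, that invariant written as a checkable property:)

import numpy as np

x = np.array([1.0, np.nan, 2.0])
assert np.nansum(x) == np.sum(x[~np.isnan(x)])   # both 3.0
# For an all-NaN (or empty) input the right-hand side is the sum of an
# empty array, i.e. 0 -- so under this proposal nansum would return 0 too.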
Agreed, although that changes current behavior. What about the other cases?
Looks like there isn't much interest in the topic, so I'll just go ahead with the following choices:
Non-NaN case
1) Empty array -> ValueError
The current behavior with stats is an accident, i.e., the nan arises from 0/0. I like to think that in this case the result is any number, rather than not a number, so *the* value is simply not defined. So in this case raise a ValueError for empty array.
To be honest, I don't mind the current behaviour much: sum([]) = 0, len([]) = 0, so it is in a way well defined. At least I am not sure I would prefer always raising an error. I am a bit worried that just changing it might break code out there, such as plotting code where it makes perfect sense to plot a NaN (i.e. nothing), but if that is the case it would probably be visible fast.
2) ddof >= n -> ValueError
If the number of elements, n, is not zero and ddof >= n, raise a ValueError for the ddof value.
Makes sense to me, especially for ddof > n. Just returning nan in all cases for backward compatibility would be fine with me too.
Currently, if ddof > n it returns a negative number for the variance; the NaN only comes when ddof == 0 and n == 0, leading to 0/0 (NaN for float, zero division for integer).
NaN case
1) Empty array -> ValueError
2) Empty slice -> NaN
3) For a slice with ddof >= n -> NaN
Personally I would somewhat prefer it if 1) and 2) at least defaulted to the same thing. But I don't use the nanfuncs anyway. I was wondering about adding an option for the user to pick what the fill is (where, e.g., None (maybe the default) -> ValueError). We could also allow this for normal reductions without an identity, but I am not sure it is useful there.
In the NaN case some slices may be empty, others not. My reasoning is that this is going to be data dependent, not operator error, but if the whole array is empty the writer of the code should deal with that. Chuck

On Mon, Jul 15, 2013 at 8:58 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
In the case of nanvar, nanstd, it might make more sense to handle ddof as

1) if ddof >= axis size, raise ValueError
2) if ddof >= number of values after removing NaNs, return NaN

The first would be consistent with the non-nan case, the second accounts for the variable nature of data containing NaNs. Chuck
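(A minimal sketch of that two-tier rule -- the helper name check_nan_ddof is hypothetical:)

import numpy as np

def check_nan_ddof(a, axis, ddof):
    # 1) ddof >= axis size: always a programming error -> hard failure
    if ddof >= a.shape[axis]:
        raise ValueError("ddof >= size of axis %d" % axis)
    # 2) ddof >= per-slice count of non-NaN values: data dependent;
    #    return a mask of the slices that should come back as NaN
    n_valid = (~np.isnan(a)).sum(axis=axis)
    return n_valid <= ddof

Slices flagged by the returned mask would then be filled with NaN in the result, while a bad ddof relative to the axis itself fails loudly.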

On Jul 15, 2013 11:47 AM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
<snip>
I think this is a good idea in that it follows naturally from the conventions for what to do with empty arrays / empty slices in nanmean, etc. Note, however, that I am not a very big fan of having two different behaviors for what I see as semantically the same thing. But my objections are not strong enough to veto it, and I do think this proposal is well thought out. Ben Root
participants (8)
- Benjamin Root
- Charles R Harris
- josef.pktd@gmail.com
- Nathaniel Smith
- Ralf Gommers
- Sebastian Berg
- Stéfan van der Walt
- Warren Weckesser