The mean, var, std of empty arrays
What should be the value of the mean, var, and std of empty arrays? Currently:

In [12]: a
Out[12]: array([], dtype=int64)

In [13]: a.mean()
Out[13]: nan

In [14]: a.std()
Out[14]: nan

In [15]: a.var()
Out[15]: nan

I think the nan comes from 0/0. All of these also raise warnings the first time they are called.

Chuck
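A quick check supports the 0/0 reading (a sketch; it assumes mean is computed as the sum divided by the element count, which matches the _methods.py traceback quoted later in the thread):

    import numpy as np

    # Both the sum and the element count of an empty array are zero, so
    # the division that produces the mean is 0/0, i.e. nan.
    a = np.array([], dtype=np.int64)
    with np.errstate(invalid='ignore'):
        print(a.sum() / np.float64(a.size))   # nan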
Current behavior looks sensible to me. I personally would prefer no warning
but I think it makes sense to have one as it can be helpful to detect
issues faster.
-=- Olivier
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
Chuck
Hi Olivier,
Please don't top post; it isn't the custom on this list.
On Wed, Nov 21, 2012 at 7:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
-=- Olivier
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
The warnings vary and don't directly give information on the cause, i.e., empty arrays. If we do go with warnings, I think they should be more specific.
Chuck
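As a rough illustration of what a more specific warning could look like (a sketch; safe_mean is a hypothetical wrapper, not a NumPy API):

    import warnings
    import numpy as np

    def safe_mean(a):
        # Warn explicitly about the empty input instead of relying on
        # the generic invalid-value warning from the division.
        a = np.asarray(a)
        if a.size == 0:
            warnings.warn("mean of empty array, returning nan",
                          RuntimeWarning, stacklevel=2)
            return np.float64(np.nan)
        return a.mean()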
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
-=- Olivier
It's configurable.

[~/]
[1]: np.seterr(all='ignore')
[1]: {'divide': 'ignore', 'invalid': 'ignore', 'over': 'ignore', 'under': 'ignore'}

[~/]
[2]: np.array([]).mean()
[2]: nan

[~/]
[3]: np.seterr(all='warn')
[3]: {'divide': 'ignore', 'invalid': 'ignore', 'over': 'ignore', 'under': 'ignore'}

[~/]
[4]: np.array([]).mean()
/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py:57: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
[4]: nan

Skipper
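For a local rather than global override, np.errstate scopes the same settings to a single block (a sketch; it assumes the warning is routed through the floating-point error state, as in the session above):

    import numpy as np

    # The error state is changed only inside the block and restored on
    # exit, so the global seterr configuration is untouched.
    with np.errstate(invalid='ignore', divide='ignore'):
        result = np.array([]).mean()   # nan, without the warning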
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
Chuck
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)

Some funnier cases:

>>> np.var([1], ddof=1)
0.0
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
But maybe my numpy is too old on my open interpreter:

>>> np.__version__
'1.5.1'
Josef
-=- Olivier
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
Chuck
On Wed, Nov 21, 2012 at 7:45 PM,
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also:

In [10]: var([], ddof=1)
Out[10]: -0

Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.

<snip>

Chuck
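The division-by-zero reading checks out by hand (a sketch, assuming the textbook normalization var = ss / (n - ddof), where ss is the sum of squared deviations):

    # n = 0, ddof = 1: ss is an empty sum, so 0.0 / (0 - 1) is -0.0,
    # matching var([], ddof=1) above.
    print(0.0 / (0 - 1))   # -0.0

    # n = 1, ddof = 5: the single deviation is zero, so 0.0 / -4 is -0.0.
    print(0.0 / (1 - 5))   # -0.0

    # n = 2, ddof = 5: ss = 0.25 + 0.25 = 0.5, giving a negative
    # "variance"; the sqrt of that negative number is where std's nan
    # comes from.
    print(0.5 / (2 - 5))   # -0.16666666666666666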
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this: ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers, as long as we don't allow for missing values.

A quick check with np.ma looks correct except when delegating to numpy?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)

>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)

>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)

>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
Josef
<snip>
Chuck
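A minimal sketch of the ValueError policy described above (checked_var is a hypothetical helper, not a NumPy API):

    import numpy as np

    def checked_var(a, ddof=0):
        # Reject the degenerate cases up front instead of letting the
        # division produce nan or a negative "variance".
        a = np.asarray(a)
        if a.size - ddof <= 0:
            raise ValueError("ddof=%d too large for %d element(s)"
                             % (ddof, a.size))
        return np.var(a, ddof=ddof)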
On Wed, Nov 21, 2012 at 10:58 PM,
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this. ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers in this case, as long as we don't allow for missing values.
I think I prefer NaNs to an exception; they propagate more nicely to downstream functions.

I'm in favor of a policy instead of nans or wrong numbers by accident.
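A small illustration of that propagation argument (a sketch; the group data is made up):

    import numpy as np

    # With nan, one empty group degrades gracefully instead of aborting
    # the whole computation with an exception.
    groups = [np.array([1.0, 2.0]), np.array([])]
    means = np.array([g.mean() for g in groups])   # may warn once
    print(means)          # [ 1.5  nan]
    print(means * 2.0)    # [ 3.  nan] -- the nan flows downstream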
quick check with np.ma
looks correct except when delegating to numpy ?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)
>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)
>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
and cov:
>>> np.cov([1.],[3],bias=True, rowvar=False)   # looks fine
array([[ 0., 0.],
       [ 0., 0.]])
>>> np.cov([1.],[3],bias=False, rowvar=False)
array([[ nan, nan],
       [ nan, nan]])

>>> np.cov([[1.],[3]],bias=False, rowvar=True)
array([[ nan, nan],
       [ nan, nan]])

>>> np.cov([],[],bias=False, rowvar=False)   # should be nan
array([[-0., -0.],
       [-0., -0.]])
>>> np.cov([],[],bias=True, rowvar=False)
array([[ nan, nan],
       [ nan, nan]])
np.corrcoef seems to have nans in the right places in the examples I tried.

Josef
Josef
<snip>
Chuck
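The cov corner cases follow the same division pattern (a sketch, assuming the usual normalization: bias=True divides by N, bias=False by N - 1):

    import numpy as np

    N = 1                    # a single observation
    ss = np.float64(0.0)     # sum of cross-products of deviations
    print(ss / N)                       # 0.0 -> bias=True "looks fine"
    with np.errstate(invalid='ignore'):
        print(ss / np.float64(N - 1))   # nan (0/0) -> bias=False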
On Wed, 2012-11-21 at 22:58 -0500, josef.pktd@gmail.com wrote:
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this. ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers in this case, as long as we don't allow for missing values.
It seems to me that nan is the reasonable result for these operations (reduce-like operations that do not have an identity). Though actually reduce operations without an identity throw a ValueError (i.e. `np.minimum.reduce([])`), but then mean/std/var seem special enough to be different from other reduce operations (for example their result is always floating point).

As for usability, I think for example when plotting errorbars using std, it would be rather annoying to get a ValueError, so if anything the reduce machinery could give more special results for empty floating point reductions.

In any case the warning should be clearer, and for too-large ddofs I would say it should return nan plus a warning as well.

Sebastian
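The identity distinction is easy to see directly (a sketch; the exact ValueError message varies across NumPy versions):

    import numpy as np

    # minimum has no identity element, so an empty reduction raises.
    try:
        np.minimum.reduce([])
    except ValueError as exc:
        print(exc)   # "zero-size array to reduction operation ..."

    # add has the identity 0, so the empty sum is well defined.
    print(np.add.reduce([]))   # 0.0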
quick check with np.ma
looks correct except when delegating to numpy ?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)
>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)
>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
Josef
<snip>
Chuck
On Thu, Nov 22, 2012 at 7:14 AM, Sebastian Berg wrote:
On Wed, 2012-11-21 at 22:58 -0500, josef.pktd@gmail.com wrote:
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this. ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers in this case, as long as we don't allow for missing values.
It seems to me that nan is the reasonable result for these operations (reduce-like operations that do not have an identity). Though actually reduce operations without an identity throw a ValueError (i.e. `np.minimum.reduce([])`), but then mean/std/var seem special enough to be different from other reduce operations (for example their result is always floating point).

As for usability, I think for example when plotting errorbars using std, it would be rather annoying to get a ValueError, so if anything the reduce machinery could give more special results for empty floating point reductions.

In any case the warning should be clearer, and for too-large ddofs I would say it should return nan plus a warning as well.
Why don't operations on empty arrays return empty arrays? But this looks ok:
>>> (np.array([]) - np.array([]).mean()) / np.array([]).std()
array([], dtype=float64)
>>> (np.array([]) - np.array([]).mean()) / np.array([]).std(0)
array([], dtype=float64)

>>> (np.array([]) - np.array([]).mean(0)) / np.array([]).std(0)
array([], dtype=float64)
>>> (np.array([]) - np.array([]).mean(0)) / np.array([])
array([], dtype=float64)

>>> np.array([[]]) - np.expand_dims(np.array([[]]).mean(1),1)
array([], shape=(1, 0), dtype=float64)

>>> np.array([[]]) - np.expand_dims(np.array([]),1)
array([], shape=(0, 0), dtype=float64)

>>> np.array([]) - np.expand_dims(np.array([]),0)
array([], shape=(1, 0), dtype=float64)
(But I doubt I will rely on correct "calculations" with empty arrays in many cases.)

Josef
Sebastian
quick check with np.ma
looks correct except when delegating to numpy ?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)
>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)
>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
Josef
<snip>
Chuck
On Thu, 2012-11-22 at 16:05 +0100, Daπid wrote:
On Thu, Nov 22, 2012 at 3:54 PM, wrote:
Why don't operations on empty arrays return empty arrays?
Because functions like mean or std are expected to return a scalar. Functions that are piecewise can (and should) return an empty array, but not the mean.
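The distinction in a couple of lines (a sketch):

    import numpy as np

    e = np.array([])
    # An elementwise (piecewise) operation keeps the empty shape ...
    print((e - 0.0).shape)   # (0,)
    # ... while a full reduction must produce one scalar, hence nan
    # (possibly with a RuntimeWarning, depending on the version).
    print(e.mean())          # nan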
I agree, this makes sense, note that:

In [2]: a = np.empty((5,0))

In [3]: a.std(0)
Out[3]: array([], dtype=float64)

In [4]: a.std(1)
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #!/usr/bin/env python
Out[4]: array([ nan, nan, nan, nan, nan])

However you are reducing, and with reducing you expect exactly 1 scalar result (along that dimension).
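Spelled out (a sketch of the shape logic):

    import numpy as np

    a = np.empty((5, 0))
    # Reducing along axis 0 leaves one value per column; with 0 columns
    # the result is empty, shape (0,), and no nan is ever produced.
    print(a.std(0).shape)   # (0,)
    # Reducing along axis 1 leaves one value per row; 5 rows require 5
    # results, and each empty reduction yields nan (with a
    # RuntimeWarning, depending on the version).
    print(a.std(1))         # [ nan nan nan nan nan]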
On Thu, Nov 22, 2012 at 10:15 AM, Sebastian Berg wrote:
On Thu, 2012-11-22 at 16:05 +0100, Daπid wrote:
On Thu, Nov 22, 2012 at 3:54 PM, wrote:
Why don't operations on empty arrays return empty arrays?
Because functions like mean or std are expected to return a scalar. Functions that are piecewise can (and should) return an empty array, but not the mean.
I agree, this makes sense, note that:
In [2]: a = np.empty((5,0))
In [3]: a.std(0)
Out[3]: array([], dtype=float64)

In [4]: a.std(1)
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #!/usr/bin/env python
Out[4]: array([ nan, nan, nan, nan, nan])
However you are reducing, and with reducing you expect exactly 1 scalar result (along that dimension).
Ok, I see. We cannot have an empty 1-D array of shape (5,).

Josef
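(The shape arithmetic behind that, as a sketch: an array is empty only if some axis has length 0, and a shape-(5,) result necessarily holds 5 values.)

    import numpy as np

    print(np.empty((5,)).size)    # 5 -- never empty
    print(np.empty((5, 0)).size)  # 0 -- empty, zero in the shape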
participants (6)
- Charles R Harris
- Daπid
- josef.pktd@gmail.com
- Olivier Delalleau
- Sebastian Berg
- Skipper Seabold