The mean, var, std of empty arrays
What should be the value of the mean, var, and std of empty arrays? Currently:

In [12]: a
Out[12]: array([], dtype=int64)

In [13]: a.mean()
Out[13]: nan

In [14]: a.std()
Out[14]: nan

In [15]: a.var()
Out[15]: nan

I think the nan comes from 0/0. All of these also raise warnings the first time they are called.

Chuck
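A quick check supports the 0/0 reading (a sketch; it assumes mean is computed as the sum divided by the element count, which matches the _methods.py traceback quoted later in the thread):

    import numpy as np

    # Both the sum and the element count of an empty array are zero, so
    # the division that produces the mean is 0/0, i.e. nan.
    a = np.array([], dtype=np.int64)
    with np.errstate(invalid='ignore'):
        print(a.sum() / np.float64(a.size))   # nan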
Current behavior looks sensible to me. I personally would prefer no warning
but I think it makes sense to have one as it can be helpful to detect
issues faster.
-=- Olivier
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
Chuck
Hi Olivier,
Please don't top post; it isn't the custom on this list.
On Wed, Nov 21, 2012 at 7:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
-=- Olivier
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
The warnings vary and don't directly give information on the cause, i.e., empty arrays. If we do go with warnings, I think they should be more specific.
Chuck
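As a rough illustration of what a more specific warning could look like (a sketch; safe_mean is a hypothetical wrapper, not a NumPy API):

    import warnings
    import numpy as np

    def safe_mean(a):
        # Warn explicitly about the empty input instead of relying on
        # the generic invalid-value warning from the division.
        a = np.asarray(a)
        if a.size == 0:
            warnings.warn("mean of empty array, returning nan",
                          RuntimeWarning, stacklevel=2)
            return np.float64(np.nan)
        return a.mean()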
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
-=- Olivier
It's configurable.

[~/]
[1]: np.seterr(all='ignore')
[1]: {'divide': 'ignore', 'invalid': 'ignore', 'over': 'ignore', 'under': 'ignore'}

[~/]
[2]: np.array([]).mean()
[2]: nan

[~/]
[3]: np.seterr(all='warn')
[3]: {'divide': 'ignore', 'invalid': 'ignore', 'over': 'ignore', 'under': 'ignore'}

[~/]
[4]: np.array([]).mean()
/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py:57: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret / float(rcount)
[4]: nan

Skipper
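For a local rather than global override, np.errstate scopes the same settings to a single block (a sketch; it assumes the warning is routed through the floating-point error state, as in the session above):

    import numpy as np

    # The error state is changed only inside the block and restored on
    # exit, so the global seterr configuration is untouched.
    with np.errstate(invalid='ignore', divide='ignore'):
        result = np.array([]).mean()   # nan, without the warning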
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
Chuck
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)

Some funnier cases:

>>> np.var([1], ddof=1)
0.0
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
But maybe my numpy is too old on my open interpreter:

>>> np.__version__
'1.5.1'
Josef
-=- Olivier
2012/11/21 Charles R Harris
What should be the value of the mean, var, and std of empty arrays? Currently
In [12]: a
Out[12]: array([], dtype=int64)
In [13]: a.mean()
Out[13]: nan
In [14]: a.std()
Out[14]: nan
In [15]: a.var()
Out[15]: nan
I think the nan comes from 0/0. All of these also raise warnings the first time they are called.
Chuck
On Wed, Nov 21, 2012 at 7:45 PM,
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also:

In [10]: var([], ddof=1)
Out[10]: -0

Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.

<snip>

Chuck
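The division-by-zero reading checks out by hand (a sketch, assuming the textbook normalization var = ss / (n - ddof), where ss is the sum of squared deviations):

    # n = 0, ddof = 1: ss is an empty sum, so 0.0 / (0 - 1) is -0.0,
    # matching var([], ddof=1) above.
    print(0.0 / (0 - 1))   # -0.0

    # n = 1, ddof = 5: the single deviation is zero, so 0.0 / -4 is -0.0.
    print(0.0 / (1 - 5))   # -0.0

    # n = 2, ddof = 5: ss = 0.25 + 0.25 = 0.5, giving a negative
    # "variance"; the sqrt of that negative number is where std's nan
    # comes from.
    print(0.5 / (2 - 5))   # -0.16666666666666666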
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this: ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers, as long as we don't allow for missing values.

A quick check with np.ma looks correct except when delegating to numpy?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)

>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)

>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)

>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
Josef
<snip>
Chuck
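A minimal sketch of the ValueError policy described above (checked_var is a hypothetical helper, not a NumPy API):

    import numpy as np

    def checked_var(a, ddof=0):
        # Reject the degenerate cases up front instead of letting the
        # division produce nan or a negative "variance".
        a = np.asarray(a)
        if a.size - ddof <= 0:
            raise ValueError("ddof=%d too large for %d element(s)"
                             % (ddof, a.size))
        return np.var(a, ddof=ddof)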
On Wed, Nov 21, 2012 at 10:58 PM,
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this. ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers in this case, as long as we don't allow for missing values.
I think I prefer NaNs to an exception; they propagate more nicely to downstream functions.

I'm in favor of a policy instead of nans or wrong numbers by accident.
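A small illustration of that propagation argument (a sketch; the group data is made up):

    import numpy as np

    # With nan, one empty group degrades gracefully instead of aborting
    # the whole computation with an exception.
    groups = [np.array([1.0, 2.0]), np.array([])]
    means = np.array([g.mean() for g in groups])   # may warn once
    print(means)          # [ 1.5  nan]
    print(means * 2.0)    # [ 3.  nan] -- the nan flows downstream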
quick check with np.ma
looks correct except when delegating to numpy ?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)
>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)
>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
and cov:
>>> np.cov([1.],[3],bias=True, rowvar=False)   # looks fine
array([[ 0., 0.],
       [ 0., 0.]])
>>> np.cov([1.],[3],bias=False, rowvar=False)
array([[ nan, nan],
       [ nan, nan]])

>>> np.cov([[1.],[3]],bias=False, rowvar=True)
array([[ nan, nan],
       [ nan, nan]])

>>> np.cov([],[],bias=False, rowvar=False)   # should be nan
array([[-0., -0.],
       [-0., -0.]])
>>> np.cov([],[],bias=True, rowvar=False)
array([[ nan, nan],
       [ nan, nan]])
np.corrcoef seems to have nans in the right places in the examples I tried.

Josef
Josef
<snip>
Chuck
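The cov corner cases follow the same division pattern (a sketch, assuming the usual normalization: bias=True divides by N, bias=False by N - 1):

    import numpy as np

    N = 1                    # a single observation
    ss = np.float64(0.0)     # sum of cross-products of deviations
    print(ss / N)                       # 0.0 -> bias=True "looks fine"
    with np.errstate(invalid='ignore'):
        print(ss / np.float64(N - 1))   # nan (0/0) -> bias=False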
On Wed, 2012-11-21 at 22:58 -0500, josef.pktd@gmail.com wrote:
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this. ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers in this case, as long as we don't allow for missing values.
It seems to me that nan is the reasonable result for these operations (reduce-like operations that do not have an identity). Though actually reduce operations without an identity throw a ValueError (i.e. `np.minimum.reduce([])`), but then mean/std/var seem special enough to be different from other reduce operations (for example their result is always floating point).

As for usability, I think for example when plotting errorbars using std, it would be rather annoying to get a ValueError, so if anything the reduce machinery could give more special results for empty floating point reductions.

In any case the warning should be clearer, and for too-large ddofs I would say it should return nan plus a warning as well.

Sebastian
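The identity distinction is easy to see directly (a sketch; the exact ValueError message varies across NumPy versions):

    import numpy as np

    # minimum has no identity element, so an empty reduction raises.
    try:
        np.minimum.reduce([])
    except ValueError as exc:
        print(exc)   # "zero-size array to reduction operation ..."

    # add has the identity 0, so the empty sum is well defined.
    print(np.add.reduce([]))   # 0.0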
quick check with np.ma
looks correct except when delegating to numpy ?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)
>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)
>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
Josef
<snip>
Chuck
On Thu, Nov 22, 2012 at 7:14 AM, Sebastian Berg wrote:
On Wed, 2012-11-21 at 22:58 -0500, josef.pktd@gmail.com wrote:
On Wed, Nov 21, 2012 at 10:35 PM, Charles R Harris wrote:
On Wed, Nov 21, 2012 at 7:45 PM, wrote:
On Wed, Nov 21, 2012 at 9:22 PM, Olivier Delalleau wrote:
Current behavior looks sensible to me. I personally would prefer no warning but I think it makes sense to have one as it can be helpful to detect issues faster.
I agree that nan should be the correct answer. (I gave up trying to define a default for 0/0 in scipy.stats ttests.)
some funnier cases
>>> np.var([1], ddof=1)
0.0
This one is a nan in development.
>>> np.var([1], ddof=5)
-0
>>> np.var([1,2], ddof=5)
-0.16666666666666666
>>> np.std([1,2], ddof=5)
nan
These still do this. Also
In [10]: var([], ddof=1)
Out[10]: -0
Which suggests that the nan is pretty much an accidental byproduct of division by zero. I think it might make sense to have a definite policy for these corner cases.
It would also be consistent with the usual pattern to raise a ValueError on this. ddof too large, size too small. It wouldn't be the case that for some columns or rows we get valid answers in this case, as long as we don't allow for missing values.
It seems to me that nan is the reasonable result for these operations (reduce-like operations that do not have an identity). Though actually reduce operations without an identity throw a ValueError (i.e. `np.minimum.reduce([])`), but then mean/std/var seem special enough to be different from other reduce operations (for example their result is always floating point).

As for usability, I think for example when plotting errorbars using std, it would be rather annoying to get a ValueError, so if anything the reduce machinery could give more special results for empty floating point reductions.

In any case the warning should be clearer, and for too-large ddofs I would say it should return nan plus a warning as well.
Why don't operations on empty arrays return empty arrays? But this looks ok:
>>> (np.array([]) - np.array([]).mean()) / np.array([]).std()
array([], dtype=float64)
>>> (np.array([]) - np.array([]).mean()) / np.array([]).std(0)
array([], dtype=float64)

>>> (np.array([]) - np.array([]).mean(0)) / np.array([]).std(0)
array([], dtype=float64)
>>> (np.array([]) - np.array([]).mean(0)) / np.array([])
array([], dtype=float64)

>>> np.array([[]]) - np.expand_dims(np.array([[]]).mean(1),1)
array([], shape=(1, 0), dtype=float64)

>>> np.array([[]]) - np.expand_dims(np.array([]),1)
array([], shape=(0, 0), dtype=float64)

>>> np.array([]) - np.expand_dims(np.array([]),0)
array([], shape=(1, 0), dtype=float64)
(But I doubt I will rely on correct "calculations" with empty arrays in many cases.)

Josef
Sebastian
quick check with np.ma
looks correct except when delegating to numpy ?
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=5, axis=0)
>>> s
masked_array(data = [-- --], mask = [ True True], fill_value = 1e+20)
>>> s = np.ma.var(np.ma.masked_invalid([[1.,2],[1,np.nan]]), ddof=1, axis=0)
>>> s
masked_array(data = [0.0 --], mask = [False True], fill_value = 1e+20)
>>> s = np.ma.std([1,2], ddof=5)
>>> s
masked
>>> type(s)
>>> np.ma.var([1,2], ddof=5)
-0.16666666666666666
Josef
<snip>
Chuck
On Thu, 2012-11-22 at 16:05 +0100, Daπid wrote:
On Thu, Nov 22, 2012 at 3:54 PM, wrote:
Why don't operations on empty arrays return empty arrays?
Because functions like mean or std are expected to return a scalar. Functions that are piecewise can (and should) return an empty array, but not the mean.
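The distinction in a couple of lines (a sketch):

    import numpy as np

    e = np.array([])
    # An elementwise (piecewise) operation keeps the empty shape ...
    print((e - 0.0).shape)   # (0,)
    # ... while a full reduction must produce one scalar, hence nan
    # (possibly with a RuntimeWarning, depending on the version).
    print(e.mean())          # nan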
I agree, this makes sense, note that:

In [2]: a = np.empty((5,0))

In [3]: a.std(0)
Out[3]: array([], dtype=float64)

In [4]: a.std(1)
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #!/usr/bin/env python
Out[4]: array([ nan, nan, nan, nan, nan])

However you are reducing, and with reducing you expect exactly 1 scalar result (along that dimension).
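Spelled out (a sketch of the shape logic):

    import numpy as np

    a = np.empty((5, 0))
    # Reducing along axis 0 leaves one value per column; with 0 columns
    # the result is empty, shape (0,), and no nan is ever produced.
    print(a.std(0).shape)   # (0,)
    # Reducing along axis 1 leaves one value per row; 5 rows require 5
    # results, and each empty reduction yields nan (with a
    # RuntimeWarning, depending on the version).
    print(a.std(1))         # [ nan nan nan nan nan]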
On Thu, Nov 22, 2012 at 10:15 AM, Sebastian Berg wrote:
On Thu, 2012-11-22 at 16:05 +0100, Daπid wrote:
On Thu, Nov 22, 2012 at 3:54 PM, wrote:
Why don't operations on empty arrays return empty arrays?
Because functions like mean or std are expected to return a scalar. Functions that are piecewise can (and should) return an empty array, but not the mean.
I agree, this makes sense, note that:
In [2]: a = np.empty((5,0))
In [3]: a.std(0)
Out[3]: array([], dtype=float64)

In [4]: a.std(1)
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #!/usr/bin/env python
Out[4]: array([ nan, nan, nan, nan, nan])
However you are reducing, and with reducing you expect exactly 1 scalar result (along that dimension).
Ok, I see. We cannot have an empty 1-D array of shape (5,).

Josef
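(The shape arithmetic behind that, as a sketch: an array is empty only if some axis has length 0, and a shape-(5,) result necessarily holds 5 values.)

    import numpy as np

    print(np.empty((5,)).size)    # 5 -- never empty
    print(np.empty((5, 0)).size)  # 0 -- empty, zero in the shape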
participants (6)
- Charles R Harris
- Daπid
- josef.pktd@gmail.com
- Olivier Delalleau
- Sebastian Berg
- Skipper Seabold