numpy.histogram not giving expected results
Hello, I'm trying to calculate a 1-d histogram of a distribution that contains mostly zeros, and I'm running into problems when the values to be histogrammed fall exactly on the bin boundaries. For example, this gives me the expected results (entering the exact bin values):
>>> data
array([ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.05, -0.05])
>>> bins_list = numpy.array([-0.1, -0.05, 0.0, 0.05, 0.1])
>>> (counts, edges) = numpy.histogram(data, bins=bins_list)
>>> counts
array([ 0,  1, 10,  1])
>>> edges
array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ])
but this does not (generating the bin values via numpy.arange):
>>> bins_arange = numpy.arange(-0.1, 0.101, 0.05)
>>> data
array([ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.05, -0.05])
>>> bins_arange
array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ])
>>> (counts, edges) = numpy.histogram(data, bins=bins_arange)
>>> counts
array([ 0,  1, 11,  0])
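One way to check whether rounding in bins_arange really is the culprit is to print the difference between the generated edges and the literal values at full precision. This is just a diagnostic sketch (the exact eps-level differences, if any, depend on your NumPy version and platform, so none are asserted here); it also confirms that the counts always account for every in-range data point:

```python
import numpy as np

# Edges generated two ways: literals vs. arange.
bins_list = np.array([-0.1, -0.05, 0.0, 0.05, 0.1])
bins_arange = np.arange(-0.1, 0.101, 0.05)

# Full precision reveals sub-eps differences that the default repr hides.
np.set_printoptions(precision=17)
print(bins_arange - bins_list)

# Whatever the edges are, every in-range data point lands in some bin,
# so the counts sum to the number of data values.
data = np.array([0.0] * 10 + [0.05, -0.05])
counts, edges = np.histogram(data, bins=bins_arange)
print(counts, counts.sum())
```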
I'm assuming this is due to slight rounding in the computed bins_arange values, as compared to the manually entered values in bins_list. What is the recommended way of getting the first set of results without having to enter all the values in the "bins" argument manually? The following also gives me unexpected results:
>>> data
array([ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.05, -0.05])
>>> (counts, edges) = numpy.histogram(data, range=(-0.1, 0.1), bins=4)
>>> counts
array([ 0,  1, 11,  0])
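One way to sidestep the range/bins form without typing the edges by hand is to build the edges explicitly with np.linspace and pass that array as bins, so that histogram uses exactly those values rather than edges it derives internally. A sketch (not necessarily the canonical recommendation, and the boundary counts may still vary with precision, so only the totals are relied on):

```python
import numpy as np

data = np.array([0.0] * 10 + [0.05, -0.05])

# Explicit edges: histogram uses exactly these values, so the boundary
# behavior is at least reproducible across calls.
edges = np.linspace(-0.1, 0.1, 5)
counts, returned_edges = np.histogram(data, bins=edges)
print(counts)

# With range/bins, histogram computes the edges internally, and the
# interior boundaries may differ from the literal values by an eps.
counts2, edges2 = np.histogram(data, range=(-0.1, 0.1), bins=4)
print(counts2)
```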
Thank you for any advice, Catherine
A few thoughts:

1) Don't use arange() for floating point numbers; use linspace().

2) histogram() is a floating point function, and you shouldn't expect exact results for floating point -- in particular, values exactly at the bin boundaries are likely to be "uncertain" -- not quite the right word, but you get the idea.

3) If you expect to have a lot of certain specific values -- say, integers, or zeros -- then you don't want your bin boundaries to fall exactly at those values; they should be between the expected values.

4) Remember that histogramming is inherently sensitive to bin position anyway -- if these small bin-boundary differences matter, then you may not be using the best approach.

-HTH,
-Chris
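Points 1) and 3) can be combined: generate the edges with linspace, and offset them so the expected data values fall in the middle of bins rather than on the boundaries. A minimal sketch (the -0.125/0.125 limits are an illustrative choice, not from the original post):

```python
import numpy as np

data = np.array([0.0] * 10 + [0.05, -0.05])

# Edges at -0.125, -0.075, -0.025, 0.025, 0.075, 0.125: every multiple
# of 0.05 now sits in the middle of a bin, far from any boundary, so an
# eps of rounding in the edges cannot move a value to a different bin.
bins = np.linspace(-0.125, 0.125, 6)
counts, edges = np.histogram(data, bins=bins)
print(counts)   # -> [ 0  1 10  1  0]
```

With the boundaries between the expected values, the result no longer depends on how the edges were generated.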
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959  voice
7600 Sand Point Way NE   (206) 526-6329  fax
Seattle, WA 98115        (206) 526-6317  main reception

Chris.Barker@noaa.gov
Hi Catherine,

I can't reproduce your issue with bins_list vs. bins_arange, but passing both range and number of bins to np.histogram does give the same strange behavior for me:

In [16]: data = np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.05, -0.05])
In [17]: bins_list = np.array([-0.1, -0.05, 0.0, 0.05, 0.1])
In [18]: np.histogram(data, bins=bins_list)
Out[18]: (array([ 0,  1, 10,  1]), array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ]))
In [19]: bins_arange = np.arange(-0.1, 0.101, 0.05)
In [20]: np.histogram(data, bins=bins_arange)
Out[20]: (array([ 0,  1, 10,  1]), array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ]))
In [21]: np.histogram(data, range=(-0.1, 0.1), bins=4)
Out[21]: (array([ 0,  1, 11,  0]), array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ]))
In [22]: np.version.version
Out[22]: '1.8.1'

Looks like the 0.05 value of data is being binned differently in the last case, but I'm not sure why either...

Mark
Looks like this could be a float32 vs. float64 problem:

In [19]: data32 = np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.05, -0.05], dtype=np.float32)
In [20]: data64 = np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.05, -0.05], dtype=np.float64)
In [21]: bins32 = np.arange(-0.1, 0.101, 0.05, dtype=np.float32)
In [22]: bins64 = np.arange(-0.1, 0.101, 0.05, dtype=np.float64)
In [23]: np.histogram(data32, bins32)
Out[23]: (array([ 0,  1, 10,  1]), array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ], dtype=float32))
In [24]: np.histogram(data32, bins64)
Out[24]: (array([ 1,  0, 10,  1]), array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ]))
In [25]: np.histogram(data64, bins32)
Out[25]: (array([ 0,  1, 11,  0]), array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ], dtype=float32))
In [26]: np.histogram(data64, bins64)
Out[26]: (array([ 0,  1, 10,  1]), array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ]))

I guess users should always be very careful when mixing floating point types, but should numpy prevent (or warn) the user from doing so in this case?
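The asymmetry in Out[24] is consistent with what happens when float32 data is promoted to float64 for comparison against float64 edges: the nearest float32 to 0.05 is slightly larger than the nearest float64 to 0.05, so after promotion the data value 0.05 sits just above the 0.05 edge (and -0.05 just below the -0.05 edge). A small sketch of the promotion effect:

```python
import numpy as np

# The closest float32 to 0.05 is a little larger than the closest
# float64 to 0.05, so promoting it to float64 overshoots the edge.
x32 = np.float32(0.05)
x64 = np.float64(0.05)
print(float(x32))   # 0.05000000074505806
print(float(x64))   # 0.05

# After promotion, the float32 data value lies strictly above the
# float64 bin edge at 0.05 (upper bin), and the negated value lies
# strictly below -0.05 (lower bin).
print(float(x32) > float(x64))            # True
print(float(np.float32(-0.05)) < -0.05)   # True
```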
On Wed, Jul 2, 2014 at 7:57 AM, Mark Szepieniec <mszepien@gmail.com> wrote:
Looks like this could be a float32 vs. float64 problem:
that would explain it.
I guess users should always be very careful when mixing floating point types, but should numpy prevent (or warn) the user from doing so in this case?
I don't think so -- this "uncertainty" is very much the nature of histogramming, particularly with floating point values -- you should expect to get different results with different data precisions. As you should for ANY floating point computation.

-Chris
we recently fixed a float32/float64 issue in histogram:
https://github.com/numpy/numpy/issues/4799

I think it boils down to the use of round() in histogram, which is not so great in Python as it's based on decimals, not significant figures (so it does nothing for float32 values > 1e7). Though this one seems different, as it still occurs in git master.
On Wed, Jul 2, 2014 at 10:36 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:

we recently fixed a float32/float64 issue in histogram.
It's a good idea to keep the edges in the same dtype as the input data; it will make for fewer surprises, but I'm not sure that it's necessarily any more "correct". A value within an eps of a bin edge could arbitrarily end up on either side -- that's simply the nature of floating point.
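The "either side" point can be made concrete with np.nextafter: two values one ulp apart, straddling an edge, land in different bins. A sketch (with an explicit edge array, the comparison against the edges is exact, so the flip is deterministic here):

```python
import numpy as np

edges = np.array([0.0, 0.05, 0.1])

# The representable doubles immediately below and above 0.05.
just_below = np.nextafter(0.05, 0.0)
just_above = np.nextafter(0.05, 1.0)

# One ulp of difference flips which bin the value falls in.
below_counts, _ = np.histogram([just_below], bins=edges)
above_counts, _ = np.histogram([just_above], bins=edges)
print(below_counts)  # [1 0]
print(above_counts)  # [0 1]
```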
I think it boils down to the use of round() in histogram, which is not so great in Python as it's based on decimals, not significant figures (so it does nothing for float32 values > 1e7).
Using decimals rather than sig-figs is a problem regardless of precision -- and isn't that the same problem with C libmath round()?

-CHB
C round() just rounds to the nearest integer, and the result is still a float. numpy/python is different: it implements rounding to a number of decimals as round(d * 10**decimals) / 10**decimals.
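The difference can be sketched like this (illustrative only; the scale-round-unscale scheme is the documented behavior of np.round for a nonzero decimals argument):

```python
import numpy as np

x = 123.456
decimals = 2

# C-style round: nearest integer, result still a float.
print(np.round(x))   # 123.0

# Rounding to 2 decimals is built on scaling: round(x * 100) / 100.
# The scaled intermediate and the final division are both subject to
# the usual floating point representation error.
scaled = np.round(x * 10**decimals) / 10**decimals
print(scaled)

# np.round with a decimals argument follows the same scheme, so the
# two agree up to floating point error.
print(np.isclose(scaled, np.round(x, decimals)))
```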
participants (4)
- Chris Barker
- Julian Taylor
- Mark Szepieniec
- Moroney, Catherine M (398D)