[Numpy-discussion] np.histogram: upper range bin
Christopher Barker
Chris.Barker at noaa.gov
Mon Jun 13 11:45:52 EDT 2011
Peter Butterworth wrote:
> Consistent bin width is important for my applications. With floating
> point numbers I usually shift my bins by a small offset to ensure
> values at bin edges always fall in the correct bin.
> With the current np.histogram behavior you _silently_ get a wrong
> count in the top bin if a value falls on the upper bin limit.
> Incidentally this happens by default with integers. ex: x=range(4);
> np.histogram(x)
Again, the trick here is that histogram really is for floating point
numbers, not integers.
> Likely I will test for the following condition when using
> np.histogram(x):
> max(x) == top bin limit
From the docstring:
"""
range : (float, float), optional
    The lower and upper range of the bins. If not provided, range
    is simply ``(a.min(), a.max())``.
"""
So, in fact, if you don't specify the range, the top of the largest bin
will ALWAYS be the max value in your data. You seem to be advocating
that that value not be included in any bins -- which would surely not be
a good option.
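A minimal sketch of that default behavior, using a small integer array like the one in the question. It assumes only `import numpy as np`; the variable names are illustrative:

```python
import numpy as np

x = np.arange(4)                         # 0, 1, 2, 3
counts, edges = np.histogram(x, bins=3)  # no explicit range given

# With no range argument, the edges span x.min() .. x.max().
print(edges[0] == x.min(), edges[-1] == x.max())   # True True

# All bins are half-open [lo, hi) except the last one, which is
# closed [lo, hi], so the maximum lands in the top bin.
print(counts)                  # [1 1 2]
print(counts.sum() == x.size)  # True: no value is dropped
```

The top bin holds both 2 and 3 because its upper edge is the data maximum, which is exactly the case being discussed.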
> With the current np.histogram behavior you _silently_ get a wrong
> count in the top bin if a value falls on the upper bin limit.
It is not a wrong count -- it is absolutely correct. I think the issue
here is that you are binning integers, which is really a categorical
binning, not a value-based one. If you want categorical binning, there
are other ways to do that.
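As a sketch of what categorical counting could look like: `np.bincount` handles non-negative integers, and `np.unique` with `return_counts=True` (an assumption here: it requires a NumPy version recent enough to support that keyword) handles arbitrary values. The sample data is made up for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 4, 2])

# np.bincount: index i of the result holds the number of times the
# non-negative integer i occurs, for i in 0 .. x.max().
per_value = np.bincount(x)
print(per_value)        # [0 1 2 1 2]

# np.unique: returns the distinct values and how often each occurs;
# this also works for negative integers.
values, counts = np.unique(x, return_counts=True)
print(values, counts)   # [1 2 3 4] [1 2 1 2]
```

Neither call involves bin edges at all, so there is no question of which bin a value on an edge belongs to.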
I suppose having np.histogram do something different if the input is
integers might make sense, but that's probably not worth it.
By the way, what would you have it do if you had integers, but a large
number of them, spread over a large range of values:
In [68]: x = np.random.randint(0, 100000000, size=(100,))
In [70]: np.histogram(x)
Out[70]:
(array([10, 10, 12, 16, 12, 9, 11, 4, 9, 7]),
array([ 712131. , 10437707.4 , 20163283.8 ,
29888860.2 , 39614436.6 , 49340013. ,
59065589.40000001, 68791165.8 , 78516742.2 ,
88242318.60000001, 97967895. ]))
> I guess it is better to always specify the correct range, but wouldn't
> it be preferable if the function provided a
> warning when this case occurs ?
>
>
> ---
> Re: [Numpy-discussion] np.histogram: upper range bin
> Christopher Barker
> Thu, 02 Jun 2011 09:19:16 -0700
>
> Peter Butterworth wrote:
>> in np.histogram the top-most bin edge is inclusive of the upper range
>> limit. As documented in the docstring (see below) this is actually the
>> expected behavior, but this can lead to some weird enough results:
>>
>> In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3)
>> Out[72]: (array([1, 1, 2]), array([ 1., 2., 3., 4.]))
>>
>> Is there any way round this or an alternative implementation without
>> this issue ?
>
> The way around it is what you've identified -- making sure your bins are
> right. But I think the current behavior is the way it "should" be. It
> keeps folks from inadvertently losing stuff off the end -- the lower
> end is inclusive, so the upper end should be too. In the middle bins,
> one has to make an arbitrary cut-off, and put the values on the "line"
> somewhere.
>
>
> One thing to keep in mind is that, in general, histogram is designed for
> floating point numbers, not just integers -- counting integers can be
> accomplished other ways, if that's what you really want (see
> np.bincount). But back to your example:
>
> > In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3)
>
> Why do you want only 3 bins here? Using 4 gives you what you want. If
> you want more control, then it seems you really want to know how many of
> each of the values 1,2,3,4 there are. That means 4 bins, each
> *centered* on the integers, so you might do:
>
> In [8]: np.histogram(x, bins=4, range=(0.5, 4.5))
> Out[8]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5]))
>
> or, if you want to be more explicit:
>
> In [14]: np.histogram(x, bins=np.linspace(0.5, 4.5, 5))
> Out[14]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5]))
>
>
> HTH,
>
> -Chris
>
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov