[Numpy-discussion] Initial implementation of histogram_discrete()

Sat Nov 14 10:57:51 EST 2009

On Sat, Nov 14, 2009 at 6:40 AM,  <josef.pktd at gmail.com> wrote:
> On Sat, Nov 14, 2009 at 7:10 AM,  <josef.pktd at gmail.com> wrote:
>> On Sat, Nov 14, 2009 at 6:53 AM, Priit Laes <plaes at plaes.org> wrote:
>>> One fine day, Fri, 2009-11-13 at 13:36, Ernest Adrogué wrote:
>>>> 13/11/09 @ 09:41 (+0200), thus spake Priit Laes:
>>>> > Does anyone have a scenario where one would actually have both negative
>>>> > and positive numbers (integers) in the list?
>>>>
>>>> Yes: when you have a random variable that is the difference
>>>> of two (discrete) random variables. For example, if you measure
>>>> the difference in number of days off per week because of sickness
>>>> between two groups of people, you would end up with a discrete
>>>> variable with both positive and negative integers.
>>>>
>>>> > So, how about numpy.histogram_discrete() that returns data the way
>>>> > histogram() does: a list containing histogram values (i.e. counts) and
>>>> > a list of sorted items from min(input) to max(input)?
>>>>
>>>> In my humble opinion, it would be nice.
>>> \o/
>>>
>>> I have pushed the preliminary version to:
>>> http://github.com/plaes/numpy/commits/histogram_discrete
>>>
>>> It can currently handle datasets with negative items and weights. I'm
>>> also planning to add an optional range argument to the function, but I
>>> first need to figure out how to parse range=(min, max) using the C
>>> API... ;)
>>>
>>> numpy.histogram_discrete() returns a list containing the histogram values
>>> and the bins (hopefully this is the right definition)
>>>
>>> hist, bins = numpy.histogram_discrete(data)
>>>
>>> Example:
>>> In [1]: import numpy
>>> In [2]: data = numpy.random.poisson(3, 300)
>>> In [3]: numpy.histogram_discrete(data)
>>> Out[3]:
>>> [array([15, 50, 72, 59, 52, 34,  8,  7,  3]),
>>>  array([0, 1, 2, 3, 4, 5, 6, 7, 8])]
>>> In [4]:
>>> In [5]: data = [-1, 5]
>>> In [6]: numpy.histogram_discrete(data, weights=[2, 0])
>>> Out[6]:
>>> [array([ 2.,  0.,  0.,  0.,  0.,  0.,  0.]),
>>>  array([-1,  0,  1,  2,  3,  4,  5])]
>>
>>
>> Sorry, I still don't see much reason to do this in C:
>>
>>>>> data = [-1, 5]
>>>>> c=np.bincount(data-np.min(data),weights=[2,0])
>>>>> x=np.arange(np.min(data),np.min(data)+len(c))
>>>>> c,x
>> (array([ 2.,  0.,  0.,  0.,  0.,  0.,  0.]), array([-1,  0,  1,  2,  3,  4,  5]))
>>>>> data = [11,5]
>>>>> np.bincount(data,weights=[2,0])
>> array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.])
>>>>> np.arange(max(data)+1)
>> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>>>> c=np.bincount(data-np.min(data),weights=[2,0])
>>>>> x=np.arange(np.min(data),np.min(data)+len(c))
>>>>> c,x
>> (array([ 0.,  0.,  0.,  0.,  0.,  0.,  2.]), array([ 5,  6,  7,  8,  9, 10, 11]))
>>
>> Josef
>
> I find "histogram" in the name misleading, because it suggests
> binning, not counting each distinct value individually.
> In some cases, I also remove the zero counts:
>
>>>> data = [11,5,20]
>>>> c=np.bincount(data-np.min(data),weights=[2,1,0])
>>>> x=np.arange(np.min(data),np.min(data)+len(c))
>>>> c
> array([ 1.,  0.,  0.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,
>        0.,  0.,  0.])
>>>> x
> array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
>>>> cind = np.nonzero(c)
>>>> cs = c[cind]
>>>> xs = x[cind]
>>>> cs
> array([ 1.,  2.])
>>>> xs
> array([ 5, 11])
>
> Josef
>>
>>
>>>
>>> Priit :)
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>
>
Hi,
My concern is to avoid some of the issues that arose with changing
histogram.  So let's get it right from the start.

Basically my question is why do we need yet another histogram function?

What is the difference between your histogram_discrete and histogram
or bincount?

Is it just that bincount does not count negative numbers?
If so, then I would strongly argue that this is insufficient reason for
creating a new function. Rather, you need to provide a suitable patch to fix
bincount or replace bincount with a better version.
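
For comparison, here is a minimal sketch of what a pure-Python wrapper
around bincount might look like, following the offset trick from Josef's
sessions quoted above. The name discrete_counts and its drop_zeros flag
are hypothetical, made up for illustration; this is not the code in the
histogram_discrete branch and not an existing NumPy function.

```python
import numpy as np


def discrete_counts(data, weights=None, drop_zeros=False):
    """Count (optionally weighted) occurrences of each integer in data.

    Hypothetical helper for illustration only; not part of NumPy.
    Returns (counts, values), where values runs from min(data) to max(data).
    """
    data = np.asarray(data).ravel()
    if data.size == 0:
        raise ValueError("data must not be empty")
    if not np.issubdtype(data.dtype, np.integer):
        raise TypeError("data must contain integers")

    lo = data.min()
    # bincount only accepts non-negative integers, so shift the data
    # so that the smallest value lands on index 0.
    counts = np.bincount(data - lo, weights=weights)
    values = np.arange(lo, lo + len(counts))

    if drop_zeros:
        # Keep only the entries with a non-zero (weighted) count.
        keep = np.nonzero(counts)
        counts, values = counts[keep], values[keep]
    return counts, values


# The two weighted examples from the quoted sessions.
c1, x1 = discrete_counts([-1, 5], weights=[2, 0])
c2, x2 = discrete_counts([11, 5, 20], weights=[2, 1, 0], drop_zeros=True)
```

Note that with drop_zeros=True a value whose weights sum to zero (20 in
the second call) disappears from the result even though it occurs in the
data, which matches the behaviour of the nonzero() filtering shown above.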

Thanks
Bruce