[Numpy-discussion] Initial implementation of histogram_discrete()

Sat Nov 14 07:40:49 EST 2009

On Sat, Nov 14, 2009 at 7:10 AM,  <josef.pktd at gmail.com> wrote:
> On Sat, Nov 14, 2009 at 6:53 AM, Priit Laes <plaes at plaes.org> wrote:
>> One fine day, Fri, 2009-11-13 at 13:36, Ernest Adrogué wrote:
>>> 13/11/09 @ 09:41 (+0200), thus spake Priit Laes:
>>> > Does anyone have a scenario where one would actually have both negative
>>> > and positive numbers (integers) in the list?
>>>
>>> Yes: when you have a random variable that is the difference
>>> of two (discrete) random variables. For example, if you measure
>>> the difference in number of days off per week because of sickness
>>> between two groups of people, you would end up with a discrete
>>> variable with both positive and negative integers.
>>>
>>> > So, how about numpy.histogram_discrete() that returns data the way
>>> > histogram() does: a list containing histogram values (ie counts) and
>>> > list of sorted items from min(input)...max(input). ?
>>>
>>> In my humble opinion, it would be nice.
>> \o/
>>
>> I have pushed the preliminary version to:
>> http://github.com/plaes/numpy/commits/histogram_discrete
>>
>> It can currently handle datasets with negative items and weights. I'm
>> also planning to add an optional range argument to the function, but I
>> first need to figure out how to parse range=(min, max) using the C
>> API... ;)
>>
>> numpy.histogram_discrete() returns a list containing the histogram
>> values and the bins (hopefully this is the right definition):
>>
>> hist, bins = numpy.histogram_discrete(data)
>>
>> Example:
>> In [1]: import numpy
>> In [2]: data = numpy.random.poisson(3, 300)
>> In [3]: numpy.histogram_discrete(data)
>> Out[3]:
>> [array([15, 50, 72, 59, 52, 34,  8,  7,  3]),
>>  array([0, 1, 2, 3, 4, 5, 6, 7, 8])]
>> In [4]:
>> In [5]: data = [-1, 5]
>> In [6]: numpy.histogram_discrete(data, weights=[2, 0])
>> Out[6]:
>> [array([ 2.,  0.,  0.,  0.,  0.,  0.,  0.]),
>>  array([-1,  0,  1,  2,  3,  4,  5])]
>
>
> Sorry, I still don't see much reason to do this in C:
>
>>>> data = [-1, 5]
>>>> c=np.bincount(data-np.min(data),weights=[2,0])
>>>> x=np.arange(np.min(data),np.min(data)+len(c))
>>>> c,x
> (array([ 2.,  0.,  0.,  0.,  0.,  0.,  0.]), array([-1,  0,  1,  2,  3,  4,  5]))
>>>> data = [11,5]
>>>> np.bincount(data,weights=[2,0])
> array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.])
>>>> np.arange(max(data)+1)
> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>>> c=np.bincount(data-np.min(data),weights=[2,0])
>>>> x=np.arange(np.min(data),np.min(data)+len(c))
>>>> c,x
> (array([ 0.,  0.,  0.,  0.,  0.,  0.,  2.]), array([ 5,  6,  7,  8,  9, 10, 11]))
>
> Josef
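
For reference, the shift-by-minimum trick quoted above can be wrapped in a
small pure-Python helper. This is only a sketch: the name discrete_counts
is made up for illustration, it is not the function in Priit's branch, and
it assumes integer input.

```python
import numpy as np

def discrete_counts(data, weights=None):
    """Hypothetical helper (not part of NumPy): count every integer
    from min(data) to max(data), including values that never occur."""
    data = np.asarray(data)
    if data.size == 0:
        # Nothing to count; return empty counts and empty values.
        return np.zeros(0), np.zeros(0, dtype=int)
    lo = data.min()
    # np.bincount only accepts non-negative integers, so shift by the minimum.
    counts = np.bincount(data - lo, weights=weights)
    # Undo the shift to recover the value each count belongs to.
    values = np.arange(lo, lo + len(counts))
    return counts, values

counts, values = discrete_counts([-1, 5], weights=[2, 0])
print(counts)  # [2. 0. 0. 0. 0. 0. 0.]
print(values)  # [-1  0  1  2  3  4  5]
```

The returned pair is in the same (counts, values) order as the session above.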

I find "histogram" in the name misleading, because it suggests
binning rather than counting each distinct value individually.
In some cases, I also remove the zero counts:

>>> data = [11,5,20]
>>> c=np.bincount(data-np.min(data),weights=[2,1,0])
>>> x=np.arange(np.min(data),np.min(data)+len(c))
>>> c
array([ 1.,  0.,  0.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.])
>>> x
array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
>>> cind = np.nonzero(c)
>>> cs = c[cind]
>>> xs = x[cind]
>>> cs
array([ 1.,  2.])
>>> xs
array([ 5, 11])
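
If the values are spread far apart, the dense min..max range above can get
large. A possible alternative, sketched here under the assumption that
only the values actually present matter, is to let np.unique find the
distinct values and run np.bincount on the inverse indices. The name
sparse_counts is hypothetical.

```python
import numpy as np

def sparse_counts(data, weights=None):
    """Hypothetical helper: weighted counts for the distinct values only,
    dropping entries whose total count (or weight) is zero."""
    data = np.asarray(data).ravel()
    # values: sorted distinct items; inverse: index of each item in values.
    values, inverse = np.unique(data, return_inverse=True)
    counts = np.bincount(inverse, weights=weights, minlength=len(values))
    # A value can still end up with a zero total when all its weights are 0.
    keep = counts != 0
    return counts[keep], values[keep]

cs, xs = sparse_counts([11, 5, 20], weights=[2, 1, 0])
print(cs)  # [1. 2.]
print(xs)  # [ 5 11]
```

This gives the same cs and xs as the nonzero() filtering above without
building the intermediate array covering 5..20.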

Josef
>
>
>>
>> Priit :)