[Numpy-discussion] histogram: sum up values in each bin

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Aug 27 23:37:02 EDT 2009


On Thu, Aug 27, 2009 at 1:27 PM, <josef.pktd at gmail.com> wrote:
> On Thu, Aug 27, 2009 at 12:49 PM, Tim
> Michelsen <timmichelsen at gmx-topmail.de> wrote:
>>> Tim, do you mean, that you want to apply other functions, e.g. mean or
>>> variance, to the original values but calculated per bin?
>> Sorry that I forgot to add this. Shame.
>>
>> I would like to apply these mathematical functions on the original values
>> stacked in the respective bins.
>>
>> For instance:
>>
>> The sample data measures the weight of an animal.
>>
>> 1) histogram gives a count of how many values are in each bin.
>>
>> I would like to calculate the average weight of all animals
>> sorted into bin1, bin2, etc.
>>
>> This is also useful where you have a time component.
>>
>> In spreadsheets I would use '=' to reference the original data and then
>> either sum it up or count it per class.
>>
>> I hope this is somehow understandable.
>
> Yes, it is quite a common use case for descriptive statistics, and I'm
> starting to collect different ways of doing it.
>
> In your case, Vincent's way is the easiest.
>
> If you need to be faster, or you want to apply the same classification
> also to other variables (e.g. the size of the animal), then creating a
> label array would be a more flexible solution.
>
> There was a similar thread recently on the scipy-user list for sorted
> arrays: "How to average different pieces of an array?"
>
> Josef
>
>>
>> Thanks,
>> Timmie
>
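
The scipy-user thread mentioned above is not reproduced here, but for data that
is already sorted, one common approach to averaging contiguous pieces uses
np.add.reduceat. The sketch below is only an illustration and assumes no bin is
empty (reduceat would return a spurious value for an empty bin):

# sketch (not from that thread): per-bin means of already-sorted data via reduceat
import numpy as np

xs = np.sort(np.random.normal(size=100))
counts, edges = np.histogram(xs)
starts = np.searchsorted(xs, edges[:-1], 'left')  # index where each bin starts
sums = np.add.reduceat(xs, starts)                # sums over contiguous pieces
print(sums * 1.0 / counts)                        # per-bin means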

Here is a version where bincount and histogram produce the same
results for mean and variance per bin, as long as no bin is empty. If a
bin is empty, then either some nans or some small arbitrary numbers are
returned.

Josef

# incompletely tested: if a bin has zero elements, nans or arbitrary values show up in the mean/variance
import numpy as np

x = np.random.normal(size=100) #+ 1e5 # + 1e8 to compare precision
c, b = np.histogram(x)   # bin counts and bin edges


sortind = np.argsort(x)
reverse_sortind = np.argsort(sortind)   # indices that undo the sort
xsorted = x[sortind]
bind = np.searchsorted(xsorted, b, 'right')   # positions of the bin edges in the sorted data

# construct label array (bin index for each observation)
ind2 = np.zeros(x.shape, int)
ind2[bind[1:-1]] = 1   # mark the position where each interior bin edge of b falls
ind = ind2.cumsum()    # bin label for each element of the sorted data

labels = ind[reverse_sortind]   # map the bin labels back to the original order

print '\nmean'
# per-bin means computed from the sorted data
means = np.bincount(ind, xsorted)*1.0/np.bincount(ind)
print means

# the same means from the original (unsorted) data, using the label array
count = np.bincount(labels)
means = np.bincount(labels, x)*1.0/count
print means

# compare means with histogram: weights=x gives the per-bin sums directly
countsPerBin = np.histogram(x)[0]
sumsPerBin = np.histogram(x, weights=x)[0]
averagePerBin = sumsPerBin / countsPerBin
print averagePerBin


print '\nvariance'
meanarr = means[labels]   # mean of its own bin, for each observation
var = np.bincount(labels, (x - meanarr)**2)/count   # per-bin variance
print var

# with histogram: variance per bin as E[x**2] - (E[x])**2
squaresums_perbin = np.histogram(x, weights=x**2)[0]
var_perbin = squaresums_perbin*1.0 / countsPerBin - averagePerBin**2
print var_perbin
print np.array(var) - np.array(var_perbin)   # difference between the two variance calculations (should be ~0)
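
The same label array can be reused for other variables measured on the same
cases, as suggested above (e.g. the size of the animal). x2 below is a made-up
second variable, only to illustrate the idea:

# hypothetical second variable, reusing labels and count from above
x2 = np.abs(x) + np.random.normal(size=x.shape)
print(np.bincount(labels, x2) * 1.0 / count)   # per-bin means of x2, binned by x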


