[Numpy-discussion] Efficient way of binning points and applying functions to these groups

Eric Emsellem eric.emsellem at eso.org
Thu Dec 27 02:25:48 EST 2012

Thanks Ralf!

this module looks great in fact. I didn't know it existed; it is only 
available in Scipy 0.11.0 (I had to install from source, since no 
Ubuntu 12.04 binary is available). Too bad that the user-defined 
function only accepts a single array: if that function needs more 
input, you have to rely on a trick, basically duplicating the input 
coordinates and concatenating the extra input arrays you need.
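As an illustration, here is a sketch of one possible workaround (not necessarily the exact trick described above, and `weighted_mean`, `data1` and `data2` are made-up names): pass the point *indices* to binned_statistic as the values array, and look up as many side arrays as needed inside the statistic callable.

```python
import numpy as np
from scipy.stats import binned_statistic

np.random.seed(42)  # just to make the sketch reproducible
x = np.random.random(1000)
data1 = np.random.random(1000)   # hypothetical side arrays that the
data2 = np.random.random(1000)   # statistic needs to combine

def weighted_mean(idx):
    # The callable receives the binned "values", which here are indices
    # into the side arrays rather than the data itself
    idx = np.asarray(idx, dtype=int)
    if idx.size == 0:
        return np.nan
    return np.average(data1[idx], weights=data2[idx])

result, edges, binnum = binned_statistic(
    x, np.arange(len(x)), statistic=weighted_mean, bins=10)
```

This avoids copying or concatenating the data arrays themselves, at the cost of one extra indexing step per bin.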

But apart from the fact that the programme looks much cleaner now, I 
just tested it and it is rather SLOW in fact.

Since I have to repeat this 2 or 3 times (I actually have 3 functions 
to apply), it takes about 20 seconds for a full test, while with the 
changes I made (see below) it now takes 3 seconds or so [I am talking 
about the real code, not the example I give below].

So I managed to speed things up a bit by doing two things:
- keeping the first loop but replacing the second one with a loop ONLY 
over the bins which contain the right number of points;
- and, more importantly, not addressing the full array at each loop 
iteration but first selecting the right points (to reduce the size).

So, something like what is shown below.

It is still a factor of 2 slower than the IDL routine, and I have no 
clue why. I will analyse it further. The IDL routine has similar loops 
etc., so there is no reason for this.

If anybody has an idea... THANKS (using Cython is a bit too much on my 
side, being a low-profile Python developer; as for numba, I will have a look).

and thanks again for your help!


import numpy as np
x = np.random.random(1000)
y = np.random.random(1000)
data = np.random.random(1000)

# I now have a 2D grid, given here by 11x21 grid points (i.e. 10x20 bins):
nx = 11
ny = 21
lx = np.linspace(0, 1, nx)
ly = np.linspace(0, 1, ny)
gx, gy = np.meshgrid(lx, ly)

# So my set of 2D bins is (not needed in the solution I present,
# but just for clarity):
bins = np.dstack((gx.ravel(), gy.ravel()))[0]

# Now I want the list of points in each bin, and:
# - if the number of points in that bin is larger than 10, apply (dummy)
#   function func1 (see below)
# - if between 1 and 10, apply (dummy) function func2
# - if 0, do nothing
# with two dummy functions like (for example):
def func1(x): return x.mean()
def func2(x): return x.std()

h = np.histogram2d(x, y, bins=[lx, ly])[0]
digitX = np.digitize(x, lx)
digitY = np.digitize(y, ly)

# Create the output array, with -999 values to make sure I can see
# which ones were not filled in
result = np.zeros_like(h) - 999

for i in range(nx - 1):
    # First restrict to the points falling in x-column i, so the inner
    # selections work on a much smaller array
    selX = (digitX == i + 1)
    dataX = data[selX]
    digitYX = digitY[selX]
    # y-bin indices in this column with more than 10 points, and with 1-10
    selH10 = np.where(h[i] > 10)[0]
    selH0 = np.where((h[i] > 0) & (h[i] <= 10))[0]
    for j in selH10:
        result[i, j] = func1(dataX[digitYX == j + 1])
    for j in selH0:
        result[i, j] = func2(dataX[digitYX == j + 1])
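For comparison, the same three-way logic could also be folded into a single scipy.stats.binned_statistic_2d call, with one callable that dispatches on the bin population (a sketch only, assuming Scipy >= 0.11; the -999 sentinel for empty bins is my choice, mirroring the fill value above). As noted, calling a Python function per bin is exactly what makes this approach slow, so this buys compactness, not speed:

```python
import numpy as np
from scipy.stats import binned_statistic_2d

np.random.seed(0)  # just to make the sketch reproducible
x = np.random.random(1000)
y = np.random.random(1000)
data = np.random.random(1000)
lx = np.linspace(0, 1, 11)
ly = np.linspace(0, 1, 21)

def stat(values):
    # Dispatch on the number of points that fell into the bin
    values = np.asarray(values)
    if values.size > 10:
        return values.mean()   # func1
    elif values.size > 0:
        return values.std()    # func2
    return -999.0              # empty bin -> sentinel

result, xedges, yedges, binnum = binned_statistic_2d(
    x, y, data, statistic=stat, bins=[lx, ly])
```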

>> Hi!
>> I am looking for an efficient way of doing some simple binning of points
>> and then applying some functions to points within each bin.
> That's exactly what scipy.stats.binned_statistic does:
>
> binned_statistic uses np.digitize as well, but I'm not sure that in your
> code below digitize is the bottleneck - the nested for-loop looks like the
> more likely suspect.
