Efficient way of binning points and applying functions to these groups
Hi! I am looking for an efficient way of doing some simple binning of points and then applying some functions to points within each bin. I have tried several ways, including crude looping over the indices, or using digitize (see below) but I cannot manage to get it as efficient as I need it to be. I have a comparison with a similar (although complex) code in idl, and thought I would ask the forum. In idl there is a way to "invert" an histogram and get a reverse set of indices (via the histogram function) which seems to do the trick (or maybe it is faster for another reason). Below I provide a (dummy) example of what I wish to achieve. Any hint on how to do this EFFICIENTLY using numpy is most welcome. I need to speed things up quite a bit (at the moment, the version I have, see below, is 10 times slower than the more complex idl routine..., I must be doing something wrong!). thanks!! Eric ======================================================== # I have a random set of data points in 2D with coordinates x,y : import numpy as np x = np.random.random(1000) y = np.random.random(1000) # I have now a 2D grid given by let's say 10x10 grid points: nx = 11 ny = 21 lx = linspace(0,1,nx) ly = linspace(0,1,ny) gx, gy = np.meshgrid(lx, ly) # So my set of 2D bins are (not needed in the solution I present but just for clarity) bins = np.dstack((gx.ravel(), gy.ravel()))[0] # Now I want to have the list of points in each bin and # if the number of points in that bin is larger than 10, apply (dummy) function func1 (see below) # If less than 10, apply (dummy) function func2 so (dum?) # if 0, do nothing # for two dummy functions like (for example): def func1(x) : return x.mean() def func2(x) : return x.std() # One solution would be to use digitize in 1D and histogram in 2D (don't need gx, gy for this one): h = histogram2d(x, y, bins=[lx, ly])[0] digitX = np.digitize(x, lx) digitY = np.digitize(y, ly) # create the output array, with -999 values to make sure I see which ones are not filled in result = np.zeros_like(h) - 999 for i in range(nx-1) : for j in range(ny-1) : selectionofpoints = (digitX == i+1) & (digitY == j+1) if h[i,j] > 10 : result[i,j] = func1(x[selectionofpoints]) elif h[i,j] > 0 : result[i,j] = func2(x[selectionofpoints])
This looks like the perfect work for cython. It it's great opp optimizing loops. Another option is the new Numba, an automatic compiler. David. El 26/12/2012 10:09, "Eric Emsellem" <eric.emsellem@eso.org> escribió:
Hi!
I am looking for an efficient way of doing some simple binning of points and then applying some functions to points within each bin.
I have tried several ways, including crude looping over the indices, or using digitize (see below) but I cannot manage to get it as efficient as I need it to be. I have a comparison with a similar (although complex) code in idl, and thought I would ask the forum. In idl there is a way to "invert" an histogram and get a reverse set of indices (via the histogram function) which seems to do the trick (or maybe it is faster for another reason).
Below I provide a (dummy) example of what I wish to achieve. Any hint on how to do this EFFICIENTLY using numpy is most welcome. I need to speed things up quite a bit (at the moment, the version I have, see below, is 10 times slower than the more complex idl routine..., I must be doing something wrong!).
thanks!! Eric ======================================================== # I have a random set of data points in 2D with coordinates x,y : import numpy as np x = np.random.random(1000) y = np.random.random(1000)
# I have now a 2D grid given by let's say 10x10 grid points: nx = 11 ny = 21 lx = linspace(0,1,nx) ly = linspace(0,1,ny) gx, gy = np.meshgrid(lx, ly)
# So my set of 2D bins are (not needed in the solution I present but just for clarity) bins = np.dstack((gx.ravel(), gy.ravel()))[0]
# Now I want to have the list of points in each bin and # if the number of points in that bin is larger than 10, apply (dummy) function func1 (see below) # If less than 10, apply (dummy) function func2 so (dum?) # if 0, do nothing # for two dummy functions like (for example): def func1(x) : return x.mean()
def func2(x) : return x.std()
# One solution would be to use digitize in 1D and histogram in 2D (don't need gx, gy for this one):
h = histogram2d(x, y, bins=[lx, ly])[0]
digitX = np.digitize(x, lx) digitY = np.digitize(y, ly)
# create the output array, with -999 values to make sure I see which ones are not filled in result = np.zeros_like(h) - 999
for i in range(nx-1) : for j in range(ny-1) : selectionofpoints = (digitX == i+1) & (digitY == j+1) if h[i,j] > 10 : result[i,j] = func1(x[selectionofpoints]) elif h[i,j] > 0 : result[i,j] = func2(x[selectionofpoints]) _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Wed, Dec 26, 2012 at 10:09 AM, Eric Emsellem <eric.emsellem@eso.org>wrote:
Hi!
I am looking for an efficient way of doing some simple binning of points and then applying some functions to points within each bin.
That's exactly what scipy.stats.binned_statistic does: http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.binned_s... binned_statistic uses np.digitize as well, but I'm not sure that in your code below digitize is the bottleneck - the nested for-loop looks like the more likely suspect. Ralf
I have tried several ways, including crude looping over the indices, or using digitize (see below) but I cannot manage to get it as efficient as I need it to be. I have a comparison with a similar (although complex) code in idl, and thought I would ask the forum. In idl there is a way to "invert" an histogram and get a reverse set of indices (via the histogram function) which seems to do the trick (or maybe it is faster for another reason).
Below I provide a (dummy) example of what I wish to achieve. Any hint on how to do this EFFICIENTLY using numpy is most welcome. I need to speed things up quite a bit (at the moment, the version I have, see below, is 10 times slower than the more complex idl routine..., I must be doing something wrong!).
thanks!! Eric ======================================================== # I have a random set of data points in 2D with coordinates x,y : import numpy as np x = np.random.random(1000) y = np.random.random(1000)
# I have now a 2D grid given by let's say 10x10 grid points: nx = 11 ny = 21 lx = linspace(0,1,nx) ly = linspace(0,1,ny) gx, gy = np.meshgrid(lx, ly)
# So my set of 2D bins are (not needed in the solution I present but just for clarity) bins = np.dstack((gx.ravel(), gy.ravel()))[0]
# Now I want to have the list of points in each bin and # if the number of points in that bin is larger than 10, apply (dummy) function func1 (see below) # If less than 10, apply (dummy) function func2 so (dum?) # if 0, do nothing # for two dummy functions like (for example): def func1(x) : return x.mean()
def func2(x) : return x.std()
# One solution would be to use digitize in 1D and histogram in 2D (don't need gx, gy for this one):
h = histogram2d(x, y, bins=[lx, ly])[0]
digitX = np.digitize(x, lx) digitY = np.digitize(y, ly)
# create the output array, with -999 values to make sure I see which ones are not filled in result = np.zeros_like(h) - 999
for i in range(nx-1) : for j in range(ny-1) : selectionofpoints = (digitX == i+1) & (digitY == j+1) if h[i,j] > 10 : result[i,j] = func1(x[selectionofpoints]) elif h[i,j] > 0 : result[i,j] = func2(x[selectionofpoints]) _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (3)
-
Daπid -
Eric Emsellem -
Ralf Gommers