Re: [Numpy-discussion] Efficient way of binning points and applying functions to these groups
Thanks Ralf! This module looks great, in fact. I didn't know it existed, and it is only available in SciPy 0.11.0 (I had to install from source, since no Ubuntu 12.04 binary is available). Too bad that the user-defined function accepts only a single array: if the function needs more input, you have to rely on a trick, basically passing the input coordinates along and concatenating the extra input arrays you need.

Apart from the fact that the program looks much cleaner now, I just tested it and it is rather SLOW, in fact. Since I have to repeat this 2 or 3 times (I actually have 3 functions to apply), a full test takes about 20 seconds, while with the changes I made (see below) it now takes 3 seconds or so [I am talking about the real code, not the example I give below]. So I managed to speed things up a bit by doing two things:

- keeping the first loop, but replacing the second one with a loop ONLY over the bins which contain the right number of points;
- and, more importantly, not addressing the full array at each loop iteration, but first selecting the right points (to reduce the size).

So, something as shown below. It is still a factor of 2 slower than the IDL routine and I have no clue why; I will analyse it further. The IDL routine has similar loops etc., so there is no reason for this. If anybody has an idea... THANKS! (Using Cython is a bit too much on my side, being a low-profile Python developer. As for numba, I will have a look.) And thanks again for your help!
Eric

=======================================================

import numpy as np

x = np.random.random(1000)
y = np.random.random(1000)
data = np.random.random(1000)

# I now have a 2D grid given by, let's say, 10x20 cells:
nx = 11
ny = 21
lx = np.linspace(0, 1, nx)
ly = np.linspace(0, 1, ny)
gx, gy = np.meshgrid(lx, ly)

# So my set of 2D bins is (not needed in the solution I present, just for clarity):
bins = np.dstack((gx.ravel(), gy.ravel()))[0]

# Now I want the list of points in each bin and:
# - if the number of points in that bin is larger than 10, apply (dummy) func1
# - if it is between 1 and 10, apply (dummy) func2
# - if 0, do nothing
# for two dummy functions like (for example):
def func1(x):
    return x.mean()

def func2(x):
    return x.std()

h = np.histogram2d(x, y, bins=[lx, ly])[0]
digitX = np.digitize(x, lx)
digitY = np.digitize(y, ly)

# Create the output array, filled with -999 to make sure I see which bins are not filled in
result = np.zeros_like(h) - 999

for i in range(nx - 1):
    # First restrict the data to the points falling in x-bin i (reduces the size)
    selX = (digitX == i + 1)
    dataX = data[selX]
    digitYX = digitY[selX]
    # Then loop ONLY over the y-bins of this column with the right number of points
    selH10 = np.where(h[i] > 10)[0]
    selH0 = np.where((h[i] > 0) & (h[i] <= 10))[0]
    for j in selH10:
        result[i, j] = func1(dataX[digitYX == j + 1])
    for j in selH0:
        result[i, j] = func2(dataX[digitYX == j + 1])
Hi!
I am looking for an efficient way of doing some simple binning of points and then applying some functions to points within each bin.
That's exactly what scipy.stats.binned_statistic does: http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.binned_s... binned_statistic uses np.digitize as well, but I'm not sure that in your code below digitize is the bottleneck - the nested for-loop looks like the more likely suspect. Ralf
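[For reference, a minimal sketch of the suggested function in its 2D variant, scipy.stats.binned_statistic_2d, which matches the 2D grid in the code above; the array sizes and bin counts here are made up for illustration:]

```python
import numpy as np
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(42)
x = rng.random(1000)
y = rng.random(1000)
data = rng.random(1000)

# Mean of `data` in each cell of a 10x20 grid over the data range;
# `statistic` also accepts a user-defined callable taking one 1-D array.
stat, xedges, yedges, binnumber = binned_statistic_2d(
    x, y, data, statistic='mean', bins=[10, 20])
print(stat.shape)  # (10, 20)
```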
participants (1)
-
Eric Emsellem