[Numpy-discussion] Efficient way of binning points and applying functions to these groups

Wed Dec 26 07:33:58 EST 2012

On Wed, Dec 26, 2012 at 10:09 AM, Eric Emsellem <eric.emsellem at eso.org>wrote:

> Hi!
>
> I am looking for an efficient way of doing some simple binning of points
> and then applying some functions to points within each bin.
>

That's exactly what scipy.stats.binned_statistic does:
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.binned_statistic.html

binned_statistic uses np.digitize as well, but I'm not sure that in your
code below digitize is the bottleneck - the nested for-loop looks like the
more likely suspect.

Ralf

>
> I have tried several ways, including crude looping over the indices, or
> using digitize (see below) but I cannot manage to get it as efficient as
> I need it to be. I have a comparison with a similar (although complex)
> code in idl, and thought I would ask the forum. In idl there is a way to
> "invert" an histogram and get a reverse set of indices (via the
> histogram function) which seems to do the trick (or maybe it is faster
> for another reason).
>
> Below I provide a (dummy) example of what I wish to achieve. Any hint on
> how to do this EFFICIENTLY using numpy is most welcome. I need to speed
> things up quite a bit (at the moment, the version I have, see below, is
> 10 times slower than the more complex idl routine..., I must be doing
> something wrong!).
>
> thanks!!
> Eric
> ========================================================
> # I have a random set of data points in 2D with coordinates x,y :
> import numpy as np
> x = np.random.random(1000)
> y = np.random.random(1000)
>
> # I have now a 2D grid given by let's say 10x10 grid points:
> nx = 11
> ny = 21
> lx = linspace(0,1,nx)
> ly = linspace(0,1,ny)
> gx, gy = np.meshgrid(lx, ly)
>
> # So my set of 2D bins are (not needed in the solution I present but
> just for clarity)
> bins = np.dstack((gx.ravel(), gy.ravel()))[0]
>
> # Now I want to have the list of points in each bin and
> # if the number of points in that bin is larger than 10, apply (dummy)
> function func1 (see below)
> # If less than 10, apply (dummy) function func2 so (dum?)
> # if 0, do nothing
> # for two dummy functions like (for example):
> def func1(x) : return x.mean()
>
> def func2(x) : return x.std()
>
> # One solution would be to use digitize in 1D and histogram in 2D (don't
> need gx, gy for this one):
>
> h = histogram2d(x, y, bins=[lx, ly])[0]
>
> digitX = np.digitize(x, lx)
> digitY = np.digitize(y, ly)
>
> # create the output array, with -999 values to make sure I see which
> ones are not filled in
> result = np.zeros_like(h) - 999
>
> for i in range(nx-1) :
>      for j in range(ny-1) :
>          selectionofpoints = (digitX == i+1) & (digitY == j+1)
>          if h[i,j] > 10 : result[i,j] = func1(x[selectionofpoints])
>          elif h[i,j] > 0 : result[i,j] = func2(x[selectionofpoints])
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20121226/bb25d352/attachment.html>