[Numpy-discussion] 2D binning

Wed Jun 2 01:15:39 EDT 2010

On Tue, Jun 1, 2010 at 1:51 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus <zachary.pincus at yale.edu> wrote:
>>> Hi
>>> Can anyone think of a clever (non-looping) solution to the following?
>>>
>>> I have a list of latitudes, a list of longitudes, and a list of data
>>> values. All lists are the same length.
>>>
>>> I want to compute an average of data values for each lat/lon pair.
>>> e.g. if (lat[1001], lon[1001]) == (lat[2001], lon[2001]) then
>>> data[1001] = (data[1001] + data[2001])/2
>>>
>>> Looping is going to take wayyyy too long.
>>
>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>> not floating-point, or floats that will always compare equal, that is,
>> the floating-point bit-patterns will be guaranteed to be identical) or
>> approximately equal to float tolerance?
>>
>> If you're in the approx-equal case, then look at the KD-tree in scipy
>> for doing near-neighbors queries.
>>
>> If you're in the exact-equal case, you could consider hashing the
>> lat/lon pairs or something. At least then the looping is O(N) and
>> not O(N^2):
>>
>> import collections
>> import numpy
>> grouped = collections.defaultdict(list)
>> for lt, ln, da in zip(lat, lon, data):
>>   grouped[(lt, ln)].append(da)
>>
>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())
>>
>> Is that fast enough?
>>
>> Zach
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
> This is a pretty good example of the "group-by" problem that will
> hopefully work its way into a future edition of NumPy. Given that, a
> good approach would be to produce a unique key from the lat and lon
> vectors, and pass that off to the groupby routine (when it exists).
>

Meanwhile, groupby from itertools will work, but it might be a bit slower
since it has to convert every row to a tuple and collect each group in a list.

import numpy as np
import itertools

# fake data
N = 10000
# N // 250 keeps the sample count an integer (plain / gives a float on Python 3)
lats = np.repeat(180 * (np.random.ranf(N // 250) - 0.5), 250)
lons = np.repeat(360 * (np.random.ranf(N // 250) - 0.5), 250)

np.random.shuffle(lats)
np.random.shuffle(lons)

vals = np.arange(N)
#####################################

inds = np.lexsort((lons, lats))

sorted_lats = lats[inds]
sorted_lons = lons[inds]
sorted_vals = vals[inds]

llv = np.array((sorted_lats, sorted_lons, sorted_vals)).T

for (lat, lon), group in itertools.groupby(llv, lambda row: tuple(row[:2])):
    group_vals = [g[-1] for g in group]
    print(lat, lon, np.mean(group_vals))

# make sure the mean for the last lat/lon from the loop matches the mean
# for that lat/lon from original data.
tests_idx, = np.where((lats == lat) & (lons == lon))
assert np.mean(vals[tests_idx]) == np.mean(group_vals)
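
If the Python-level loop is still too slow, one possible loop-free variant
reuses the same lexsort ordering and then sums each run of equal lat/lon
pairs with np.add.reduceat. This is only a sketch, not a tested replacement
for the groupby version: the function name group_mean_sorted is made up for
illustration, and it assumes "equal" means exactly equal floats, as in the
hashing approach quoted above.

```python
import numpy as np


def group_mean_sorted(lats, lons, vals):
    """Mean of vals for each distinct (lat, lon) pair, without a Python loop.

    Hypothetical helper: returns three arrays (group lats, group lons,
    group means), ordered by lat and then by lon.
    """
    lats = np.asarray(lats)
    lons = np.asarray(lons)
    vals = np.asarray(vals, dtype=float)

    # Same ordering as the groupby version: lat is the primary key,
    # lon breaks ties (lexsort sorts by the LAST key first).
    inds = np.lexsort((lons, lats))
    s_lats, s_lons, s_vals = lats[inds], lons[inds], vals[inds]

    # A new group starts wherever either coordinate differs from the
    # previous row; the first row always starts a group.
    is_start = np.ones(len(s_vals), dtype=bool)
    is_start[1:] = (s_lats[1:] != s_lats[:-1]) | (s_lons[1:] != s_lons[:-1])
    starts = np.flatnonzero(is_start)

    # Sum each run of equal pairs, then divide by the run length.
    sums = np.add.reduceat(s_vals, starts)
    counts = np.diff(np.append(starts, len(s_vals)))
    return s_lats[starts], s_lons[starts], sums / counts


# Small hand-checkable example: three distinct pairs.
g_lat, g_lon, g_mean = group_mean_sorted(
    [1.0, 2.0, 1.0, 2.0, 1.0],
    [5.0, 6.0, 5.0, 7.0, 5.0],
    [10, 20, 30, 40, 50],
)
print(g_lat, g_lon, g_mean)
```

With the fake data above, group_mean_sorted(lats, lons, vals) should give one
row per distinct pair, in the same order the groupby loop prints them.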