[Numpy-discussion] 2D binning

Vincent Schut schut at sarvision.nl
Wed Jun 2 03:41:06 EDT 2010


On 06/02/2010 04:52 AM, josef.pktd at gmail.com wrote:
> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus <zachary.pincus at yale.edu> wrote:
>>> I guess it's as fast as I'm going to get. I don't really see any
>>> other way. BTW, the lat/lons are integers)
>>
>> You could (in c or cython) try a brain-dead "hashtable" with no
>> collision detection:
>>
>> for lat, long, data in dataset:
>>    bin = (lat ^ long) % num_bins
>>    hashtable[bin] = update_incremental_mean(hashtable[bin], data)
>>
>> you'll of course want to do some experiments to see if your data are
>> sufficiently sparse and/or you can afford a large enough hashtable
>> array that you won't get spurious hash collisions. Adding error-
>> checking to ensure that there are no collisions would be pretty
>> trivial (just keep a table of the lat/long for each hash value, which
>> you'll need anyway, and check that different lat/long pairs don't get
>> assigned the same bin).
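>>
>> A minimal sketch of that idea in plain Python/numpy (untested, and
>> num_bins is a placeholder you'd have to tune; as said, there is no
>> collision handling):
>>
>> import numpy
>>
>> num_bins = 2 ** 20   # oversized table to keep collisions rare
>> counts = numpy.zeros(num_bins, dtype=int)
>> means = numpy.zeros(num_bins, dtype=float)
>>
>> for lt, ln, da in zip(lat, lon, data):
>>     b = (lt ^ ln) % num_bins          # the "brain-dead" hash
>>     counts[b] += 1
>>     means[b] += (da - means[b]) / counts[b]   # incremental mean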
>>
>> Zach
>>
>>
>>
>>> -Mathew
>>>
>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus <zachary.pincus at yale.edu> wrote:
>>>> Hi
>>>> Can anyone think of a clever (non-looping) solution to the
>>>> following?
>>>>
>>>> A have a list of latitudes, a list of longitudes, and list of data
>>>> values. All lists are the same length.
>>>>
>>>> I want to compute an average of data values for each lat/lon pair.
>>>> e.g. if lat[1001] == lat[2001] and lon[1001] == lon[2001], then
>>>> data[1001] = (data[1001] + data[2001])/2
>>>>
>>> Looping is going to take way too long.
>>>
>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>>> not floating-point, or floats whose bit-patterns are guaranteed to be
>>> identical, so they always compare equal), or only approximately equal
>>> to within floating-point tolerance?
>>>
>>> If you're in the approx-equal case, then look at the KD-tree in scipy
>>> for doing near-neighbors queries.
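>>>
>>> For instance, something along these lines (an untested sketch; the
>>> radius r is an assumption you'd set from your data's tolerance):
>>>
>>> import numpy
>>> from scipy.spatial import KDTree
>>>
>>> pts = numpy.column_stack([lat, lon])
>>> data = numpy.asarray(data, dtype=float)
>>> tree = KDTree(pts)
>>> # for each point, indices of all points within r (itself included)
>>> groups = [tree.query_ball_point(p, r=1e-6) for p in pts]
>>> approx_means = numpy.array([data[idx].mean() for idx in groups])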
>>>
>>> If you're in the exact-equal case, you could consider hashing the lat/
>>> lon pairs or something. At least then the looping is O(N) and not
>>> O(N^2):
>>>
>>> import collections
>>> import numpy
>>>
>>> grouped = collections.defaultdict(list)
>>> for lt, ln, da in zip(lat, lon, data):
>>>     grouped[(lt, ln)].append(da)
>>>
>>> averaged = dict((ltln, numpy.mean(das))
>>>                 for ltln, das in grouped.items())
>>>
>>> Is that fast enough?
>
> If the lat/lon pairs can be converted to a 1d label as Wes suggested,
> then in a similar timing exercise ndimage was the fastest:
> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
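>
> Roughly like this, I believe (untested sketch; assumes integer
> lat/lon so that the 1d labels below are unique):
>
> import numpy
> from scipy import ndimage
>
> ll = numpy.asarray(lat) * 1000 + numpy.asarray(lon)  # 1d label per pair
> labels = numpy.unique(ll)
> means = ndimage.mean(numpy.asarray(data, dtype=float),
>                      labels=ll, index=labels)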

And as you said your lats and lons are integers, you could simply do

ll = lat*1000 + lon

to get unique 'hashes' or '1d labels' for your lat/lon pairs: since a
lat or lon never exceeds 360 (degrees), the lat*1000 term and the lon
term can never collide.

After that, either use the ndimage approach, or use histogramming with
the data values as weights and divide by the unweighted histogram (see
the sketch below), or just loop.
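
A rough sketch of that histogram variant (untested; assumes the
integer labels ll from above):

import numpy

ll = numpy.asarray(lat) * 1000 + numpy.asarray(lon)
data = numpy.asarray(data, dtype=float)

# one bin per integer label; weighted sums divided by plain counts
edges = numpy.arange(ll.min(), ll.max() + 2)
sums, _ = numpy.histogram(ll, bins=edges, weights=data)
counts, _ = numpy.histogram(ll, bins=edges)
means = sums[counts > 0] / counts[counts > 0]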

Vincent

>
> (this was for python 2.4; later I also found np.bincount, which
> requires that the labels be consecutive integers, but is as fast as
> ndimage)
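>
> A sketch of that bincount route for completeness (untested):
> np.unique with return_inverse remaps arbitrary labels onto
> consecutive integers, which is what bincount wants:
>
> import numpy as np
>
> ll = np.asarray(lat) * 1000 + np.asarray(lon)
> uniq, inv = np.unique(ll, return_inverse=True)
> means = np.bincount(inv, weights=np.asarray(data, float)) / np.bincount(inv)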
>
> I don't know how it would compare to the new suggestions.
>
> Josef
>
>
>
>>>
>>> Zach



