[Numpy-discussion] 2D binning
Wes McKinney
wesmckinn at gmail.com
Wed Jun 2 15:29:06 EDT 2010
On Wed, Jun 2, 2010 at 2:26 PM, <josef.pktd at gmail.com> wrote:
> On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates <mat.yeates at gmail.com> wrote:
>> I'm on Windows, using a precompiled binary. I never built numpy/scipy on
>> Windows.
>
> The ndimage measurements code has recently been rewritten. ndimage is
> very fast, but the old version has insufficient type checking and may
> crash on wrong inputs.
>
> I managed to work with it in the past on Windows. Maybe you could
> check the dtypes of the arguments to ndi.mean (preferably in an
> interpreter session where you don't mind a crash).
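>
> Something like this (untested, with made-up example arrays; the exact
> dtype restrictions may differ by version) should show quickly whether
> the dtypes are the problem:
>
> import numpy as np
> import scipy.ndimage as ndi
>
> x = np.random.randn(100)
> labels = np.random.randint(0, 10, 100)
>
> # older ndimage releases supported only a few dtypes, so cast explicitly
> ndi.mean(x.astype(np.float64),
>          labels=labels.astype(np.int32),
>          index=np.arange(10))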
>
> I don't remember the type restrictions, but there are/were several
> tickets for it.
>
> Josef
>
>
>>
>> On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>
>>> On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates <mat.yeates at gmail.com>
>>> wrote:
>>> > Thanks. I am also getting an error in ndi.mean. Were you getting
>>> > the error "RuntimeError: data type not supported"?
>>> >
>>> > -Mathew
>>> >
>>> > On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney <wesmckinn at gmail.com>
>>> > wrote:
>>> >>
>>> >> On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut <schut at sarvision.nl>
>>> >> wrote:
>>> >> > On 06/02/2010 04:52 AM, josef.pktd at gmail.com wrote:
>>> >> >> On Tue, Jun 1, 2010 at 9:57 PM, Zachary
>>> >> >> Pincus<zachary.pincus at yale.edu>
>>> >> >> wrote:
>>> >> >>>> I guess it's as fast as I'm going to get. I don't really see
>>> >> >>>> any other way. (BTW, the lat/lons are integers.)
>>> >> >>>
>>> >> >>> You could (in C or Cython) try a brain-dead "hashtable" with no
>>> >> >>> collision detection:
>>> >> >>>
>>> >> >>> for lat, long, data in dataset:
>>> >> >>>     bin = (lat ^ long) % num_bins
>>> >> >>>     hashtable[bin] = update_incremental_mean(hashtable[bin], data)
>>> >> >>>
>>> >> >>> you'll of course want to do some experiments to see if your data
>>> >> >>> are
>>> >> >>> sufficiently sparse and/or you can afford a large enough hashtable
>>> >> >>> array that you won't get spurious hash collisions. Adding error-
>>> >> >>> checking to ensure that there are no collisions would be pretty
>>> >> >>> trivial (just keep a table of the lat/long for each hash value,
>>> >> >>> which
>>> >> >>> you'll need anyway, and check that different lat/long pairs don't
>>> >> >>> get
>>> >> >>> assigned the same bin).
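>>> >> >>>
>>> >> >>> A rough sketch of the collision-checked version (untested; the
>>> >> >>> dict is only there for the error checking, and explicit
>>> >> >>> sums/counts stand in for update_incremental_mean):
>>> >> >>>
>>> >> >>> import numpy
>>> >> >>>
>>> >> >>> num_bins = 2 ** 16  # comfortably larger than the number of pairs
>>> >> >>> sums = numpy.zeros(num_bins)
>>> >> >>> counts = numpy.zeros(num_bins, dtype=int)
>>> >> >>> seen = {}  # bin -> (lat, lon), for collision detection
>>> >> >>>
>>> >> >>> for lt, ln, da in zip(lat, lon, data):
>>> >> >>>     b = (lt ^ ln) % num_bins
>>> >> >>>     if seen.setdefault(b, (lt, ln)) != (lt, ln):
>>> >> >>>         raise ValueError('hash collision; enlarge num_bins')
>>> >> >>>     sums[b] += da
>>> >> >>>     counts[b] += 1
>>> >> >>>
>>> >> >>> # mean of the data values in each occupied bin
>>> >> >>> means = sums[counts > 0] / counts[counts > 0]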
>>> >> >>>
>>> >> >>> Zach
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>>> -Mathew
>>> >> >>>>
>>> >> >>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary
>>> >> >>>> Pincus<zachary.pincus at yale.edu
>>> >> >>>>> wrote:
>>> >> >>>>> Hi
>>> >> >>>>> Can anyone think of a clever (non-looping) solution to the
>>> >> >>>>> following?
>>> >> >>>>>
>>> >> >>>>> I have a list of latitudes, a list of longitudes, and a list
>>> >> >>>>> of data values. All lists are the same length.
>>> >> >>>>>
>>> >> >>>>> I want to compute an average of the data values for each
>>> >> >>>>> lat/lon pair, e.g. if lat[1001] == lat[2001] and
>>> >> >>>>> lon[1001] == lon[2001], then
>>> >> >>>>> data[1001] = (data[1001] + data[2001])/2
>>> >> >>>>>
>>> >> >>>>> Looping is going to take wayyyy too long.
>>> >> >>>>
>>> >> >>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e.
>>> >> >>>> either not floating-point, or floats whose bit patterns are
>>> >> >>>> guaranteed to be identical, so they always compare equal), or
>>> >> >>>> only approximately equal to within float tolerance?
>>> >> >>>>
>>> >> >>>> If you're in the approx-equal case, then look at the KD-tree in
>>> >> >>>> scipy
>>> >> >>>> for doing near-neighbors queries.
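>>> >> >>>>
>>> >> >>>> (Untested sketch of that route, assuming lat, lon and data are
>>> >> >>>> numpy arrays; the tolerance is a made-up placeholder:)
>>> >> >>>>
>>> >> >>>> import numpy
>>> >> >>>> from scipy.spatial import KDTree
>>> >> >>>>
>>> >> >>>> pts = numpy.column_stack([lat, lon])
>>> >> >>>> tree = KDTree(pts)
>>> >> >>>> # for each point, the indices of all points within tolerance
>>> >> >>>> matches = tree.query_ball_point(pts, 1e-6)
>>> >> >>>> averaged = numpy.array([numpy.mean(data[idx]) for idx in matches])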
>>> >> >>>>
>>> >> >>>> If you're in the exact-equal case, you could consider hashing the
>>> >> >>>> lat/
>>> >> >>>> lon pairs or something. At least then the looping is O(N) and not
>>> >> >>>> O(N^2):
>>> >> >>>>
>>> >> >>>> import collections
>>> >> >>>> import numpy
>>> >> >>>>
>>> >> >>>> grouped = collections.defaultdict(list)
>>> >> >>>> for lt, ln, da in zip(lat, lon, data):
>>> >> >>>>     grouped[(lt, ln)].append(da)
>>> >> >>>>
>>> >> >>>> averaged = dict((ltln, numpy.mean(da))
>>> >> >>>>                 for ltln, da in grouped.items())
>>> >> >>>>
>>> >> >>>> Is that fast enough?
>>> >> >>
>>> >> >> If the lat/lon pairs can be converted to a 1d label, as Wes
>>> >> >> suggested, then in a similar timing exercise ndimage was the fastest:
>>> >> >> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
>>> >> >
>>> >> > And as you said your lats and lons are integers, you could simply do
>>> >> >
>>> >> > ll = lat*1000 + lon
>>> >> >
>>> >> > to get unique 'hashes' or '1d labels' for your lat/lon pairs, as
>>> >> > a lat or lon will never exceed 360 (degrees).
>>> >> >
>>> >> > After that, either use the ndimage approach, or use histogramming
>>> >> > weighted by the data values and divide by the unweighted
>>> >> > histogram, or just loop.
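>>> >> >
>>> >> > The histogram version might look roughly like this (untested,
>>> >> > assuming numpy is imported and lat/lon are integer arrays):
>>> >> >
>>> >> > ll = lat * 1000 + lon
>>> >> > edges = numpy.arange(ll.min(), ll.max() + 2)
>>> >> > sums, _ = numpy.histogram(ll, bins=edges, weights=data)
>>> >> > counts, _ = numpy.histogram(ll, bins=edges)
>>> >> > means = sums[counts > 0] / counts[counts > 0]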
>>> >> >
>>> >> > Vincent
>>> >> >
>>> >> >>
>>> >> >> (this was for Python 2.4; later I also found np.bincount, which
>>> >> >> requires the labels to be consecutive integers but is as fast as
>>> >> >> ndimage)
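>>> >> >>
>>> >> >> For comparison, the bincount version could look roughly like this
>>> >> >> (untested):
>>> >> >>
>>> >> >> ll = lat * 1000 + lon
>>> >> >> # remap the labels to consecutive integers 0..n-1
>>> >> >> unique_ll, labels = np.unique(ll, return_inverse=True)
>>> >> >> means = np.bincount(labels, weights=data) / np.bincount(labels)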
>>> >> >>
>>> >> >> I don't know how it would compare to the new suggestions.
>>> >> >>
>>> >> >> Josef
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>>>
>>> >> >>>> Zach
>>> >>
>>> >> I was curious about how fast ndimage was for this operation so here's
>>> >> the complete function.
>>> >>
>>> >> import numpy as np
>>> >> import scipy.ndimage as ndi
>>> >>
>>> >> N = 10000
>>> >>
>>> >> lat = np.random.randint(0, 360, N)
>>> >> lon = np.random.randint(0, 360, N)
>>> >> data = np.random.randn(N)
>>> >>
>>> >> def group_mean(lat, lon, data):
>>> >>     # sort by (lat, lon) so equal pairs end up contiguous
>>> >>     indexer = np.lexsort((lon, lat))
>>> >>     lat = lat.take(indexer)
>>> >>     lon = lon.take(indexer)
>>> >>     sorted_data = data.take(indexer)
>>> >>
>>> >>     # encode each (lat, lon) pair as a single integer label
>>> >>     keys = 1000 * lat + lon
>>> >>     unique_keys = np.unique(keys)
>>> >>
>>> >>     # per-label means in one shot
>>> >>     result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
>>> >>     # first occurrence of each key, to recover the (lat, lon) pair
>>> >>     decoder = keys.searchsorted(unique_keys)
>>> >>
>>> >>     return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))
>>> >>
>>> >> Appears to be about 13x faster (and could be made faster still) than
>>> >> the naive version on my machine:
>>> >>
>>> >> import collections
>>> >>
>>> >> def group_mean_naive(lat, lon, data):
>>> >>     grouped = collections.defaultdict(list)
>>> >>     for lt, ln, da in zip(lat, lon, data):
>>> >>         grouped[(lt, ln)].append(da)
>>> >>
>>> >>     averaged = dict((ltln, np.mean(da))
>>> >>                     for ltln, da in grouped.items())
>>> >>
>>> >>     return averaged
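>>> >>
>>> >> A quick (untested) way to reproduce the comparison; exact timings
>>> >> will of course vary:
>>> >>
>>> >> import time
>>> >>
>>> >> t0 = time.time()
>>> >> group_mean(lat, lon, data)
>>> >> t1 = time.time()
>>> >> group_mean_naive(lat, lon, data)
>>> >> t2 = time.time()
>>> >> print('speedup: %.1fx' % ((t2 - t1) / (t1 - t0)))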
>>> >>
>>> >> I had to get the latest scipy trunk to avoid an error from
>>> >> ndimage.mean.
>>>
>>> That's the error I was getting. Depending on your OS, upgrading to
>>> the scipy trunk should be the easiest fix.
>
If you're on Python 2.6, the binary here might work for you:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
It looks recent enough to have the rewritten ndimage.