[Numpy-discussion] record arrays initialization

Moroney, Catherine M (388D) Catherine.M.Moroney at jpl.nasa.gov
Thu May 3 13:22:01 EDT 2012


> Date: Thu, 3 May 2012 10:00:11 -0700
> From: Keith Goodman <kwgoodman at gmail.com>
> Subject: Re: [Numpy-discussion] record arrays initialization
> To: Discussion of Numerical Python <numpy-discussion at scipy.org>
> 
> On Wed, May 2, 2012 at 4:46 PM, Kevin Jacobs <jacobs at bioinformed.com>
> <bioinformed at gmail.com> wrote:
> 
>> The cKDTree implementation is more than 4 times faster than the brute-force
>> approach:
>> 
>> T = scipy.spatial.cKDTree(targets)
>> 
>> In [11]: %timeit foo1(element, targets)   # Brute force
>> 1000 loops, best of 3: 385 us per loop
>> 
>> In [12]: %timeit foo2(element, T)         # cKDTree
>> 10000 loops, best of 3: 83.5 us per loop
>> 
>> In [13]: 385/83.5
>> Out[13]: 4.610778443113772
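
The foo1/foo2 helpers aren't shown in this excerpt; here is a minimal
sketch of what they presumably look like (the names are taken from the
timings above, everything else is an assumption):

    import numpy as np
    import scipy.spatial

    def foo1(element, targets):
        # Brute force: squared Euclidean distance to every row of
        # targets, then the index of the smallest one.
        d2 = ((targets - element) ** 2).sum(axis=1)
        return d2.argmin()

    def foo2(element, T):
        # The same query against a prebuilt cKDTree.
        dist, idx = T.query(element)
        return idx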
> 
> Brute force plus Cython beats cKDTree for knn with k=1 and Euclidean distance:
> 
>    I[5] targets = np.random.uniform(0, 8, (5000, 7))
>    I[6] element = np.arange(1, 8, dtype=np.float64)
> 
>    I[7] T = scipy.spatial.cKDTree(targets)
>    I[8] timeit T.query(element)
>    10000 loops, best of 3: 36.1 us per loop
> 
>    I[9] timeit nn(targets, element)
>    10000 loops, best of 3: 28.5 us per loop
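
The nn prototype is linked further down; as a rough sketch, it is the
kind of explicit loop that Cython compiles well. It is shown here in
plain Python, where it would be slow (the win comes from cythonizing it
with typed variables), and the return value is my guess:

    import numpy as np

    def nn(targets, element):
        # k=1 Euclidean nearest neighbor by explicit loops; avoids the
        # temporary arrays the vectorized version allocates.
        n, ndim = targets.shape
        best_d2 = np.inf
        best_i = -1
        for i in range(n):
            d2 = 0.0
            for j in range(ndim):
                diff = targets[i, j] - element[j]
                d2 += diff * diff
            if d2 < best_d2:
                best_d2 = d2
                best_i = i
        return np.sqrt(best_d2), best_i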
> 
> What about lower dimensions (2 instead of 7) where cKDTree gets faster?
> 
>    I[18] element = np.arange(1, 3, dtype=np.float64)
>    I[19] targets = np.random.uniform(0, 8, (5000, 2))
> 
>    I[20] T = scipy.spatial.cKDTree(targets)
>    I[21] timeit T.query(element)
>    10000 loops, best of 3: 27.5 us per loop
> 
>    I[22] timeit nn(targets, element)
>    100000 loops, best of 3: 11.6 us per loop
> 
> I could add nn to the bottleneck package. Is the k=1, Euclidean
> distance case too specialized?
> 
> Prototype (messy) code is here:
> https://github.com/kwgoodman/bottleneck/issues/45
> 
> For a smaller speed up (~2x), foo1 could use bottleneck's sum of
> squares function, bn.ss().
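
Assuming the foo1 sketched earlier, that tweak would look something
like this (bn.ss does the square-and-sum in one pass, skipping the
squared temporary array):

    import bottleneck as bn

    def foo1_ss(element, targets):
        # Sum of squares along axis 1 = squared Euclidean distances.
        d2 = bn.ss(targets - element, axis=1)
        return d2.argmin()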
> 
> 
Thank you everybody for your advice.  I will actually be clustering the
dataset for use in the production code (which must be in Fortran), using
scipy.cluster.vq.kmeans2 to generate and populate the clusters.  Even though
a brute-force search in Fortran is fast, it would still be too slow for the
final code.

But the kmeans2 clustering isn't "perfect": it doesn't always give results
identical to the brute-force search, so I want to analyze the differences
on a test data-set.
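
A minimal sketch of that comparison (the cluster count, data, and the
lookup scheme are my placeholders, not from this thread):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    targets = np.random.uniform(0, 8, (5000, 7))
    element = np.arange(1, 8, dtype=np.float64)

    # Cluster once up front; the fast lookup then searches only the
    # members of the nearest cluster.
    centroids, labels = kmeans2(targets, 50, minit='points')
    c = ((centroids - element) ** 2).sum(axis=1).argmin()
    members = np.flatnonzero(labels == c)
    near = ((targets[members] - element) ** 2).sum(axis=1).argmin()
    approx = members[near]

    # Exact brute-force answer for the difference analysis.
    exact = ((targets - element) ** 2).sum(axis=1).argmin()
    print(approx == exact)   # not always True, hence the analysis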

Catherine



