[Numpy-discussion] record arrays initialization

Aronne Merrelli aronne.merrelli at gmail.com
Wed May 2 19:25:01 EDT 2012


On Wed, May 2, 2012 at 5:27 PM, Kevin Jacobs <jacobs at bioinformed.com>
<bioinformed at gmail.com> wrote:
> On Wed, May 2, 2012 at 5:45 PM, Moroney, Catherine M (388D)
> <Catherine.M.Moroney at jpl.nasa.gov> wrote:
>>
>> Thanks to Perry for some very useful off-list conversation.   I realize
>> that
>> I wasn't being clear at all in my earlier description of the problem so
>> here it is
>> in a nutshell:
>>
>> Find the best match in an array t(5000, 7) for a single vector e(7).  Now
>> scale
>> it up so e is (128, 512, 7) and I want to return a (128, 512) array of the
>> t-identifiers
>> that are the best match for e.  "Best match" is defined as the minimum
>> Euclidean distance.
>>
>
> It sounds like you want to find the nearest neighbor to a point in a
> high-dimensional space. This sounds like a job for a spacial data structure
> like a KD-tree.  See:

> http://docs.scipy.org/doc/scipy/reference/spatial.html
> http://mloss.org/software/view/143/
> http://www.mrzv.org/software/pyANN/
> etc.

In general this is a good suggestion - I was going to mention it
earlier - but I think for this particular problem it is not better
than the "brute force" and argmin() NumPy approach. On my laptop, the
KDTree query is about a factor of 7 slower (ignoring the time cost to
create the KDTree)

In [45]: def foo1(element, targets):
    distance_squared = ((element-targets)**2).sum(1)
    min_index = distance_squared.argmin()
    return sqrt(distance_squared[min_index]), min_index
   ....:

In [46]: def foo2(element, T):
    return T.query(element)


In [47]: element = np.arange(1,8)
In [48]: targets = np.random.uniform(0,8,(5000,7))
In [49]: T = scipy.spatial.KDTree(targets)
In [50]: %timeit foo1(element, targets)
1000 loops, best of 3: 401 us per loop
In [51]: %timeit foo2(element, T)
100 loops, best of 3: 2.92 ms per loop

Just to make sure they say the same thing:

In [53]: foo1(element, targets)
Out[53]: (1.8173671152898632, 570)
In [54]: foo2(element, T)
Out[54]: (1.8173671152898632, 570)


I think KDTree is more optimal for larger searches (more than 5000
elements), and fewer dimensions. For example, with 500,000 elements
and 2 dimensions, I get 34 ms for NumPy and 2 ms for the KDtree.

Back to the original question, for 400 us per search, even over
128x512 elements that would be 26 seconds total, I think? That might
not be too bad.

Cheers,
Aronne



More information about the NumPy-Discussion mailing list