[Numpy-discussion] Proposal: scipy.spatial

Anne Archibald peridot.faceted at gmail.com
Mon Sep 29 23:24:01 EDT 2008


Hi,

Once again there has been a thread on the numpy/scipy mailing lists
requesting (essentially) some form of spatial data structure. Pointers
have been posted to ANN (sadly LGPLed and in C++) as well as a handful
of pure-python implementations of kd-trees. I suggest the creation of
a new submodule of scipy, scipy.spatial, to contain spatial data
structures and algorithms. Specifically, I propose it contain a
kd-tree implementation providing nearest-neighbor, approximate
nearest-neighbor, and all-points-near queries. I have a few other
suggestions for things it might contain, but kd-trees seem like a good
first step.

2008/9/27 Nathan Bell <wnbell at gmail.com>:
> On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald
> <peridot.faceted at gmail.com> wrote:
>>
>> I think a kd-tree implementation would be a valuable addition to
>> scipy, perhaps in a submodule scipy.spatial that might eventually
>> contain other spatial data structures and algorithms. What do you
>> think? Should we have one? Should it be based on Sturla Molden's code,
>> if the license permits? I am willing to contribute one, if not.
>
> +1

Judging that your vote and mine are enough in the absence of
dissenting voices, I have written an implementation based on yours,
Sturla Molden's, and the algorithms described by the authors of the
ANN library. Before integrating it into scipy, though, I'd like to
send it around for comments.

Particular issues:

* It's pure python right now, but with some effort it could be
partially or completely cythonized. This is probably a good idea in
the long run. In the meantime I have crudely vectorized it so that
users can at least avoid looping in their own code.
* It is able to return the r nearest neighbors, optionally subject to
a maximum distance limit, as well as approximate nearest neighbors.
* It is not currently set up to conveniently return all points within
some fixed distance of the target point, but this could easily be
added.
* It returns distances and indices of nearest neighbors in the original array.
* The testing code is, frankly, a mess. I need to look into using nose
in a sensible fashion.
* The license is the scipy license.

I am particularly concerned about providing a convenient return
format. The natural return from a query is a list of neighbors, since
it may have variable length (there may not be r elements in the tree,
or you might have supplied a maximum distance which doesn't contain r
points). For a single query, it's simple to return a python list
(should it be sorted? currently it's a heap). But if you want to
vectorize the process, storing the results in an array becomes
cumbersome. One option is an object array full of lists; another,
which, I currently use, is an array with nonexistent points marked
with an infinite distance and an invalid index. A third would be to
return masked arrays. How do you recommend handling this variable (or
potentially-variable) sized output?

> If you're implementing one, I would highly recommend the
> "left-balanced" kd-tree.
> http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf

Research suggests that at least in high dimension a more geometric
balancing criterion can produce vastly better results. I used the
"sliding midpoint" rule, which doesn't allow such a numerical
balancing but does ensure that you don't have too many long skinny
cells (since a sphere tends to cut very many of these).

I've also been thinking about what else would go in scipy.spatial. I
think it would be valuable to have a reasonably efficient distance
matrix function (we seem to get that question a lot, and the answer's
not trivial) as well as a sparse distance matrix function based on the
kd-trees. The module could potentially also swallow the (currently
sandboxed?) delaunay code.

Anne
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kdtree.py
Type: text/x-python
Size: 10591 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20080929/80b2d018/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_kdtree.py
Type: text/x-python
Size: 4369 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20080929/80b2d018/attachment-0001.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: plot_kdtree.py
Type: text/x-python
Size: 1403 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20080929/80b2d018/attachment-0002.py>


More information about the NumPy-Discussion mailing list