[Numpy-discussion] Proposal: scipy.spatial

Barry Wark barrywark at gmail.com
Wed Oct 1 02:08:24 EDT 2008


On Mon, Sep 29, 2008 at 8:24 PM, Anne Archibald
<peridot.faceted at gmail.com> wrote:
> Hi,
>
> Once again there has been a thread on the numpy/scipy mailing lists
> requesting (essentially) some form of spatial data structure. Pointers
> have been posted to ANN (sadly LGPLed and in C++) as well as a handful
> of pure-python implementations of kd-trees. I suggest the creation of
> a new submodule of scipy, scipy.spatial, to contain spatial data
> structures and algorithms. Specifically, I propose it contain a
> kd-tree implementation providing nearest-neighbor, approximate
> nearest-neighbor, and all-points-near queries. I have a few other
> suggestions for things it might contain, but kd-trees seem like a good
> first step.
>
> 2008/9/27 Nathan Bell <wnbell at gmail.com>:
>> On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald
>> <peridot.faceted at gmail.com> wrote:
>>>
>>> I think a kd-tree implementation would be a valuable addition to
>>> scipy, perhaps in a submodule scipy.spatial that might eventually
>>> contain other spatial data structures and algorithms. What do you
>>> think? Should we have one? Should it be based on Sturla Molden's code,
>>> if the license permits? I am willing to contribute one, if not.
>>
>> +1
>
> Judging that your vote and mine are enough in the absence of
> dissenting voices, I have written an implementation based on yours,
> Sturla Molden's, and the algorithms described by the authors of the
> ANN library. Before integrating it into scipy, though, I'd like to
> send it around for comments.
>
> Particular issues:
>
> * It's pure python right now, but with some effort it could be
> partially or completely cythonized. This is probably a good idea in
> the long run. In the meantime I have crudely vectorized it so that
> users can at least avoid looping in their own code.
> * It is able to return the r nearest neighbors, optionally subject to
> a maximum distance limit, as well as approximate nearest neighbors.
> * It is not currently set up to conveniently return all points within
> some fixed distance of the target point, but this could easily be
> added.
> * It returns distances and indices of nearest neighbors in the original array.
> * The testing code is, frankly, a mess. I need to look into using nose
> in a sensible fashion.
> * The license is the scipy license.
>
> I am particularly concerned about providing a convenient return
> format. The natural return from a query is a list of neighbors, since
> it may have variable length (there may not be r elements in the tree,
> or you might have supplied a maximum distance which doesn't contain r
> points). For a single query, it's simple to return a python list
> (should it be sorted? currently it's a heap). But if you want to
> vectorize the process, storing the results in an array becomes
> cumbersome. One option is an object array full of lists; another,
> which, I currently use, is an array with nonexistent points marked
> with an infinite distance and an invalid index. A third would be to
> return masked arrays. How do you recommend handling this variable (or
> potentially-variable) sized output?
>
>> If you're implementing one, I would highly recommend the
>> "left-balanced" kd-tree.
>> http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf
>
> Research suggests that at least in high dimension a more geometric
> balancing criterion can produce vastly better results. I used the
> "sliding midpoint" rule, which doesn't allow such a numerical
> balancing but does ensure that you don't have too many long skinny
> cells (since a sphere tends to cut very many of these).
>
> I've also been thinking about what else would go in scipy.spatial. I
> think it would be valuable to have a reasonably efficient distance
> matrix function (we seem to get that question a lot, and the answer's
> not trivial) as well as a sparse distance matrix function based on the
> kd-trees. The module could potentially also swallow the (currently
> sandboxed?) delaunay code.
>
> Anne

Anne,

Thanks for taking this on. The scikits.ann has licensing issues (as
noted above), so it would be nice to have a clean-room implementation
in scipy. I am happy to port the scikits.ann API to the final API that
you choose, however, if you think that would be helpful.

Cheers,
Barry



More information about the NumPy-Discussion mailing list