[SciPy-Dev] Distance Metrics

10 Jan 2012

Hello,
I've been working on a little project lately centered around distance
metrics (https://github.com/jakevdp/pyDistances).  The idea was to
create a set of Cython distance metrics that can be called as usual
from Python with NumPy arrays, but which also expose low-level C
function pointers, so that the same metrics can be called directly on
memory buffers from within cythonized tree-based KNN searches (KD Tree,
Ball Tree, etc.) without any Python overhead.
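
To make the design concrete, here is a rough pure-Python analogy of the
idea (not the pyDistances API; every name in it is hypothetical): one
metric routine operating on raw 1-D buffers is shared by both a distance
matrix computation and a neighbor search.  In the real project the
shared routine would be a C function pointer and the search would be a
tree, not the brute-force loop assumed here.

```python
import numpy as np


def euclidean(x, y):
    """Hypothetical stand-in for a low-level metric on two 1-D buffers."""
    d = x - y
    return float(np.sqrt(np.dot(d, d)))


def pairwise(X, Y, metric):
    """Distance matrix built by calling the metric on every pair of rows."""
    D = np.empty((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            D[i, j] = metric(X[i], Y[j])
    return D


def brute_force_knn(X, query, k, metric):
    """Indices of the k rows of X nearest to query, using the same metric."""
    dists = np.array([metric(row, query) for row in X])
    return np.argsort(dists)[:k]


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
D = pairwise(X, X, euclidean)
nearest = brute_force_knn(X, np.array([0.9, 0.1]), k=2, metric=euclidean)
```

Because both callers go through the same metric function, adding a new
metric means writing it once.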

I initially had in mind developing this for scikit-learn in order to
extend the capability of Ball Tree, but it occurred to me that this
might be nice to have in scipy as well.  The speed of computing a
distance matrix is comparable to that of pdist/cdist in
scipy.spatial.distance (a few metrics are slightly faster, a few are
slightly slower).  The primary advantage of this approach is the
exposure of the underlying C functions, which can be easily imported
and called from other Cython code.
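
For reference, the existing scipy interface being compared against
looks roughly like this.  It is a small sketch using the documented
cdist, pdist and squareform functions; the random test data and the
NumPy broadcasting reference are my own additions for checking, not part
of either library's implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((5, 3))  # 5 points in 3 dimensions (arbitrary test data)

# Full 5x5 distance matrix between two sets of points (here X with itself).
D_cdist = cdist(X, X, metric="euclidean")

# pdist returns the condensed upper triangle; squareform expands it.
D_pdist = squareform(pdist(X, metric="euclidean"))

# Plain NumPy reference computed by broadcasting, for comparison only.
diff = X[:, None, :] - X[None, :, :]
D_ref = np.sqrt((diff ** 2).sum(axis=-1))

print(np.allclose(D_cdist, D_ref), np.allclose(D_pdist, D_ref))
```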

I think there are several other advantages over the current scipy
implementation.  Because the new code is pure Cython, it would likely be
easier to maintain and to add metrics than the current scipy setup,
which relies on C routines wrapped by hand using the NumPy C-API.
Because all distance functions rely on the same set of underlying Cython
routines, there are fewer places for error (for instance, currently the
scipy.spatial.distance boolean routines return different results
depending on whether you call the metrics directly or use cdist/pdist).
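
A check along these lines can probe that kind of inconsistency.  This is
only a sketch: the helper max_discrepancy and the tiny boolean array are
made up for illustration, and whether any gap actually shows up will
depend on the scipy version installed and on the metric chosen.

```python
import numpy as np
from scipy.spatial import distance


def max_discrepancy(X, metric_func, metric_name):
    """Largest |direct call - cdist entry| over all pairs of rows in X."""
    D = distance.cdist(X, X, metric=metric_name)
    worst = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[0]):
            direct = metric_func(X[i], X[j])
            worst = max(worst, abs(direct - D[i, j]))
    return worst


# Small hand-written boolean data set (illustrative only).
X = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0]], dtype=bool)

gap_hamming = max_discrepancy(X, distance.hamming, "hamming")
gap_jaccard = max_discrepancy(X, distance.jaccard, "jaccard")
print("hamming:", gap_hamming, "jaccard:", gap_jaccard)
```

A nonzero gap for a given metric would mean the direct function and
cdist disagree on that input.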

I'm curious what people think: could a framework like this replace the
current scipy.spatial.distance implementation?  Are there any
disadvantages that I'm not noticing?
Thanks,
    Jake

Jacob VanderPlas