[Numpy-discussion] Proposal: scipy.spatial

David Bolme bolme1234 at comcast.net
Fri Oct 3 20:25:25 EDT 2008


I remember reading a paper or book that stated that for data that has  
been normalized correlation and Euclidean are equivalent and will  
produce the same knn results.  To this end I spent a couple hours this  
afternoon doing the math.  This document is the result.

http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

I believe that with mean subtracted and unit length vectors, a  
Euclidean knn algorithm will produces the same result as if the  
vectors were compared using correlation.  I am not sure if kd-trees  
will perform well on the normalized vectors as they have a very  
specific geometry.  If my math checks out it may be worth adding  
Pearson's correlation as a default option or as a separate class.

I have also spent a little time looking at kd-trees and the kdtree  
code.  It looks good.  As I understand it kd-trees only work well when  
the number of datapoints (N) is larger than 2^D, where D is the  
dimensionality of those points.  This will not work well for many of  
my computer vision problems because often D is large.  As Anne  
suggested I will probably look at cover trees because often times the  
data are "low-dimensional data in high-dimensional spaces".  I have  
been told though that with a large D there is know known fast  
algorithm for knn.

Another problem is that the distances and similarity measures used in  
biometrics and computer vision are often very specialized and may or  
may not conform to the underlying assumptions of fast algorithms.  I  
think for this reason I will need an exhaustive search algorithm.  I  
will code it up modeled after Anne's interface and hopefully it will  
make it into the spatial module.

I think that kd-trees and the spatial module are a good contribution  
to scipy.  I have also enjoyed learning more about norms, correlation,  
and fast knn algorithms.

Thanks,
Dave



More information about the NumPy-Discussion mailing list