Computing pairwise distances between vectors with missing (NaN) values
Dear all,

My basic problem is that I would like to compute distances between vectors with missing values. You can find more detail in my question on SO (http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in...). Since this does not seem to be directly possible with scipy at the moment, I started to Cythonize my function. Currently, the function below is not much faster than my pure Python implementation, so I thought I'd ask the experts here. Note that even though I'm computing the Euclidean distance here, I'd like to be able to use different distance metrics.

So my current attempt at Cythonizing is:

```cython
import numpy

cimport numpy
cimport cython

from numpy.linalg import norm

numpy.import_array()


@cython.boundscheck(False)
@cython.wraparound(False)
def masked_euclidean(numpy.ndarray[numpy.double_t, ndim=2] data):
    cdef Py_ssize_t m = data.shape[0]
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef Py_ssize_t k = 0
    cdef numpy.ndarray[numpy.double_t] dm = numpy.zeros(m * (m - 1) // 2,
                                                        dtype=numpy.double)
    # Boolean mask of the finite (non-NaN) entries.
    cdef numpy.ndarray[numpy.uint8_t, ndim=2, cast=True] mask = numpy.isfinite(data)
    for i in range(m - 1):
        for j in range(i + 1, m):
            curr = numpy.logical_and(mask[i], mask[j])
            u = data[i][curr]
            v = data[j][curr]
            dm[k] = norm(u - v)
            k += 1
    return dm
```

Maybe the lack of speed-up is due to the Python function `norm`? So my question is: how can I improve the Cython implementation? Or is there a completely different way of approaching this problem?

Thanks in advance,
Moritz
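To make the expected input and output concrete, here is a toy driver for the function above (a sketch; the shapes and NaN fraction are invented):

```python
import numpy

# 100 vectors of length 20, with roughly 5% of the entries missing.
data = numpy.random.rand(100, 20)
data[data < 0.05] = numpy.nan

# Condensed distance matrix of length 100 * 99 / 2 = 4950, laid out
# like the output of scipy.spatial.distance.pdist.
dm = masked_euclidean(data)
```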
On 21 Jul 2014, at 10:09, Moritz Emanuel Beber wrote:
[...]
I would suggest using the `cython --annotate` option (or the `-a` option of the `%%cython` magic in the IPython notebook): it will show you the generated C code, with hints about which lines are slow and why, as a nicely syntax-highlighted HTML page. You are right that `norm` is slow, but apparently so are the `__getitem__` calls on `data` and `numpy.logical_and`.

-- M
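For reference, a minimal recipe for producing that annotated report (file names here are examples):

```python
# From the shell, this writes distance.html next to the source file:
#
#     cython -a distance.pyx
#
# In an IPython notebook, the same report is rendered inline below the cell:
#
#     %load_ext cythonmagic    # "%load_ext Cython" in newer Cython releases
#     %%cython -a
#     ...cell body compiled as Cython...
#
# Yellow highlighting marks lines that call into the Python C-API; the
# darker the line, the more Python overhead it carries.
```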
Hello again,

(I somehow lost the ability to reply to your message, Matthias, since I receive the mails in digest mode; apologies for that.)

So I've poked around the code with profilers and got two conflicting pieces of information. I now use a .pyx file (see attachment) and I profiled it in two different ways:

1. Using cProfile, which gave the following results:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.641    1.641    2.303    2.303 distance.pyx:13(masked_euclidean)
    44850    0.294    0.000    0.662    0.000 linalg.py:1924(norm)
    44850    0.292    0.000    0.292    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    44850    0.041    0.000    0.041    0.000 {numpy.core.multiarray.array}
    44850    0.023    0.000    0.065    0.000 numeric.py:392(asarray)
    44850    0.012    0.000    0.012    0.000 {method 'conj' of 'numpy.ndarray' objects}
        1    0.000    0.000    2.303    2.303 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```

   This leads me to believe that, yes, the boolean subsetting takes a significant amount of time, but that the majority of the time is spent in `norm`.

2. I also ran valgrind, and that seems to suggest that 55% of the time is spent in the boolean subsetting (you can get the log here: https://dl.dropboxusercontent.com/u/51564502/callgrind.log). Or am I reading the results wrong?

3. I couldn't get the %lprun magic to work in the IPython notebook; I just get 0 time for the whole function call. Is this possible somehow by now?

So my questions at this point are: can I improve the fancy indexing somehow? And can I easily plug in the scipy distance measures, so that I avoid the call to numpy.linalg.norm?

Thank you so much,
Moritz
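For what it's worth, call counts like those above can be reproduced with a small driver (a sketch; the module name and data shape are assumptions, and the .pyx needs the `# cython: profile=True` directive so that cProfile can see into the compiled function):

```python
import cProfile

import numpy

import distance  # compiled from the attached distance.pyx

# 300 vectors give 300 * 299 / 2 = 44850 pairs, matching the per-pair
# call counts in the table above; the NaN fraction here is invented.
data = numpy.random.rand(300, 50)
data[data < 0.1] = numpy.nan

cProfile.run("distance.masked_euclidean(data)", sort="tottime")
```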
Quick reply from my phone: isn't numpy.take() faster than fancy indexing?

-- M

Sent from my iPhone
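Spelled out, the suggestion might look like this (an untested sketch; the helper name is illustrative):

```python
import numpy

def masked_pair(data, mask, i, j):
    """Gather the coordinates where rows i and j are both finite."""
    # Convert the boolean mask into integer positions once, then gather
    # with numpy.take, which is typically faster than boolean fancy indexing.
    idx = numpy.flatnonzero(mask[i] & mask[j])
    u = numpy.take(data[i], idx)
    v = numpy.take(data[j], idx)
    return u, v
```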
On 23 Jul 2014, at 11:51, Moritz Beber <moritz.beber@gmail.com> wrote:
[...]
So I've made significant headway on Cythonizing a pdist function that ignores NaNs. You can see the results here: http://nbviewer.ipython.org/gist/Midnighter/b81d5732a0ef88f2e185

Two questions remain:

1. Can I somehow make use of the distance measures defined in scipy/spatial/src/distance.c?
2. Does anyone know if numexpr could be used to compute the above pairwise distances in parallel?

Thank you again.
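For readers without access to the notebook, this is roughly the shape such a function takes once everything is typed: the mask, the boolean subsetting, and the call to `norm` all collapse into one plain C loop. A minimal sketch, assuming Euclidean distance and a C-contiguous double array (illustrative, not the notebook's exact code):

```cython
cimport cython
from libc.math cimport sqrt, isnan

import numpy


@cython.boundscheck(False)
@cython.wraparound(False)
def nan_euclidean_pdist(double[:, ::1] data):
    cdef Py_ssize_t m = data.shape[0]
    cdef Py_ssize_t n = data.shape[1]
    cdef Py_ssize_t i, j, col, k = 0
    cdef double d, s
    dm = numpy.empty(m * (m - 1) // 2, dtype=numpy.double)
    cdef double[::1] out = dm
    for i in range(m - 1):
        for j in range(i + 1, m):
            s = 0.0
            for col in range(n):
                # Skip coordinates that are missing in either vector.
                if isnan(data[i, col]) or isnan(data[j, col]):
                    continue
                d = data[i, col] - data[j, col]
                s += d * d
            out[k] = sqrt(s)
            k += 1
    return dm
```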
participants (3)

- Matthias Bussonnier
- Moritz Beber
- Moritz Emanuel Beber