Computing pairwise distances between vectors with missing (NaN) values
Dear all,

My basic problem is that I would like to compute distances between vectors with missing values. You can find more detail in my question on SO (http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in...). Since this does not seem to be directly possible with scipy at the moment, I started to Cythonize my function. Currently, the function below is not much faster than my pure Python implementation, so I thought I'd ask the experts here. Note that even though I'm computing the Euclidean distance here, I'd like to be able to use different distance metrics.

So my current attempt at Cythonizing is:

```cython
import numpy

cimport numpy
cimport cython

from numpy.linalg import norm

numpy.import_array()


@cython.boundscheck(False)
@cython.wraparound(False)
def masked_euclidean(numpy.ndarray[numpy.double_t, ndim=2] data):
    cdef Py_ssize_t m = data.shape[0]
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef Py_ssize_t k = 0
    cdef numpy.ndarray[numpy.double_t] dm = numpy.zeros(m * (m - 1) // 2,
                                                        dtype=numpy.double)
    # Boolean mask of the finite (non-NaN) entries.
    cdef numpy.ndarray[numpy.uint8_t, ndim=2, cast=True] mask = numpy.isfinite(data)
    for i in range(m - 1):
        for j in range(i + 1, m):
            curr = numpy.logical_and(mask[i], mask[j])
            u = data[i][curr]
            v = data[j][curr]
            dm[k] = norm(u - v)
            k += 1
    return dm
```

Maybe the lack of speed-up is due to the Python function `norm`? So my question is: how can I improve the Cython implementation? Or is there a completely different way of approaching this problem?

Thanks in advance,
Moritz
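To make the expected input and output concrete, here is a toy driver for the function above (a sketch; the shapes and NaN fraction are invented):

```python
import numpy

# 100 vectors of length 20, with roughly 5% of the entries missing.
data = numpy.random.rand(100, 20)
data[data < 0.05] = numpy.nan

# Condensed distance matrix of length 100 * 99 / 2 = 4950, laid out
# like the output of scipy.spatial.distance.pdist.
dm = masked_euclidean(data)
```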
On 21 Jul 2014, at 10:09, Moritz Emanuel Beber wrote:
[...]
I would suggest using the `cython --annotate` option (or the `-a` option of the `%%cython` magic in the IPython notebook): it will show you the generated C code, with hints about which lines are slow and why, as a nicely syntax-highlighted HTML page. You are right that `norm` is slow, but apparently so are the `__getitem__` calls on `data` and `numpy.logical_and`.

-- M
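For reference, a minimal recipe for producing that annotated report (file names here are examples):

```python
# From the shell, this writes distance.html next to the source file:
#
#     cython -a distance.pyx
#
# In an IPython notebook, the same report is rendered inline below the cell:
#
#     %load_ext cythonmagic    # "%load_ext Cython" in newer Cython releases
#     %%cython -a
#     ...cell body compiled as Cython...
#
# Yellow highlighting marks lines that call into the Python C-API; the
# darker the line, the more Python overhead it carries.
```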
Hello again,

(I somehow lost the ability to reply to your message, Matthias, since I receive the mails in digest mode; apologies for that.)

So I've poked around the code with profilers and got two conflicting pieces of information. I now use a .pyx file (see attachment) and I profiled it in two different ways:

1. Using cProfile, which gave the following results:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.641    1.641    2.303    2.303 distance.pyx:13(masked_euclidean)
    44850    0.294    0.000    0.662    0.000 linalg.py:1924(norm)
    44850    0.292    0.000    0.292    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    44850    0.041    0.000    0.041    0.000 {numpy.core.multiarray.array}
    44850    0.023    0.000    0.065    0.000 numeric.py:392(asarray)
    44850    0.012    0.000    0.012    0.000 {method 'conj' of 'numpy.ndarray' objects}
        1    0.000    0.000    2.303    2.303 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```

   This leads me to believe that, yes, the boolean subsetting takes a significant amount of time, but that the majority of the time is spent in `norm`.

2. I also ran valgrind, and that seems to suggest that 55% of the time is spent in the boolean subsetting (you can get the log here: https://dl.dropboxusercontent.com/u/51564502/callgrind.log). Or am I reading the results wrong?

3. I couldn't get the %lprun magic to work in the IPython notebook; I just get 0 time for the whole function call. Is this possible somehow by now?

So my questions at this point are: can I improve the fancy indexing somehow? And can I easily plug in the scipy distance measures, so that I avoid the call to numpy.linalg.norm?

Thank you so much,
Moritz
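For what it's worth, call counts like those above can be reproduced with a small driver (a sketch; the module name and data shape are assumptions, and the .pyx needs the `# cython: profile=True` directive so that cProfile can see into the compiled function):

```python
import cProfile

import numpy

import distance  # compiled from the attached distance.pyx

# 300 vectors give 300 * 299 / 2 = 44850 pairs, matching the per-pair
# call counts in the table above; the NaN fraction here is invented.
data = numpy.random.rand(300, 50)
data[data < 0.1] = numpy.nan

cProfile.run("distance.masked_euclidean(data)", sort="tottime")
```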
Quick reply from my phone: isn't numpy.take() faster than fancy indexing?

-- M

Sent from my iPhone
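Spelled out, the suggestion might look like this (an untested sketch; the helper name is illustrative):

```python
import numpy

def masked_pair(data, mask, i, j):
    """Gather the coordinates where rows i and j are both finite."""
    # Convert the boolean mask into integer positions once, then gather
    # with numpy.take, which is typically faster than boolean fancy indexing.
    idx = numpy.flatnonzero(mask[i] & mask[j])
    u = numpy.take(data[i], idx)
    v = numpy.take(data[j], idx)
    return u, v
```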
On 23 Jul 2014, at 11:51, Moritz Beber <moritz.beber@gmail.com> wrote:
[...]
So I've made significant headway on Cythonizing a pdist function that ignores NaNs. You can see the results here: http://nbviewer.ipython.org/gist/Midnighter/b81d5732a0ef88f2e185

Two questions remain:

1. Can I somehow make use of the distance measures defined in scipy/spatial/src/distance.c?
2. Does anyone know if numexpr could be used to compute the above pairwise distances in parallel?

Thank you again.
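For readers without access to the notebook, this is roughly the shape such a function takes once everything is typed: the mask, the boolean subsetting, and the call to `norm` all collapse into one plain C loop. A minimal sketch, assuming Euclidean distance and a C-contiguous double array (illustrative, not the notebook's exact code):

```cython
cimport cython
from libc.math cimport sqrt, isnan

import numpy


@cython.boundscheck(False)
@cython.wraparound(False)
def nan_euclidean_pdist(double[:, ::1] data):
    cdef Py_ssize_t m = data.shape[0]
    cdef Py_ssize_t n = data.shape[1]
    cdef Py_ssize_t i, j, col, k = 0
    cdef double d, s
    dm = numpy.empty(m * (m - 1) // 2, dtype=numpy.double)
    cdef double[::1] out = dm
    for i in range(m - 1):
        for j in range(i + 1, m):
            s = 0.0
            for col in range(n):
                # Skip coordinates that are missing in either vector.
                if isnan(data[i, col]) or isnan(data[j, col]):
                    continue
                d = data[i, col] - data[j, col]
                s += d * d
            out[k] = sqrt(s)
            k += 1
    return dm
```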
participants (3)

- Matthias Bussonnier
- Moritz Beber
- Moritz Emanuel Beber