Dear all,

I would like to compute pairwise distances between vectors that contain missing values. You can find more detail in my question on SO (http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in-scipy-with-missing-values). Since this does not seem to be directly possible with scipy at the moment, I started to Cythonize my function. Currently, the function below is not much faster than my pure Python implementation, so I thought I'd ask the experts here. Note that although I'm computing the euclidean distance here, I'd eventually like to support other distance metrics as well.
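
To make the problem concrete, here is a minimal pure NumPy sketch of the computation. It assumes that missing values are stored as NaN (or any other non-finite value), and the name masked_euclidean_py is only illustrative:

```python
import numpy as np


def masked_euclidean_py(data):
    """Condensed pairwise euclidean distances, skipping non-finite entries."""
    data = np.asarray(data, dtype=float)
    m = data.shape[0]
    mask = np.isfinite(data)  # True where a value is observed
    dm = np.zeros(m * (m - 1) // 2)
    k = 0
    for i in range(m - 1):
        for j in range(i + 1, m):
            both = mask[i] & mask[j]  # columns observed in both rows
            diff = data[i, both] - data[j, both]
            dm[k] = np.sqrt(np.dot(diff, diff))
            k += 1
    return dm
```

The result uses the same condensed layout as scipy's pdist: one entry per pair (i, j) with i < j, in row-major order.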

So my current attempt at Cythonizing is:

import numpy
cimport numpy
cimport cython
from numpy.linalg import norm

numpy.import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
def masked_euclidean(numpy.ndarray[numpy.double_t, ndim=2] data):
    cdef Py_ssize_t m = data.shape[0]
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef Py_ssize_t k = 0
    cdef numpy.ndarray[numpy.double_t] dm = numpy.zeros(m * (m - 1) // 2, dtype=numpy.double)
    cdef numpy.ndarray[numpy.uint8_t, ndim=2, cast=True] mask = numpy.isfinite(data) # boolean
    for i in range(m - 1):
        for j in range(i + 1, m):
            curr = numpy.logical_and(mask[i], mask[j])
            u = data[i][curr]
            v = data[j][curr]
            dm[k] = norm(u - v)
            k += 1
    return dm

Maybe the lack of speed-up is due to calling the Python function 'norm' inside the inner loop? So my question is: how can I improve the Cython implementation? Or is there a completely different way of approaching this problem?
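
One different approach that might be worth considering, at least for the euclidean case, is to drop the explicit loop and expand the masked squared difference into matrix products. The following is only a sketch, assuming that missing values are non-finite. The name masked_euclidean_vec is hypothetical, and the trick does not carry over to arbitrary metrics:

```python
import numpy as np


def masked_euclidean_vec(data):
    """Loop-free masked euclidean distances in condensed form."""
    data = np.asarray(data, dtype=float)
    mask = np.isfinite(data)
    obs = mask.astype(float)       # 1.0 where observed, 0.0 where missing
    x = np.where(mask, data, 0.0)  # missing entries replaced by zero
    sq = x * x
    # d2[i, j] = sum_k obs[i, k] * obs[j, k] * (x[i, k] - x[j, k]) ** 2,
    # expanded into three matrix products.
    d2 = sq.dot(obs.T) + obs.dot(sq.T) - 2.0 * x.dot(x.T)
    np.maximum(d2, 0.0, out=d2)    # clip tiny negative rounding errors
    iu = np.triu_indices(data.shape[0], k=1)  # pairs (i, j) with i < j
    return np.sqrt(d2[iu])
```

This builds the full m-by-m matrix before extracting the upper triangle, so it trades memory for speed, and the expanded form can lose some precision when distances are small relative to the magnitude of the values.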

Thanks in advance,

Moritz