Hello again,

(I somehow lost the ability to reply to your message, Matthias, since I get the mails in digest mode; apologies for that.)

So I've poked around the code with profilers and got two conflicting pieces of information. I now use a .pyx file (see attachment) and profiled it in two different ways:

1. Using cProfile, which gave the following results:

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    1.641    1.641    2.303    2.303  distance.pyx:13(masked_euclidean)
     44850    0.294    0.000    0.662    0.000  linalg.py:1924(norm)
     44850    0.292    0.000    0.292    0.000  {method 'reduce' of 'numpy.ufunc' objects}
     44850    0.041    0.000    0.041    0.000  {numpy.core.multiarray.array}
     44850    0.023    0.000    0.065    0.000  numeric.py:392(asarray)
     44850    0.012    0.000    0.012    0.000  {method 'conj' of 'numpy.ndarray' objects}
         1    0.000    0.000    2.303    2.303  <string>:1(<module>)
         1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

   This leads me to believe that, yes, the boolean subsetting takes a significant amount of time, but the majority is spent in `norm`.

2. I also ran valgrind, and its output seems to suggest that 55% of the time is spent in the boolean subsetting (the callgrind log is here: https://dl.dropboxusercontent.com/u/51564502/callgrind.log). Or am I reading the results wrong?

3. I couldn't get the %lprun magic (line_profiler) to work in the IPython notebook; I just get 0 time for the whole function call. Is this possible somehow by now?

So my questions at this point are: can I improve the fancy indexing somehow? And can I easily plug in the scipy distance measures so that I avoid the call to numpy.linalg.norm?

Thank you so much,
Moritz
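P.S. To make the questions concrete, here is a pure-NumPy toy version of the hot path, plus two variants that might sidestep the overhead. The names are illustrative only and are not the actual code from distance.pyx:

```python
import numpy as np

def masked_euclidean_norm(x, y, mask):
    # Baseline: boolean fancy indexing (makes copies) + np.linalg.norm
    return np.linalg.norm(x[mask] - y[mask])

def masked_euclidean_dot(x, y, mask):
    # Variant 1: sqrt of a dot product, avoiding norm()'s
    # per-call Python-level setup (asarray, conj, dispatch)
    d = x[mask] - y[mask]
    return np.sqrt(np.dot(d, d))

def masked_euclidean_mul(x, y, mask):
    # Variant 2: zero out the masked-off entries instead of
    # gathering them, avoiding the fancy-indexing copy
    d = (x - y) * mask
    return np.sqrt(np.dot(d, d))
```

All three should agree numerically; whether variant 2 actually wins depends on how sparse the mask is, since it always touches the full arrays.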