On Wed, Aug 13, 2014 at 8:08 AM, Moritz Beber <moritz.beber@gmail.com> wrote:
Dear all,

As suggested in this github issue (https://github.com/scipy/scipy/issues/3870), I would like to discuss the merit of introducing a new function nanpdist into scipy.spatial. I have also brought up the problem in the following previous e-mail (http://comments.gmane.org/gmane.comp.python.scientific.devel/18956) and on SO (http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in-scipy-with-missing-values).

Warren suggested three ways to tackle this problem:
  1. Don't change anything--the users should clean up their data!
  2. nanpdist
  3. Add a keyword argument to pdist that determines how nan should be treated.

Warren has already pointed this out, but let me insist: what is nanpdist, or the nan keyword, expected to do? Treat pairs of vectors with NaNs as lower dimensional, removing the entries where either is NaN? Do those results make any real sense? Thinking of Euclidean distance for points in 3D space, I have trouble imagining a practical situation where "if any Z coordinate is missing, just give me the distance of the projections onto the XY plane" would be anything but a misleading result. I presume the case is different for all those other distances I have never needed to use, so I am just curious about the use case.
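
To make the concern concrete, here is a minimal sketch of the "drop any coordinate where either value is NaN" semantics. The helper nan_euclidean is made up for illustration; it is not an existing or proposed scipy function, and the points are arbitrary.

```python
import numpy as np


def nan_euclidean(u, v):
    """Hypothetical helper: Euclidean distance over the coordinates
    where neither u nor v is NaN."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    valid = ~(np.isnan(u) | np.isnan(v))  # True where both entries are present
    return np.sqrt(np.sum((u[valid] - v[valid]) ** 2))


a = np.array([0.0, 0.0, 0.0])
b = np.array([3.0, 4.0, 12.0])
b_missing_z = np.array([3.0, 4.0, np.nan])  # same point, Z unknown

full = nan_euclidean(a, b)                 # distance in 3D
projected = nan_euclidean(a, b_missing_z)  # distance of the XY projections
print(full, projected)
```

With these values the 3D distance is 13 while the NaN-skipping one is 5, and nothing in the returned number tells the caller that a coordinate was dropped.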

Looking at your linked post, from an implementation point of view, the low-level function that actually does the heavy lifting should probably take a 'where' kwarg, as numpy ufuncs already do (http://docs.scipy.org/doc/numpy/reference/ufuncs.html#optional-keyword-arguments), rather than hardcode a check for NaN-ness, with the masking array built in a higher-level wrapper. This would make it easier to eventually make this functionality work with masked arrays or the like.
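
Roughly what I have in mind, as a sketch only: all three function names below are hypothetical, and the real low-level routines in scipy.spatial are compiled code, not Python. The point is that the kernel only knows about a boolean mask, while each wrapper decides what "missing" means.

```python
import numpy as np


def euclidean_where(u, v, where=None):
    """Hypothetical low-level kernel: only coordinates where `where`
    is True contribute to the distance."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if where is None:
        where = np.ones(u.shape, dtype=bool)
    diff = np.zeros_like(u)
    # The ufunc 'where' kwarg skips masked-out entries, leaving them at 0.
    np.subtract(u, v, out=diff, where=where)
    return np.sqrt(np.dot(diff, diff))


def nan_euclidean_wrapper(u, v):
    """Hypothetical wrapper: builds the mask from NaN-ness."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return euclidean_where(u, v, where=~(np.isnan(u) | np.isnan(v)))


def masked_euclidean_wrapper(u, v):
    """Hypothetical wrapper: builds the mask from numpy.ma masks."""
    keep = ~(np.ma.getmaskarray(u) | np.ma.getmaskarray(v))
    return euclidean_where(np.ma.getdata(u), np.ma.getdata(v), where=keep)


d_nan = nan_euclidean_wrapper([0.0, 0.0, 0.0], [3.0, 4.0, np.nan])
d_masked = masked_euclidean_wrapper(
    np.ma.array([0.0, 0.0, 0.0]),
    np.ma.array([3.0, 4.0, 99.0], mask=[False, False, True]),
)
print(d_nan, d_masked)
```

Both wrappers reuse the same kernel unchanged, which is the reason for keeping the NaN check out of the low-level function.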

As a separate but related issue, I have had this PR open for almost a year now, https://github.com/scipy/scipy/pull/3163, and although my saying I want to complete it is getting old, hopefully whatever you have in mind can fit with its general structure.

Lastly, whatever you go for, I don't think you should do anything to pdist that you don't also do for cdist and the individual distance functions.

Jaime

--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.