On Wed, Aug 13, 2014 at 11:08 AM, Moritz Beber <moritz.beber@gmail.com> wrote:
Dear all,

As suggested in this github issue (https://github.com/scipy/scipy/issues/3870), I would like to discuss the merit of introducing a new function nanpdist into scipy.spatial. I have also raised the problem in a previous e-mail (http://comments.gmane.org/gmane.comp.python.scientific.devel/18956) and on Stack Overflow (http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in-scipy-with-missing-values).

Warren suggested three ways to tackle this problem:
  1. Don't change anything--the users should clean up their data!
  2. nanpdist
  3. Add a keyword argument to pdist that determines how nan should be treated.

Clearly, I don't favor the first option, since I believe missing values can be important pieces of information, too. I lean slightly towards option two, because adding a keyword would further complicate an already very long pdist function.

I'm happy to submit a pull request if there is a consensus that something should be done.

Best,

Moritz


There are two parts to this:

(1)  What is the new calculation for handling nans?
(2)  What is the API for accessing the new calculation?

Before getting into the API (i.e. nanpdist vs. keyword vs. whatever),
I'd like to better understand (1).

Here's a normal use of pdist (no nans):


In [158]: set_printoptions(precision=2)

In [159]: x = np.arange(1., 11).reshape(-1,2)

In [160]: x
Out[160]:
array([[  1.,   2.],
       [  3.,   4.],
       [  5.,   6.],
       [  7.,   8.],
       [  9.,  10.]])

In [161]: pdist(x)
Out[161]:
array([  2.83,   5.66,   8.49,  11.31,   2.83,   5.66,   8.49,   2.83,
         5.66,   2.83])



And here's how pdist currently handles nans:

In [162]: y = x.copy()

In [163]: y[0,1] = nan

In [164]: y[1,0] = nan

In [165]: y
Out[165]:
array([[  1.,  nan],
       [ nan,   4.],
       [  5.,   6.],
       [  7.,   8.],
       [  9.,  10.]])

In [166]: pdist(y)
Out[166]: array([  nan,   nan,   nan,   nan,   nan,   nan,   nan,  2.83,  5.66,  2.83])



That is, *any* distance involving a point that has a nan is nan.
This seems like a reasonable default behavior.

What should nanpdist(y) be?

Based on your code snippet on StackOverflow and your comment in the github
issue, my understanding is this: for any pair, you ignore the coordinates
where either has a nan (i.e. compute the distance in a lower dimension).
In this case, nanpdist(y) would be

    [nan, 4, 6, 8, 2, 4, 6, 2.83, 5.66, 2.83]

(I'm not sure if you would put nan or something else in that first position.)

Or, if we use the scaling of `n/(n - p)` that you suggested in the github issue,
where n is the dimension of the observations and p is the number of "missing"
coordinates,

    [nan, 8, 12, 16, 4, 8, 12, 2.83, 5.66, 2.83]

Is that correct?
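For concreteness, here's one way that calculation might be sketched in pure numpy. This is just an illustration of my understanding of the proposal, not an existing scipy function, and `nan_euclidean_pdist` (along with its `scale` keyword) is a hypothetical name:

```python
import numpy as np

def nan_euclidean_pdist(x, scale=False):
    """Pairwise Euclidean distances, ignoring coordinates where either
    observation has a nan.  With scale=True, each distance is multiplied
    by n/(n - p), where n is the dimension of the observations and p is
    the number of ignored coordinates for that pair (hypothetical sketch,
    following the github issue)."""
    x = np.asarray(x, dtype=float)
    n_obs, n_dim = x.shape
    out = []
    for i in range(n_obs):
        for j in range(i + 1, n_obs):
            # keep only the coordinates that are non-nan in *both* rows
            valid = ~np.isnan(x[i]) & ~np.isnan(x[j])
            n_valid = valid.sum()
            if n_valid == 0:
                # no coordinates left to compare; fall back to nan
                out.append(np.nan)
                continue
            d = np.sqrt(((x[i, valid] - x[j, valid]) ** 2).sum())
            if scale:
                d *= n_dim / n_valid  # the n/(n - p) factor
            out.append(d)
    return np.array(out)
```

With the `y` from the session above, `nan_euclidean_pdist(y)` gives the first list ([nan, 4, 6, 8, 2, 4, 6, 2.83, 5.66, 2.83]) and `nan_euclidean_pdist(y, scale=True)` gives the second.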

What's the use-case for this behavior?  How widely used is it?


Warren



_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev