Answers to Warren's post:



That is, *any* distance involving a point that has a nan is nan.
This seems like a reasonable default behavior.

I agree; it is the way most functions in other languages/packages handle it.
 

What should nanpdist(y) be?

Based on your code snippet on StackOverflow and your comment in the GitHub
issue, my understanding is this: for any pair, you ignore the coordinates
where either point has a nan (i.e., compute the distance in a lower dimension).
In this case, nanpdist(y) would be

    [nan, 4, 6, 8, 2, 4, 6, 2.83, 5.66, 2.83]

(I'm not sure if you would put nan or something else in that first position.)

Or, if we use the scaling of `n/(n - p)` that you suggested in the GitHub issue,
where n is the dimension of the observations and p is the number of "missing"
coordinates,

    [nan, 8, 12, 16, 4, 8, 12, 2.83, 5.66, 2.83]

Is that correct?

Yes, that is what I suggest; see the sketch below for how I read it. The appropriate scaling would have to be checked and discussed in detail, as it may differ between distance and similarity measures.
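
To make sure we are talking about the same thing, here is a rough sketch
(Euclidean only; the array `y` is hypothetical, just chosen so that it
reproduces the numbers quoted above):

    import numpy as np

    def nanpdist_euclidean(X, rescale=False):
        # Condensed distances as in pdist, but for each pair the sum of
        # squares runs only over coordinates where both points are valid.
        # With rescale=True, each distance is multiplied by n / (n - p),
        # where n is the dimension and p is the number of coordinates
        # dropped for that pair; this is just one possible convention and
        # exactly the part that would need discussion.
        n_obs, n_dim = X.shape
        out = []
        for i in range(n_obs):
            for j in range(i + 1, n_obs):
                mask = ~(np.isnan(X[i]) | np.isnan(X[j]))
                kept = mask.sum()
                if kept == 0:
                    out.append(np.nan)  # no shared valid coordinates
                    continue
                diff = X[i, mask] - X[j, mask]
                d = np.sqrt((diff * diff).sum())
                if rescale:
                    d *= n_dim / kept  # n / (n - p)
                out.append(d)
        return np.array(out)

    y = np.array([[1.0, np.nan],
                  [np.nan, 1.0],
                  [5.0, 3.0],
                  [7.0, 5.0],
                  [9.0, 7.0]])
    print(np.round(nanpdist_euclidean(y), 2))
    # -> nan, 4, 6, 8, 2, 4, 6, 2.83, 5.66, 2.83
    print(np.round(nanpdist_euclidean(y, rescale=True), 2))
    # -> nan, 8, 12, 16, 4, 8, 12, 2.83, 5.66, 2.83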
 

What's the use-case for this behavior?  How widely used is it?
 
 
I work in bioinformatics, and my data set consists of thousands of vectors corresponding to different treatment parameters. Each vector essentially contains the changes in expression levels of a number of genes. I am interested in clustering the treatments, i.e., determining which treatments induce similar gene expression patterns. Not every treatment leads to significant expression changes, of course, which is why there are missing values. The vectors have roughly 3000 elements, and most of them have about 200 missing values.

The data are scaled to follow a normal distribution, so I could just replace the missing values with the mean and be done with it, but I don't think that's the correct approach. I also don't want the current pdist behavior, as it would disregard the majority of my otherwise perfectly valid data.

As to the popularity of this use case: clustering of gene expression data is very widespread; however, usually all gene expression data are considered, and thus every treatment consists of a completely filled vector. I can't claim that my current use case is very popular; it's a somewhat new approach.

If you think that this behavior has no place in scipy, no problem at all.

Best,
Moritz