Thank you for your response, Nathaniel.

I was a bit concerned that going into the application would turn this into a discussion about the method rather than about whether this is a desirable concept for scipy. I suppose it's not possible to fully separate the two issues, so I will indulge you.

On Fri, Aug 22, 2014 at 4:33 PM, Nathaniel Smith <njs@pobox.com> wrote:

> Just as a scientific issue this seems very odd to me and not at all
> what statisticians usually mean by missing data. Surely if you want to
> determine "which treatments introduce similar gene expression
> patterns" then two treatments that both produce no effect on the
> expression of the same gene should be counted as more similar to each
> other? If you've measured an expression change to be near 0 then
> that's a known measured value that happens to be near 0 -- not an
> unknown value that could be arbitrarily large or small and you have no
> idea which. (Obviously I don't know any of the details about your
> setting, but in particular I worry that your reasoning sounds similar
> to common misconceptions about what "significant" actually means. "Not
> significantly different from zero" might well be "significantly
> different from 1000".)

Since I didn't want the discussion to be about the method, I tried to describe the situation briefly and did not give you the whole story. My apologies.

The real situation is the following. The gene expression data are mapped onto pathways using information on links between proteins and coding genes. The pathway definitions come from a multitude of source databases and were collected in a single database (http://consensuspathdb.org/). Each pathway that has five or more available scores is assigned a mean score; pathways with fewer scores are not considered (the cut-off is somewhat arbitrary, I suppose). You can read up on the specifics in [1]. I treat the pathways that did not make the cut-off of five scores as "missing values".

If all the treatments had missing values at the same pathways, I'd be tempted to just throw those pathways out. However, we are considering treatments from different studies, and the studies report gene expression changes for different genes, so different pathways end up having no scores. I still want to be able to compare treatments between different studies.

One approach could be to rethink the scoring of pathways and introduce an uncertainty that is larger for pathways with missing scores. But since I'm sitting at the end of a pipeline that lands the treatments and pathway response scores in my lap, my preferred way of dealing with this is to simply scale up the distance between treatments where one has a pathway score and it's missing for the other.
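
To make the idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the actual pipeline code: the function names `pathway_score` and `scaled_distance` are hypothetical, missing scores are represented as NaN, the base metric is assumed to be Euclidean, and the scaling factor (pathways scored in at least one treatment divided by pathways scored in both) is just one possible way to "scale up" the distance.

```python
import numpy as np

MIN_SCORES = 5  # cut-off described above: five or more available scores


def pathway_score(gene_scores, min_scores=MIN_SCORES):
    """Mean of the available gene scores of one pathway.

    Returns NaN ("missing value") if fewer than `min_scores`
    scores are available. (Hypothetical helper.)
    """
    scores = np.asarray(gene_scores, dtype=float)
    available = scores[~np.isnan(scores)]
    if available.size < min_scores:
        return np.nan
    return available.mean()


def scaled_distance(u, v):
    """Euclidean distance over pathways scored in both treatments,
    scaled up when a pathway is scored in only one of them.

    Assumed scaling: squared distance times
    (# pathways scored in either) / (# pathways scored in both).
    Pathways missing in both treatments are ignored.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    have_u = ~np.isnan(u)
    have_v = ~np.isnan(v)
    shared = have_u & have_v   # both treatments have a score
    either = have_u | have_v   # at least one treatment has a score
    n_shared = shared.sum()
    if n_shared == 0:
        return np.nan          # nothing to compare
    squared = np.sum((u[shared] - v[shared]) ** 2)
    return np.sqrt(squared * either.sum() / n_shared)


# Two treatments from different studies, four pathways each.
a = [0.0, 0.0, np.nan, np.nan]
b = [3.0, 4.0, 1.0, np.nan]
d = scaled_distance(a, b)  # larger than the plain distance of 5 over the shared pathways
```

With no missing scores the factor is 1 and this reduces to the ordinary Euclidean distance, so complete pairs of treatments are unaffected.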

If this seems unreasonable to you, I'm all ears. It does make sense in my mind.

Cheers,
Moritz

[1] http://toxsci.oxfordjournals.org/content/124/2/278.full, in particular the subsection "pathway response analysis"