[SciPy-Dev] Proposal for a new function nanpdist that treats NaNs as missing values

Nathaniel Smith njs at pobox.com
Fri Aug 22 10:33:16 EDT 2014


On Thu, Aug 14, 2014 at 10:24 AM, Moritz Beber <moritz.beber at gmail.com> wrote:
> I work in bioinformatics and my data set consists of thousands of vectors
> corresponding to different treatment parameters. Each vector consists of
> basically the changes in expression levels of a number of genes. I am
> interested in clustering the treatments, i.e., determine which treatments
> introduce similar gene expression patterns. Not every treatment leads to
> significant expression changes, of course, which is why there are missing
> values. So the vectors have roughly 3000 elements and most of them have
> about 200 missing values.

Just as a scientific issue, this seems very odd to me, and not at all
what statisticians usually mean by missing data. Surely if you want to
determine "which treatments introduce similar gene expression
patterns" then two treatments that both produce no effect on the
expression of the same gene should be counted as more similar to each
other? If you've measured an expression change to be near 0 then
that's a known measured value that happens to be near 0 -- not an
unknown value that could be arbitrarily large or small and you have no
idea which. (Obviously I don't know any of the details about your
setting, but in particular I worry that your reasoning sounds similar
to common misconceptions about what "significant" actually means. "Not
significantly different from zero" might well be "significantly
different from 1000".)
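
To make the concern concrete, here is a minimal sketch in NumPy. All
numbers are made up for illustration, and both nan_euclidean and
mask_small are hypothetical helpers (neither is an existing SciPy
function, nor the proposed nanpdist). It assumes "missing" means
"expression change below some threshold was replaced by NaN", and that
the NaN-aware distance simply skips any coordinate where either vector
is NaN.

```python
import numpy as np


def nan_euclidean(u, v):
    """Hypothetical helper: Euclidean distance over the coordinates
    where both vectors are non-NaN (NaN pairs are skipped)."""
    mask = ~(np.isnan(u) | np.isnan(v))
    return float(np.sqrt(np.sum((u[mask] - v[mask]) ** 2)))


def mask_small(x, threshold=0.1):
    """Hypothetical preprocessing: replace small changes with NaN."""
    out = x.copy()
    out[np.abs(x) < threshold] = np.nan
    return out


# Made-up expression changes for three treatments over four genes.
a = np.array([2.0, 0.01, -0.02, 1.5])   # no effect on genes 1 and 2
b = np.array([2.1, 0.00, 0.01, 1.4])    # no effect on genes 1 and 2
c = np.array([2.1, 5.00, -4.00, 1.4])   # large effect on genes 1 and 2

# Distances when small changes are treated as missing.
d_ab_nan = nan_euclidean(mask_small(a), mask_small(b))
d_ac_nan = nan_euclidean(mask_small(a), mask_small(c))

# Distances when the measured near-zero values are kept.
d_ab_full = float(np.linalg.norm(a - b))
d_ac_full = float(np.linalg.norm(a - c))

print("NaN-masked:", d_ab_nan, d_ac_nan)
print("measured:  ", d_ab_full, d_ac_full)
```

With the masked data, a looks exactly as close to c as it does to b,
because genes 1 and 2 are dropped from both comparisons. Keeping the
measured near-zero values is what lets the distance reflect that a and
b agree on those genes while c does not.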

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


