[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef
njs at pobox.com
Thu Sep 26 13:21:42 EDT 2013
On 26 Sep 2013 17:32, <josef.pktd at gmail.com> wrote:
> On Thu, Sep 26, 2013 at 7:35 AM, Nathaniel Smith <njs at pobox.com> wrote:
> > By textbook I mean, users expect corrcoef to use this formula, which
> > is printed in every textbook:
> > The vast majority of people using correlations think that "sample
> > correlation" just means this number, not "some arbitrary finite-sample
> > estimator of the underlying population correlation". So the obvious
> > interpretation of pairwise correlations is that you apply that formula
> > to each set of pairwise complete observations.
> This textbook version **assumes** that we have the same observations
> for all/both variables, and doesn't say what to do if not.
> I'm usually mainly interested in the covariance/correlation matrix for
> estimating some underlying population or model parameters, or for doing
> hypothesis tests with them.
> I just wanted to point out that there is no "obvious" ("There should
> be one-- ...") way to define pairwise deletion correlation matrices.
Yeah, fair enough.
> But maybe just doing a loop [corrcoef(x, y) for x in data for y in
> data] still makes the most sense. Dunno
I'm not 100% sure what the best answer is either, but it seems we agree
that these are the only reasonable options:
(1) refuse to give correlations if there are missing values
(2) the pairwise version pandas/R do
(3) maybe something in between (like only including fully complete rows, or
giving an option to pick between these)
But the key thing here is that the current behaviour is definitely *wrong*
and misleading people, so we'd better do something about that. (And if no one
pops up to fix it maybe we should just remove the function entirely from
1.8, because numerically wrong answers are Serious Business?)
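
For concreteness, here is a rough sketch of what option (2) could look like.
It is only an illustration, not a proposed patch: the name
`pairwise_corrcoef` is made up, missing values are assumed to be encoded as
NaN (rather than as a masked array), variables are assumed to be in rows as
with np.corrcoef, and pairs with fewer than two shared observations or zero
variance are simply left as NaN.

```python
import numpy as np


def pairwise_corrcoef(data):
    """Pairwise-complete Pearson correlation (hypothetical helper).

    For each pair of rows, apply the usual sample correlation formula
    using only the columns where *both* rows are non-NaN.
    """
    data = np.asarray(data, dtype=float)
    nvars = data.shape[0]
    out = np.full((nvars, nvars), np.nan)
    valid = ~np.isnan(data)

    for i in range(nvars):
        for j in range(i, nvars):
            both = valid[i] & valid[j]
            if both.sum() < 2:
                continue  # too few shared observations; leave NaN
            x = data[i, both]
            y = data[j, both]
            xd = x - x.mean()
            yd = y - y.mean()
            denom = np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
            if denom == 0:
                continue  # a constant variable has no defined correlation
            out[i, j] = out[j, i] = (xd * yd).sum() / denom
    return out


# Small example: each row is a variable, NaN marks a missing observation.
example = np.array([
    [1.0, 2.0, 3.0, 4.0, np.nan],
    [2.0, 4.0, 6.0, np.nan, 10.0],
    [4.0, 3.0, 2.0, 1.0, 0.0],
])
r = pairwise_corrcoef(example)
```

Because every entry comes from the plain formula applied to one consistent
set of observations, each entry is guaranteed to lie in [-1, 1]. As noted
above, though, different entries may be based on different subsets of the
data, so the matrix as a whole need not be positive semidefinite.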