<div dir="ltr">Hi everyone,<div><br></div><div style>I'm using np.ma.corrcoef to compute correlation coefficients between the rows of a masked matrix, where the masked elements represent missing data. In some cases, np.ma.corrcoef returns invalid coefficients that are greater than 1 or less than -1.</div>
<div style><br></div><div style>Here's an example:</div><div style><br></div><div style>x = np.array([[ 7, -4, -1, -7, -3, -2],<br></div><div style><div> [ 6, -3, 0, 4, 0, 5],</div><div> [-4, -5, 7, 5, -7, -7],</div>
<div> [-5, 5, -8, 0, 1, 4]])</div><div><br></div><div>x_ma = np.ma.masked_less_equal(x, -5)<br></div><div><br></div><div><div><div>C = np.round(np.ma.corrcoef(x_ma), 2)</div><div><br></div><div>print(C)</div><div>
<br></div><div>[[1.0 0.73 -- -1.68]</div><div> [0.73 1.0 -0.86 -0.38]</div><div> [-- -0.86 1.0 --]</div><div> [-1.68 -0.38 -- 1.0]]</div></div></div><div><br></div><div style>As you can see, the [0,3] element is -1.68, which is not a valid correlation coefficient (valid coefficients must lie between -1 and 1).</div>
<div style><br></div><div style>I looked at the code for np.ma.corrcoef, and this behavior seems to come from the way the row means are computed and subtracted. Apparently, each row's mean is computed once, over that row's own unmasked elements, without also excluding the elements that are masked in the other row of the pair for which the coefficient is being computed.</div>
<div style><br></div><div style>I think the right approach would be to propagate the masks pairwise first, and then recompute each row's mean every time a correlation coefficient is computed for a pair of rows.</div>
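<div style><br></div><div style>To make the idea concrete, here is a minimal sketch of what I mean. The function name pairwise_masked_corrcoef is hypothetical (it is not part of NumPy), and requiring at least two shared unmasked columns per pair is my own assumption:</div>

```python
import numpy as np


def pairwise_masked_corrcoef(x_ma):
    """Correlate rows using, for each pair, only the columns that are
    unmasked in BOTH rows (hypothetical helper, not NumPy's code)."""
    data = np.ma.getdata(x_ma).astype(float)
    valid = ~np.ma.getmaskarray(x_ma)
    n = data.shape[0]
    # Start fully masked; entries are unmasked as they are filled in.
    out = np.ma.masked_all((n, n), dtype=float)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]      # propagate the pairwise mask
            if both.sum() < 2:              # assumption: need >= 2 shared samples
                continue
            a = data[i, both]
            b = data[j, both]
            a = a - a.mean()                # means recomputed on the shared columns
            b = b - b.mean()
            denom = np.sqrt((a * a).sum() * (b * b).sum())
            if denom == 0:                  # constant row on the overlap: undefined
                continue
            out[i, j] = out[j, i] = (a * b).sum() / denom
    return out


x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)
C = pairwise_masked_corrcoef(x_ma)
print(np.round(C, 2))
```

<div style>Because both the means and the variances come from the same shared columns, every unmasked entry stays within [-1, 1]. One caveat: a matrix built pair by pair like this is not guaranteed to be positive semidefinite.</div>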
<div style><br></div><div style>Please let me know what you think.</div><div style><br></div><div style>Thanks,</div><div style><br></div><div style>Faraz</div><div style><br></div><div style><br></div></div></div>