[SciPy-User] Generalized least square on large dataset

Thu Mar 8 11:32:42 EST 2012

On Thu, Mar 8, 2012 at 9:09 AM, Peter Cimermančič <
peter.cimermancic at gmail.com> wrote:

>
>>
>> I agree with Josef, the first thing that comes to mind is controlling
>> for spatial effects (which happens in various fields; ecology folks
>> worry about this a lot too).
>>
>> In this case, though, I think you may need to think more carefully
>> about whether your similarity measure is really appropriate. If your
>> matrix is uninvertible, then IIUC that means you think that you
>> effectively have less than 1000 distinct genomes -- some of your
>> genomes are "so similar" to other ones that they can be predicted
>> *exactly*.
>>
>
>
> That's exactly true - some of the bacteria are almost identical (I'll try
> filtering those and see if it changes anything).
>
>
>
>
>>
>> In terms of the underlying probabilistic model: you have some
>> population of bacteria genomes, and you picked 1000 of them to study.
>> Each genome you picked has some length, and it also has a number of
>> genes. The number of genes is determined probabilistically by taking
>> some linear function of the length, and then adding some Gaussian
>> noise. Your goal is to figure out what that linear function looks
>> like.
>>
>> In OLS, we assume that each of those Gaussian noise terms is IID. In
>> GLS, we assume that they're correlated. The way to think about this is
>> that we take 1000 uncorrelated IID gaussian samples, let's call this
>> vector "g", and then we mix them together by multiplying by a matrix
>> chol(V), chol(V)*g. (cholV) is the cholesky decomposition; it's
>> triangular, and chol(V)*chol(V)' = V.) So the noise added to each
>> measurement is a mixture of these underlying IID gaussian terms, and
>> bacteria that are more similar have noise terms that overlap more.
>>
>>
> I'm also unable to calculate chol of my V matrix, because it doesn't
> appear to be a positive definite. Any suggestion here?
>
>
>> If V is singular, then this means that the last k rows of chol(V) are
>> all-zero, which means that when you compute chol(V)*g, there are some
>> elements of g that get ignored entirely -- they don't mixed in at all,
>> and don't effect any bacteria. So, your choice of V is encoding an
>> assumption that there are really only, like, 900 noise terms and 1000
>> bacteria, and 'g' effectively only has 900 entries. So if you make the
>> measurements for the first 900 bacteria, you should be able to
>> reconstruct the full vector 'g', and then you can use it to compute
>> *exactly* what measurements you will see for the last 100 bacteria.
>> And also you can compute the linear relationship exactly. No need to
>> do any hypothesis tests on the result (and in fact you can't do any
>> hypothesis tests on the result, the math won't work), because you know
>> The Truth!
>>
>> Of course none of these assumptions are actually true. Your bacteria
>> are less similar to each other -- and your measurements more noisy --
>> than your V matrix claims.
>>
>> So you need a better way to compute V. The nice thing about the above
>> derivation -- and the reason I bothered to go through it -- is that it
>> tells you what entries in V mean, numerically. Ideally you should
>> figure out how to re-calibrate your similarity score so that bacteria
>> which are 0.5 similar have a covariance of 0.5 in their noise --
>> perhaps by calculating empirical covariances on other measures or
>> something, figuring out the best way to do this calibration will take
>> domain knowledge. The ecology folks might have some practical ideas on
>> how to calibrate such things.
>>
>> Or, you could just replace V with V+\lambda*I. That'd solve the
>> numerical problem, but you should be very suspicious of any p-values
>> you get out, since they are based on lies ;-).
>>
>
> Yes, p-values are something I'd eventually like to come close to.
>
>
I think Josef's suggestion would be a good place to start. The problem
seems to be that the similarity matrix isn't a correlation matrix. What
Josef suggests would give you some way of analysing what the actual
correlations are.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20120308/9e70ff33/attachment.html>