[SciPy-User] Generalized least square on large dataset

Thu Mar 8 10:04:05 EST 2012

On Thu, Mar 8, 2012 at 4:25 AM, Peter Cimermančič
<peter.cimermancic at gmail.com> wrote:
> To describe my problem into more details, I have a list of ~1000 bacterial
> genome lengths and number of certain genes for each one of them. I'd like to
> see if there is any correlation between genome lengths and number of the
> genes. It may look like an easy linear regression problem; however, one has
> to be a bit more careful as the measurements aren't sampled independently.
> Bacteria, whose genomes are similar, tend to also contain similar number of
> the genes. Bacterial similarity is what is described with matrix V - it
> contains similarity values for each pair of bacteria, ranging from 0 to 1.
>
> Anybody encountered similar problem already?

I agree with Josef, the first thing that comes to mind is controlling
for spatial effects (which happens in various fields; ecology folks
worry about this a lot too).

In this case, though, I think you may need to think more carefully
about whether your similarity measure is really appropriate. If your
matrix is uninvertible, then IIUC that means you think that you
effectively have less than 1000 distinct genomes -- some of your
genomes are "so similar" to other ones that they can be predicted
*exactly*.

In terms of the underlying probabilistic model: you have some
population of bacteria genomes, and you picked 1000 of them to study.
Each genome you picked has some length, and it also has a number of
genes. The number of genes is determined probabilistically by taking
some linear function of the length, and then adding some Gaussian
noise. Your goal is to figure out what that linear function looks
like.

In OLS, we assume that each of those Gaussian noise terms is IID. In
GLS, we assume that they're correlated. The way to think about this is
that we take 1000 uncorrelated IID gaussian samples, let's call this
vector "g", and then we mix them together by multiplying by a matrix
chol(V), chol(V)*g. (cholV) is the cholesky decomposition; it's
triangular, and chol(V)*chol(V)' = V.) So the noise added to each
measurement is a mixture of these underlying IID gaussian terms, and
bacteria that are more similar have noise terms that overlap more.

If V is singular, then this means that the last k rows of chol(V) are
all-zero, which means that when you compute chol(V)*g, there are some
elements of g that get ignored entirely -- they don't mixed in at all,
and don't effect any bacteria. So, your choice of V is encoding an
assumption that there are really only, like, 900 noise terms and 1000
bacteria, and 'g' effectively only has 900 entries. So if you make the
measurements for the first 900 bacteria, you should be able to
reconstruct the full vector 'g', and then you can use it to compute
*exactly* what measurements you will see for the last 100 bacteria.
And also you can compute the linear relationship exactly. No need to
do any hypothesis tests on the result (and in fact you can't do any
hypothesis tests on the result, the math won't work), because you know
The Truth!

Of course none of these assumptions are actually true. Your bacteria
are less similar to each other -- and your measurements more noisy --
than your V matrix claims.

So you need a better way to compute V. The nice thing about the above
derivation -- and the reason I bothered to go through it -- is that it
tells you what entries in V mean, numerically. Ideally you should
figure out how to re-calibrate your similarity score so that bacteria
which are 0.5 similar have a covariance of 0.5 in their noise --
perhaps by calculating empirical covariances on other measures or
something, figuring out the best way to do this calibration will take
domain knowledge. The ecology folks might have some practical ideas on
how to calibrate such things.

Or, you could just replace V with V+\lambda*I. That'd solve the
numerical problem, but you should be very suspicious of any p-values
you get out, since they are based on lies ;-).

-- Nathaniel