[SciPy-User] Generalized least square on large dataset

josef.pktd at gmail.com
Fri Mar 9 15:13:37 EST 2012


On Fri, Mar 9, 2012 at 2:46 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Fri, Mar 9, 2012 at 7:30 PM, Peter Cimermančič
> <peter.cimermancic at gmail.com> wrote:
>> Sure, please see attached. Bacteria.jpg is the plot we're talking about. As
>> you can see there is a nice correlation in the graph, but I'm afraid there
>> might be something like in the second figure (ives.jpg) going on. The second
>> figure is from Ives and Zhu; Statistics for correlated data: phylogenies,
>> space and time (2006).
>
> So in the figure from Ives and Zhu, the two variables do seem to be
> well-correlated across groups, but then within individual groups they
> aren't well-correlated. Is that what you're worried about -- that gene
> count and genome length might be correlated overall, but not within
> individual groups?
>
> Because GLS doesn't actually address that question. It lets you
> correct your p-values for the fact that similarity between bacteria
> means that you effectively have somewhat less data than it would
> otherwise appear, and thus your p-values should be larger than they
> would be in a naive analysis. But it'd still be a p-value on whether
> the two variables are correlated overall. (Which they obviously
> are...)

I don't think there would be any problem with p-values for the overall
positive relationship. I would be surprised if any statistical method
failed to produce a very small p-value (a highly significant result)
for the slope.

Although there is a bit of bunching of points, I don't see any big
clusters that would indicate that the linear relationship is
different there. As for the size of the slope, I would guess a robust
estimator (statsmodels.RLM) would downweight the observations in the
upper part of the graph, those with a large count/length ratio. Are
they outliers among the short genomes?

I think Sturla has a point in that both count and length are positive.
It doesn't look relevant for length, but the counts bunch up just
above zero. That either creates a non-linearity or calls for another
distribution: log-normal (?), or Poisson (without zeros, or with
loc=1)?

Josef


>
> -- Nathaniel