[SciPy-User] Generalized least square on large dataset

Fri Mar 9 14:30:35 EST 2012

Sure, please see attached. Bacteria.jpg is the plot we're talking about. As
you can see, there is a nice correlation in the graph, but I'm afraid that
something like the situation in the second figure (ives.tiff) might be going
on. The second figure is from Ives and Zhu, "Statistics for correlated data:
phylogenies, space and time" (2006).

Peter

On Fri, Mar 9, 2012 at 9:51 AM, <josef.pktd at gmail.com> wrote:

> On Fri, Mar 9, 2012 at 12:43 PM, Peter Cimermančič
> <peter.cimermancic at gmail.com> wrote:
> >>
> >>
> >> No, it does not. If you are working with counts, the appropriate model
> >> would usually be Poisson regression, i.e. a generalized linear model
> >> with a log link function and the Poisson family. I have seen many
> >> examples of microbiologists using linear regression when they should
> >> actually use Poisson regression (e.g. counting genes) or logistic
> >> regression (e.g. dose-response and titration curves).
> >>
> >> This will do it for you:
> >>
> >> MATLAB: glmfit from the Statistics Toolbox
> >> R: glm
> >> SAS: PROC GENMOD
> >> Python: statsmodels scikit
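
A minimal sketch of the Poisson regression described above (log link, Poisson family), fitted by iteratively reweighted least squares using NumPy only. The data are simulated, and the names `fit_poisson_irls`, `genome_mb`, and `gene_counts` are hypothetical, not taken from the thread. In practice one of the library routines listed above would normally replace the hand-written fit.

```python
import numpy as np


def fit_poisson_irls(X, y, n_iter=50, tol=1e-10):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares.

    Assumes the first column of X is an intercept column of ones.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start at the intercept-only fit
    for _ in range(n_iter):
        eta = X @ beta            # linear predictor
        mu = np.exp(eta)          # expected counts under the log link
        z = eta + (y - mu) / mu   # working response
        XtW = X.T * mu            # working weights equal mu for Poisson/log
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Simulated example: gene counts that grow with genome length (made-up numbers).
rng = np.random.default_rng(0)
genome_mb = rng.uniform(1.0, 10.0, size=500)
X = np.column_stack([np.ones_like(genome_mb), genome_mb])
true_beta = np.array([0.5, 0.3])
gene_counts = rng.poisson(np.exp(X @ true_beta))

beta_hat = fit_poisson_irls(X, gene_counts)
print("intercept, slope on the log scale:", beta_hat)
```

The slope is on the log scale, so `exp(slope)` is the multiplicative change in expected count per unit of the predictor. This sketch treats the observations as independent, which is exactly the assumption questioned in this thread.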
> >>
> >> Another example of inappropriate use of linear regression in
> >> microbiology is the Lineweaver-Burk plot as a substitute for non-linear
> >> least-squares (usually Levenberg-Marquardt) to fit a Michaelis-Menten
> >> curve. Some microbiologists are aware of this, but they seem to prefer
> >> all sorts of ad hoc tricks like linearizations and
> >> variance-stabilizing transforms instead of "just doing it right".
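
To illustrate the contrast drawn above, here is a hedged sketch that fits a Michaelis-Menten curve by non-linear least squares with `scipy.optimize.curve_fit` and, for comparison, by the Lineweaver-Burk linearization. The substrate concentrations, noise level, and "true" parameters are invented for the example.

```python
import numpy as np
from scipy.optimize import curve_fit


def michaelis_menten(s, vmax, km):
    """Reaction rate v as a function of substrate concentration s."""
    return vmax * s / (km + s)


# Simulated measurements (made-up values, additive Gaussian noise).
rng = np.random.default_rng(1)
s = np.geomspace(0.25, 64.0, 12)
true_vmax, true_km = 10.0, 3.0
v = michaelis_menten(s, true_vmax, true_km) + rng.normal(0.0, 0.1, size=s.size)

# Non-linear least squares on the original scale. With no bounds given,
# curve_fit defaults to the Levenberg-Marquardt method.
(vmax_nls, km_nls), cov = curve_fit(
    michaelis_menten, s, v, p0=[v.max(), np.median(s)]
)

# Lineweaver-Burk: 1/v = (km/vmax) * (1/s) + 1/vmax, fitted by a straight line.
# Taking reciprocals inflates the noise at low concentrations.
slope, intercept = np.polyfit(1.0 / s, 1.0 / v, 1)
vmax_lb = 1.0 / intercept
km_lb = slope * vmax_lb

print("non-linear fit:  vmax=%.3f km=%.3f" % (vmax_nls, km_nls))
print("Lineweaver-Burk: vmax=%.3f km=%.3f" % (vmax_lb, km_lb))
```

The square roots of the diagonal of `cov` give approximate standard errors for the non-linear fit.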
> >>
> >> As for samples that are not independent, that will affect the final
> >> likelihood. If you want to write the log-likelihood yourself to
> >> control for this, getting ML estimates is easy: minimize the negative
> >> log-likelihood with fmin_powell or fmin_bfgs from scipy.optimize.
> >> (Powell's method does not even need the gradient.) And if you need the
> >> "p-value", you can either use the likelihood ratio or Monte Carlo
> >> (e.g. a permutation test).
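
The approach above can be sketched as follows, under loudly stated assumptions: a Gaussian model whose errors share a known correlation matrix `C` (here a made-up two-clade block structure standing in for phylogenetic relatedness), maximum likelihood via `fmin_powell`, and a likelihood-ratio test for the slope. None of the data or the correlation structure comes from the thread.

```python
import numpy as np
from scipy.optimize import fmin_powell
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 60

# Hypothetical correlation: two clades of 30 taxa, correlation 0.6 within a clade.
clade = np.repeat([0, 1], n // 2)
C = np.where(clade[:, None] == clade[None, :], 0.6, 0.0)
np.fill_diagonal(C, 1.0)

# Simulated predictor (centered to decorrelate intercept and slope) and response.
x = rng.uniform(1.0, 10.0, size=n)
xc = x - x.mean()
y = 2.0 + 0.5 * xc + np.linalg.cholesky(C) @ rng.normal(size=n)

C_inv = np.linalg.inv(C)
_, logdet_C = np.linalg.slogdet(C)


def neg_loglik(params, with_slope):
    """Negative log-likelihood of y ~ N(b0 + b1*xc, sigma^2 * C)."""
    if with_slope:
        b0, b1, log_sigma = params
    else:
        b0, log_sigma = params
        b1 = 0.0
    resid = y - b0 - b1 * xc
    sigma2 = np.exp(2.0 * log_sigma)
    quad = resid @ C_inv @ resid / sigma2
    return 0.5 * (n * np.log(2.0 * np.pi * sigma2) + logdet_C + quad)


# Powell's method needs only function values, no gradient.
full = fmin_powell(neg_loglik, [y.mean(), 0.0, 0.0], args=(True,),
                   xtol=1e-8, ftol=1e-8, disp=False)
null = fmin_powell(neg_loglik, [y.mean(), 0.0], args=(False,),
                   xtol=1e-8, ftol=1e-8, disp=False)

# Likelihood-ratio statistic; approximately chi-square with 1 df under the null.
lr_stat = 2.0 * (neg_loglik(null, False) - neg_loglik(full, True))
p_value = chi2.sf(lr_stat, df=1)
print("slope estimate:", full[1], "LR statistic:", lr_stat, "p-value:", p_value)
```

With real data the hard part is choosing `C`; the optimization and the test stay the same. A permutation test would replace the chi-square reference distribution, but naive shuffling also assumes exchangeable observations, so it would need to respect the dependence structure.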
> >>
> >
> > Sturla, could you be more specific here? I don't know much about
> > (bio)statistics, but that doesn't mean I don't want to do the things
> > right :). All I want to get out of this analysis is to be able to say
> > whether the correlation between genome lengths and numbers of particular
> > genes (which looks neat and obvious from the scatter plot) is
> > statistically significant given that the data points are heavily
> > phylogenetically biased. That's why I mentioned "p-values". Of course,
> > I'm open to any better/more accurate way of getting there than initially
> > planned.
>
> Peter, could you post a scatter plot of your data (with axis ticks and
> labels) so we get an idea of what your data look like?
>
> I have no idea at all about the bio topic.
>
> Josef
>
> >
> >
> >
> >
> >>
> >>
> >> Sturla
> >>
> >>
> >
> > _______________________________________________
> > SciPy-User mailing list
> > SciPy-User at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-user
> >
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ives.tiff
Type: image/tiff
Size: 55288 bytes
Desc: not available
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20120309/9e156449/attachment.tiff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bacteria.jpg
Type: image/jpeg
Size: 37352 bytes
Desc: not available
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20120309/9e156449/attachment.jpg>