[Numpy-discussion] lstsq functionality

Charles R Harris charlesr.harris at gmail.com
Tue Jul 20 01:08:36 EDT 2010


On Mon, Jul 19, 2010 at 9:40 PM, Keith Goodman <kwgoodman at gmail.com> wrote:

> On Mon, Jul 19, 2010 at 8:27 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Mon, Jul 19, 2010 at 9:02 PM, Keith Goodman <kwgoodman at gmail.com>
> wrote:
> >>
> >> On Mon, Jul 19, 2010 at 6:53 PM, Joshua Holbrook
> >> <josh.holbrook at gmail.com> wrote:
> >> > On Mon, Jul 19, 2010 at 5:50 PM, Charles R Harris
> >> > <charlesr.harris at gmail.com> wrote:
> >> >> Hi All,
> >> >>
> >> >> I'm thinking about adding some functionality to lstsq because I find
> >> >> myself
> >> >> doing the same fixes over and over. List follows.
> >> >>
> >> >> Add weights so data points can be weighted.
> >> >> Use column scaling so condition numbers make more sense.
> >> >> Compute covariance approximation?
> >> >>
> >> >> Unfortunately, the last will require using svd, since there are no
> >> >> linear least squares routines in LAPACK that also compute the
> >> >> covariance, at least none that google knows about.
> >> >>
> >> >> Thoughts?
> >> >
> >> > Maybe make two functions: one that implements the first two items,
> >> > and another that implements the third? I think weights is an
> >> > excellent idea!
> >>
> >> I'd like an lstsq that did less, like not calculating the sum of
> >> squared residuals. That's useful in tight loops. So I also think having
> >> two lstsqs makes sense: one as basic as possible, one with bells. How
> >> does scipy's lstsq fit into all this?
> >
> > I think the computation of the residues is cheap in lstsq. The algorithm
> > used starts by reducing the design matrix to bidiagonal form and reduces
> > the rhs at the same time, so an mxn problem becomes an (n+1)xn problem.
> > Since the reduction is done with orthogonal transformations, the 2-norm
> > of the residual is preserved. That's why the summed square of the
> > residuals is available but not the individual residuals: after the
> > reduction there is only one residual left, and its square is the residue.
>
> That does sound good. But it must take some time. There's indexing,
> array creation, an if statement, and a sum:
>
>        if results['rank'] == n and m > n:
>            resids = sum((transpose(bstar)[n:,:])**2,
>                         axis=0).astype(result_t)
>
> Here are the timings after removing the sum of squared residuals:
>
> >> x = np.random.rand(1000,10)
> >> y = np.random.rand(1000)
> >> timeit np.linalg.lstsq(x, y)
> 1000 loops, best of 3: 369 us per loop
> >> timeit np.linalg.lstsq2(x, y)
> 1000 loops, best of 3: 344 us per loop
> >>
> >> x = np.random.rand(10,2)
> >> y = np.random.rand(10)
> >> timeit np.linalg.lstsq(x, y)
> 10000 loops, best of 3: 102 us per loop
> >> timeit np.linalg.lstsq2(x, y)
> 10000 loops, best of 3: 77 us per loop
>

Now that I've looked through the driver program, I see that you are right.
Also, all the info needed for the covariance is almost available in the
LAPACK driver program. Hmm, it seems the best thing to do here might be to
pull out the lapack_lite driver program, modify it, and make it a
standalone python module that links to either the lapack_lite routines or
the ATLAS versions. But that is more work than just doing things in python
:-\ It does have the added advantage that all the work arrays can be
handled internally.
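
For concreteness, here is a rough pure-python sketch of the sort of thing
under discussion. It is only an illustration, not a settled interface; the
function name, the defaults, and the degrees-of-freedom correction are all
placeholders:

    import numpy as np

    def lstsq_full(a, b, weights=None, rcond=1e-15):
        """Sketch: weighted least squares with column scaling and a
        covariance estimate, all built on the svd. Handles a 1-d
        right hand side only."""
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        m, n = a.shape
        if weights is not None:
            # Weighting: scale the rows of both sides by sqrt(w).
            w = np.sqrt(np.asarray(weights, dtype=float))
            a = a * w[:, np.newaxis]
            b = b * w
        # Column scaling, so the condition number reflects the problem
        # rather than the relative sizes of the columns.
        scale = np.sqrt((a * a).sum(axis=0))
        scale[scale == 0] = 1
        u, s, vt = np.linalg.svd(a / scale, full_matrices=False)
        # Discard small singular values, as lstsq does with rcond.
        sinv = np.zeros_like(s)
        good = s > rcond * s[0]
        sinv[good] = 1 / s[good]
        # Solution in the scaled variables, mapped back to the originals.
        x = np.dot(vt.T * sinv, np.dot(u.T, b)) / scale
        r = b - np.dot(a, x)
        ssr = np.dot(r, r)
        # Covariance of the parameters, sigma**2 * inv(a.T a), with
        # sigma**2 estimated from the (weighted) residuals.
        sigma2 = ssr / (m - n) if m > n else np.nan
        cov = sigma2 * np.dot(vt.T * sinv**2, vt) / np.outer(scale, scale)
        return x, ssr, cov

With weights=None this gives the same solution as the current lstsq in
exact arithmetic; the column scaling only changes which singular values
rcond throws away, not the full-rank answer.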

Chuck

