On Tue, Jul 20, 2010 at 8:32 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
On Tue, Jul 20, 2010 at 1:08 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jul 19, 2010 at 9:40 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Mon, Jul 19, 2010 at 8:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jul 19, 2010 at 9:02 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Mon, Jul 19, 2010 at 6:53 PM, Joshua Holbrook <josh.holbrook@gmail.com> wrote:
On Mon, Jul 19, 2010 at 5:50 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
> Hi All,
>
> I'm thinking about adding some functionality to lstsq because I find
> myself doing the same fixes over and over. List follows.
>
> Add weights so data points can be weighted.
> Use column scaling so condition numbers make more sense.
> Compute covariance approximation?
>
> Unfortunately, the last will require using svd since there are no least
> squares routines in LAPACK that also compute the covariance, at least
> that google knows about.
>
> Thoughts?
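As a rough sketch of the first two items (not Chuck's actual patch; the function name and interface below are made up for illustration, and b is assumed 1-D), weights and column scaling can both be layered on top of the existing lstsq: scale each row of A and b by sqrt(w_i), divide each column of A by its norm before solving, and undo the column scaling on the returned coefficients.

import numpy as np

def lstsq_weighted_scaled(A, b, weights=None):
    # Hypothetical helper, for illustration only.
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    if weights is not None:
        # Weighted least squares: minimize sum(w_i * r_i**2) by scaling
        # each row of A and each entry of b by sqrt(w_i).
        sw = np.sqrt(np.asarray(weights, dtype=float))
        A = A * sw[:, None]
        b = b * sw
    # Column scaling: solve with unit-norm columns so the singular values
    # (and the condition number derived from them) are not dominated by
    # differences in column magnitude.
    colnorm = np.sqrt((A * A).sum(axis=0))
    colnorm[colnorm == 0] = 1.0
    coef, resid, rank, sv = np.linalg.lstsq(A / colnorm, b, rcond=None)
    # Undo the column scaling on the coefficients.
    return coef / colnorm, resid, rank, sv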
Maybe make two functions: one that implements 1 and 2, and another that implements 3? I think weights is an excellent idea!
I'd like an lstsq that did less, like not calculating the sum of squared residuals. That's useful in tight loops. So I also think having two lstsq functions makes sense: one as basic as possible, one with bells. How does scipy's lstsq fit into all this?
I think the computation of the residues is cheap in lstsq. The algorithm used starts by reducing the design matrix to bidiagonal form and reduces the rhs at the same time. In other words, an mxn problem becomes an (n+1)xn problem. That's why the summed square of residuals is available but not the individual residuals: after the reduction there is only one residual, and its square is the residue.
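As a quick sanity check (a toy example of mine, not from numpy's tests), the residue returned by np.linalg.lstsq matches what you get by forming the residual vector explicitly and summing its squares:

import numpy as np

A = np.random.rand(100, 5)
b = np.random.rand(100)
coef, residues, rank, sv = np.linalg.lstsq(A, b, rcond=None)
explicit = np.sum((b - np.dot(A, coef))**2)
# residues[0] and explicit agree to rounding error
print(residues[0], explicit)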
That does sound good. But it must take some time. There's indexing, array creation, if statement, summing:
if results['rank'] == n and m > n:
    resids = sum((transpose(bstar)[n:,:])**2, axis=0).astype(result_t)
Here are the timings after removing the sum of squared residuals:
x = np.random.rand(1000,10)
y = np.random.rand(1000)
timeit np.linalg.lstsq(x, y)
1000 loops, best of 3: 369 us per loop
timeit np.linalg.lstsq2(x, y)
1000 loops, best of 3: 344 us per loop

x = np.random.rand(10,2)
y = np.random.rand(10)
timeit np.linalg.lstsq(x, y)
10000 loops, best of 3: 102 us per loop
timeit np.linalg.lstsq2(x, y)
10000 loops, best of 3: 77 us per loop
Now that I've looked through the driver program I see that you are right. Also, all the info needed for the covariance is almost available in the LAPACK driver program. Hmm, it seems that maybe the best thing to do here is to pull out the lapack_lite driver program, modify it, and make it a standalone python module that links to either the lapack_lite programs or the ATLAS versions. But that is more work than just doing things in Python :-\ It does have the added advantage that all the work arrays can be handled internally.
IIUC, I tried this with statsmodels earlier this summer, but found the result to be slower than our current implementation (and surely slower than lstsq since we have a bit more Python overhead around the LAPACK function calls). I will have to dig through the revisions and make sure about this.
You modified the C version of dgelsd in lapack_lite? That is the driver I'm talking about.
But for the record, we have weighted least squares (WLS), and you can get the covariance matrix. I think you can do column scaling with our GLS (generalized least squares) class. I haven't seen the terms row-scaling and column-scaling used much in econometrics books; do you have a good reference offhand?
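For what it's worth, this is roughly what that looks like with the current statsmodels API (in 2010 the package was scikits.statsmodels and the spelling has changed since, so treat this as a sketch rather than the exact interface from back then; the data are made up):

import numpy as np
import statsmodels.api as sm

# made-up example data
X = sm.add_constant(np.random.rand(100, 2))
y = np.dot(X, [1.0, 2.0, -0.5]) + np.random.randn(100)
w = np.random.uniform(0.5, 2.0, size=100)   # per-observation weights

res = sm.WLS(y, X, weights=w).fit()
print(res.params)        # weighted least squares coefficients
print(res.cov_params())  # covariance matrix of the estimates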
The dgelsd driver scales the matrix, but only enough to keep the numbers in the working range. Column scaling affects the singular values, so if you are using them to track the condition number of the matrix it can affect the outcome. The deeper thing that is going on is that some of the usefulness of the svd is based on the assumption that the euclidean distance between parameter vectors means something. I just tend to normalize the columns to unit length. It is a bit like using z-scores in statistics, except the column norm is normalized instead of the column variance. I'm sure there are references out there, but a quick google doesn't show anything obvious. Normalization seems to be a bit neglected in the usual classroom notes.

Chuck
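A small made-up illustration of the effect on the reported condition number (the column magnitudes are arbitrary; the fitted values themselves are unchanged by the rescaling):

import numpy as np

A = np.random.rand(50, 3) * np.array([1.0, 1e4, 1e-3])  # badly scaled columns
print(np.linalg.cond(A))            # huge, driven mostly by the column scales

colnorm = np.sqrt((A * A).sum(axis=0))
print(np.linalg.cond(A / colnorm))  # much smaller once columns have unit length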