On Tue, Jul 20, 2010 at 8:32 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
On Tue, Jul 20, 2010 at 1:08 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jul 19, 2010 at 9:40 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Mon, Jul 19, 2010 at 8:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jul 19, 2010 at 9:02 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Mon, Jul 19, 2010 at 6:53 PM, Joshua Holbrook <josh.holbrook@gmail.com> wrote:
On Mon, Jul 19, 2010 at 5:50 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
> Hi All,
>
> I'm thinking about adding some functionality to lstsq because I find
> myself doing the same fixes over and over. List follows.
>
> Add weights so data points can be weighted.
> Use column scaling so condition numbers make more sense.
> Compute covariance approximation?
>
> Unfortunately, the last will require using svd since there are no least
> squares routines in LAPACK that also compute the covariance, at least
> that google knows about.
>
> Thoughts?
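As a rough sketch of the first two items (not Chuck's actual patch; the function name and interface below are made up for illustration, and b is assumed 1-D), weights and column scaling can both be layered on top of the existing lstsq: scale each row of A and b by sqrt(w_i), divide each column of A by its norm before solving, and undo the column scaling on the returned coefficients.

import numpy as np

def lstsq_weighted_scaled(A, b, weights=None):
    # Hypothetical helper, for illustration only.
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    if weights is not None:
        # Weighted least squares: minimize sum(w_i * r_i**2) by scaling
        # each row of A and each entry of b by sqrt(w_i).
        sw = np.sqrt(np.asarray(weights, dtype=float))
        A = A * sw[:, None]
        b = b * sw
    # Column scaling: solve with unit-norm columns so the singular values
    # (and the condition number derived from them) are not dominated by
    # differences in column magnitude.
    colnorm = np.sqrt((A * A).sum(axis=0))
    colnorm[colnorm == 0] = 1.0
    coef, resid, rank, sv = np.linalg.lstsq(A / colnorm, b, rcond=None)
    # Undo the column scaling on the coefficients.
    return coef / colnorm, resid, rank, sv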
Maybe make two functions: one that implements 1 and 2, and another that implements 3? I think weights is an excellent idea!
I'd like an lstsq that did less, like not calculating the sum of squared residuals. That's useful in tight loops. So I also think having two lstsq functions makes sense: one as basic as possible, one with bells. How does scipy's lstsq fit into all this?
I think the computation of the residues is cheap in lstsq. The algorithm used starts by reducing the design matrix to bidiagonal form and reduces the rhs at the same time. In other words, an mxn problem becomes an (n+1)xn problem. That's why the summed square of residuals is available but not the individual residuals: after the reduction there is only one residual, and its square is the residue.
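As a quick sanity check (a toy example of mine, not from numpy's tests), the residue returned by np.linalg.lstsq matches what you get by forming the residual vector explicitly and summing its squares:

import numpy as np

A = np.random.rand(100, 5)
b = np.random.rand(100)
coef, residues, rank, sv = np.linalg.lstsq(A, b, rcond=None)
explicit = np.sum((b - np.dot(A, coef))**2)
# residues[0] and explicit agree to rounding error
print(residues[0], explicit)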
That does sound good. But it must take some time. There's indexing, array creation, if statement, summing:
if results['rank'] == n and m > n:
    resids = sum((transpose(bstar)[n:,:])**2, axis=0).astype(result_t)
Here are the timings after removing the sum of squared residuals:
x = np.random.rand(1000,10)
y = np.random.rand(1000)
timeit np.linalg.lstsq(x, y)
1000 loops, best of 3: 369 us per loop
timeit np.linalg.lstsq2(x, y)
1000 loops, best of 3: 344 us per loop

x = np.random.rand(10,2)
y = np.random.rand(10)
timeit np.linalg.lstsq(x, y)
10000 loops, best of 3: 102 us per loop
timeit np.linalg.lstsq2(x, y)
10000 loops, best of 3: 77 us per loop
Now that I've looked through the driver program I see that you are right. Also, all the info needed for the covariance is almost available in the LAPACK driver program. Hmm, it seems that maybe the best thing to do here is to pull out the lapack_lite driver program, modify it, and make it a standalone python module that links to either the lapack_lite programs or the ATLAS versions. But that is more work than just doing things in Python :-\ It does have the added advantage that all the work arrays can be handled internally.
IIUC, I tried this with statsmodels earlier this summer, but found the result to be slower than our current implementation (and surely slower than lstsq since we have a bit more Python overhead around the LAPACK function calls). I will have to dig through the revisions and make sure about this.
You modified the C version of dgelsd in lapack_lite? That is the driver I'm talking about.
But for the record, we have weighted least squares (WLS), and you can get the covariance matrix. I think you can do column scaling with our GLS (generalized least squares) class. I haven't seen the terms row-scaling and column-scaling used much in econometrics books; do you have a good reference offhand?
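For what it's worth, this is roughly what that looks like with the current statsmodels API (in 2010 the package was scikits.statsmodels and the spelling has changed since, so treat this as a sketch rather than the exact interface from back then; the data are made up):

import numpy as np
import statsmodels.api as sm

# made-up example data
X = sm.add_constant(np.random.rand(100, 2))
y = np.dot(X, [1.0, 2.0, -0.5]) + np.random.randn(100)
w = np.random.uniform(0.5, 2.0, size=100)   # per-observation weights

res = sm.WLS(y, X, weights=w).fit()
print(res.params)        # weighted least squares coefficients
print(res.cov_params())  # covariance matrix of the estimates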
The dgelsd driver scales the matrix, but only enough to keep the numbers in the working range. Column scaling affects the singular values, so if you are using them to track the condition number of the matrix it can affect the outcome. The deeper thing that is going on is that some of the usefulness of the svd is based on the assumption that the euclidean distance between parameter vectors means something. I just tend to normalize the columns to unit length. It is a bit like using z-scores in statistics, except the column norm is normalized instead of the column variance. I'm sure there are references out there, but a quick google doesn't show anything obvious. Normalization seems to be a bit neglected in the usual classroom notes.

Chuck
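A small made-up illustration of the effect on the reported condition number (the column magnitudes are arbitrary; the fitted values themselves are unchanged by the rescaling):

import numpy as np

A = np.random.rand(50, 3) * np.array([1.0, 1e4, 1e-3])  # badly scaled columns
print(np.linalg.cond(A))            # huge, driven mostly by the column scales

colnorm = np.sqrt((A * A).sum(axis=0))
print(np.linalg.cond(A / colnorm))  # much smaller once columns have unit length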