[Numpy-discussion] lstsq functionality

Tue Jul 20 22:16:59 EDT 2010

On Tue, Jul 20, 2010 at 10:12 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
> On Tue, Jul 20, 2010 at 6:35 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Mon, Jul 19, 2010 at 10:08 PM, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>>>
>>>
>>> On Mon, Jul 19, 2010 at 9:40 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>>
>>>> On Mon, Jul 19, 2010 at 8:27 PM, Charles R Harris
>>>> <charlesr.harris at gmail.com> wrote:
>>>> >
>>>> >
>>>> > On Mon, Jul 19, 2010 at 9:02 PM, Keith Goodman <kwgoodman at gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> On Mon, Jul 19, 2010 at 6:53 PM, Joshua Holbrook
>>>> >> <josh.holbrook at gmail.com> wrote:
>>>> >> > On Mon, Jul 19, 2010 at 5:50 PM, Charles R Harris
>>>> >> > <charlesr.harris at gmail.com> wrote:
>>>> >> >> Hi All,
>>>> >> >>
>>>> >> >> I'm thinking about adding some functionality to lstsq because I find
>>>> >> >> myself
>>>> >> >> doing the same fixes over and over. List follows.
>>>> >> >>
>>>> >> >> Add weights so data points can be weighted.
>>>> >> >> Use column scaling so condition numbers make more sense.
>>>> >> >> Compute covariance approximation?
>>>> >> >>
>>>> >> >> Unfortunately, the last will require using svd since there no linear
>>>> >> >> least
>>>> >> >> squares routines in LAPACK that also compute the covariance, at
>>>> >> >> least
>>>> >> >> that
>>>> >> >> google knows about.
>>>> >> >>
>>>> >> >> Thoughts?
>>>> >> >
>>>> >> > Maybe make 2 functions--one which implements 1 and 2, and another
>>>> >> > which implements 3? I think weights is an excellent idea!
>>>> >>
>>>> >> I'd like a lstsq that did less, like not calculate the sum of squared
>>>> >> residuals. That's useful in tight loops. So I also think having two
>>>> >> lstsq makes sense. One as basic as possible; one with bells. How does
>>>> >> scipy's lstsq fit into all this?
>>>> >
>>>> > I think the computation of the residues is cheap in lstsq. The
>>>> > algolrithm
>>>> > used starts by reducing the design matrix to bidiagonal from and reduces
>>>> > the
>>>> > rhs at the same time. In other words an mxn problem becomes a (n+1)xn
>>>> > problem. That's why the summed square of residuals is available but not
>>>> > the
>>>> > individual residuals, after the reduction there is only one residual and
>>>> > its
>>>> > square it the residue.
>>>>
>>>> That does sound good. But it must take some time. There's indexing,
>>>> array creation, if statement, summing:
>>>>
>>>>        if results['rank'] == n and m > n:
>>>>            resids = sum((transpose(bstar)[n:,:])**2,
>>>> axis=0).astype(result_t)
>>>>
>>>> Here are the timings after removing the sum of squared residuals:
>>>>
>>>> >> x = np.random.rand(1000,10)
>>>> >> y = np.random.rand(1000)
>>>> >> timeit np.linalg.lstsq(x, y)
>>>> 1000 loops, best of 3: 369 us per loop
>>>> >> timeit np.linalg.lstsq2(x, y)
>>>> 1000 loops, best of 3: 344 us per loop
>>>> >>
>>>> >> x = np.random.rand(10,2)
>>>> >> y = np.random.rand(10)
>>>> >> timeit np.linalg.lstsq(x, y)
>>>> 10000 loops, best of 3: 102 us per loop
>>>> >> timeit np.linalg.lstsq2(x, y)
>>>> 10000 loops, best of 3: 77 us per loop
>>>> _
>>>
>>> Now that I've looked through the driver program I see that you are right.
>>> Also, all the info needed for the covariance is almost available in the
>>> LAPACK driver program. Hmm, it seems that maybe the best thing to do here is
>>> to pull out the lapack_lite driver program, modify it, and make it a
>>> standalone python module that links to either the lapack_lite programs or
>>> the ATLAS versions.  But that is more work than just doing things in python
>>> :-\ It does have the added advantage that all the work arrays can be handled
>>> internally.
>>
>> While taking a quick look at lstsq to see what I could cut, this line
>> caught my eye:
>>
>> bstar[:b.shape[0],:n_rhs] = b.copy()
>
> Even if bstar contained a view of b, the next line in lstsq would take
> care of it:
>
> a, bstar = _fastCopyAndTranspose(t, a, bstar)
>

If a view is never taken, then what does the copy method actually do?
A redundant copy?  The b matrix to *gelsd is in/out, so if the
original b is not going to be written to then taking out the redundant
copy seems like a good idea.

Out of curiosity, is there an explicit way to check if these share memory?

import numpy as np
b = np.empty((1000,2))
c = np.arange(1000)[:,None]
b[:,1:] = c

Then what?

Skipper

>> I thought that array.__setitem__ never gives a view of the right hand
>> side. If that's so then the copy is not needed. That would save some
>> time since b can be large. All unit test pass when I remove the copy,
>> but you know how that goes...
>>
>> I also noticed that "import math" is done inside the lstsq function.
>> Why is that?
>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>