[Numpy-discussion] 2d binning and linear regression
Bruce Southey
bsouthey at gmail.com
Wed Jun 23 09:43:52 EDT 2010
On 06/22/2010 02:58 PM, josef.pktd at gmail.com wrote:
> On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrant<thdurrant at gmail.com> wrote:
>
>>> the basic idea is in "polyfit on multiple data points" on
>>> numpy-disscusion mailing list April 2009
>>>
>>> In this case, calculations have to be done by groups
>>>
>>> subtract mean (this needs to be replaced by group demeaning)
>>> modeldm = model - model.mean()
>>> obsdm = obs - obs.mean()
>>>
>>> xx = np.histogram2d(
>>> xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
>>> bins=(latedges,lonedges))
>>> xy, xedges, yedges = np.histogram2d(lat, lon, weights=obsdm*obsdm,
>>> bins=(latedges,lonedges))
>>>
>>>
>>> slopes = xy/xx # slopes by group
>>>
>>> expand slopes to length of original array
>>> predicted = model - obs * slopes_expanded
>>> ...
>>>
>>> the main point is to get the group functions, for demeaning, ... for
>>> the 2d labels (and get the labels out of histogramdd)
>>>
>>> I'm out of time (off to the airport soon), but I can look into it next
>>> weekend.
>>>
>>> Josef
>>>
>>>
>> Thanks Josef, I will chase up the April list...
>> If I understand what you have done above, this returns the slope of best fit
>> lines forced through the origin, is that right?
>>
> Not if both variables, model and obs, are demeaned first, demeaning
> removes any effect of a constant and only the slope is left over,
> which can be done with the ration xx/xy.
>
> But to get independent intercept per group, the demeaning has to be by group.
>
> What's the size of your problem, how many groups or how many separate
> regressions ?
>
> demeaning by group has setup cost in this case, so the main speed
> benefit would come if you calculate the digitize and label generation,
> that histogram2d does, only ones and reuse it in later calculations.
>
> Using dummy variables as Bruce proposes works very well if there are
> not a very large number of groups, otherwise I think the memory
> requirements and size of array would be very costly in terms of
> performance.
>
> Josef
>
>
There is always a tradeoff of memory vs speed. It is too easy to be too
clever just to find that a more brute force approach is considerably
faster. You want to avoid Python code and array indexing as much as
possible since those can be major speed bumps. Also some approaches have
hidden memory costs as well (histogram2d calls as histogramdd([x,y],...).
If memory is an issue, then obviously you need to decide how to handle
it because there are many ways around it. For example, the biggest
memory usage in my code should be related to the design matrix and
creating the normal equations. At worst (which requires passing by value
rather than reference) it would be two 2-d arrays with the number of
rows being the number of observations and the number of columns being
two times the number of groups. If you have scipy, then sparse matrices
can reduce the memory footprint of the design matrix and could be
faster. There are also ways to construct the design matrix and the
normal equations either observation by observation (essential for very
large data sets) or pieces (uses within group information of numbers of
observations, sum of 'model' and sum of 'model*model'). If the your
boxes have homogeneous variance (which is already being assumed), you
can first divide the data into groups of boxes and then loop over each
group reusing the existing arrays as needed.
Bruce
More information about the NumPy-Discussion
mailing list