[Numpy-discussion] 2d binning and linear regression

Wed Jun 23 09:43:52 EDT 2010

On 06/22/2010 02:58 PM, josef.pktd at gmail.com wrote:
> On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrant<thdurrant at gmail.com>  wrote:
>    
>>> the basic idea is in "polyfit  on multiple data points" on
>>> numpy-disscusion mailing list April 2009
>>>
>>> In this case, calculations have to be done by groups
>>>
>>> subtract mean (this needs to be replaced by group demeaning)
>>> modeldm = model - model.mean()
>>> obsdm = obs - obs.mean()
>>>
>>> xx = np.histogram2d(
>>> xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
>>>       bins=(latedges,lonedges))
>>> xy, xedges, yedges = np.histogram2d(lat, lon, weights=obsdm*obsdm,
>>>       bins=(latedges,lonedges))
>>>
>>>
>>> slopes = xy/xx  # slopes by group
>>>
>>> expand slopes to length of original array
>>> predicted = model - obs * slopes_expanded
>>> ...
>>>
>>> the main point is to get the group functions, for demeaning, ... for
>>> the 2d labels (and get the labels out of histogramdd)
>>>
>>> I'm out of time (off to the airport soon), but I can look into it next
>>> weekend.
>>>
>>> Josef
>>>
>>>        
>> Thanks Josef, I will chase up the April list...
>> If I understand what you have done above, this returns the slope of best fit
>> lines forced through the origin, is that right?
>>      
> Not if both variables, model and obs, are demeaned first, demeaning
> removes any effect of a constant and only the slope is left over,
> which can be done with the ration xx/xy.
>
> But to get independent intercept per group, the demeaning has to be by group.
>
> What's the size of your problem, how many groups or how many separate
> regressions ?
>
> demeaning by group has setup cost in this case, so the main speed
> benefit would come if you calculate the digitize and label generation,
> that histogram2d does, only ones and reuse it in later calculations.
>
> Using dummy variables as Bruce proposes works very well if there are
> not a very large number of groups, otherwise I think the memory
> requirements and size of array would be very costly in terms of
> performance.
>
> Josef
>
>    
There is always a tradeoff of memory vs speed. It is too easy to be too 
clever just to find that a more brute force approach is considerably 
faster. You want to avoid Python code and array indexing as much as 
possible since those can be major speed bumps. Also some approaches have 
hidden memory costs as well (histogram2d calls as histogramdd([x,y],...).

If memory is an issue, then obviously you need to decide how to handle 
it because there are many ways around it. For example, the biggest 
memory usage in my code should be related to the design matrix and 
creating the normal equations. At worst (which requires passing by value 
rather than reference) it would be two 2-d arrays with the number of 
rows being the number of observations and the number of columns being 
two times the number of groups. If you have scipy, then sparse matrices 
can reduce the memory footprint of the design matrix and could be 
faster. There are also ways to construct the design matrix and the 
normal equations either observation by observation (essential for very 
large data sets) or pieces (uses within group information of numbers of 
observations, sum of 'model' and sum of 'model*model'). If the your 
boxes have homogeneous variance (which is already being assumed), you 
can first divide the data into groups of boxes and then loop over each 
group reusing the existing arrays as needed.

Bruce