[Numpy-discussion] 2d binning and linear regression

Bruce Southey bsouthey at gmail.com
Mon Jun 21 10:38:08 EDT 2010


On 06/20/2010 03:24 AM, Tom Durrant wrote:
> Hi All,
>
> I have a problem involving lat/lon data.  Basically, I am evaluating 
> numerical weather model data against satellite data, and trying to 
> produce gridded plots of various statistics.  There are various steps 
> involved with this, but basically, I get to the point where I have 
> four arrays of the same length, containing lat, lon, model and 
> observed values respectively along the path of the sattelite.
>
> eg:
>
> lat       =  [   50.32   50.78    51.24    51.82   52.55    53.15   
>  53.75    54.28   54.79   55.16  ... ]
> lon      =  [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35 
>  198.15  198.94 199.53  ... ]
> obs     =  [    1.42      1.35      1.55     1.50      1.59     1.76   
>   2.15       1.90     1.55    0.73  ... ]
> model  = [     1.67      1.68     1.70      1.79     2.04     2.36     
>  2.53      2.38      2.149   1.57 ... ]
>
> I then want to calculate statistics based on bins of say 2 X 2 degree 
> boxes. These arrays are very large, on the order of 10^6. For ease of 
> explanation, I will focus initially on bias.
>
> My first approach was to use loops through lat/lon boxes and use 
> np.where statements to extract all the values of the model and 
> observations within the given box, and calculate the bias as the mean 
> of the difference.  This was obviously very slow.
>
> histogram2d provided a FAR better way to do this.  i.e.
>
>
> import numpy as np
>
> latedges=np.arange(-90,90,2)
> lonedges=np.arange(0,360,2)
>
> diff = model-obs
> grid_N, xedges, yedges = np.histogram2d(lat, lon],
>                 bins=(latedges,lonedges))
> grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
>                 bins=(latedges,lonedges))
> grid_bias = grid_bias_sum/grid_N
>
>
> I now want to determine the the linear regression coefficients for 
> each each box after fitting a least squares linear regression to the 
> data in each bin.  I have been looking at using np.digitize to extract 
> the bin indeces, but haven't had much success trying to do this in two 
> dimensions.  I am back to looping through the lat and lon box values, 
> using np.where to extract the observations and model data within that 
> box, and using np.polyfit to obtain the regression coefficients.  This 
> is, of course, very slow.
>
> Can anyone advise a smarter, vectorized way to do this?  Any thoughts 
> would be greatly appreciated.
>
> Thanks in advance
> Tom
>
>
>
What exactly are trying to fit because it is rather bad practice to fit 
a model to some summarized data as you lose the uncertainty in the 
original data?
If you define your boxes, you can loop through directly on each box and 
even fit the equation:

model=mu +beta1*obs

The extension is to fit the larger equation:
model=mu + boxes + beta1*obs + beta2*obs*boxes

where your boxes is a indicator or dummy variable for each box.
Since you are only interested in the box by model term, you probably can 
use this type of model
model=mu + boxes + beta2*obs*boxes

However, these models assume that the residual variance is identical for 
all boxes. (That is solved by a mixed model approach.)

Bruce









More information about the NumPy-Discussion mailing list