[Numpy-discussion] Interpolation question
Kevin Dunn
kgdunn at gmail.com
Tue Mar 30 16:14:05 EDT 2010
Andrea Gavana <andrea.gavana <at> gmail.com> writes:
>
> Hi Kevin,
>
> On 29 March 2010 01:38, Kevin Dunn wrote:
> >> Message: 5
> >> Date: Sun, 28 Mar 2010 00:24:01 +0000
> >> From: Andrea Gavana <andrea.gavana <at> gmail.com>
> >> Subject: [Numpy-discussion] Interpolation question
> >> To: Discussion of Numerical Python <numpy-discussion <at> scipy.org>
> >> Message-ID:
> >> <d5ff27201003271724o6c82ec75v225d819c84140b46 <at> mail.gmail.com>
> >> Content-Type: text/plain; charset=ISO-8859-1
> >>
> >> Hi All,
> >>
> >> I have an interpolation problem and I am having some difficulties
> >> in tackling it. I hope I can explain myself clearly enough.
> >>
> >> Basically, I have a whole bunch of 3D fluid flow simulations (close to
> >> 1000), and they are a result of different combinations of parameters.
> >> I was planning to use the Radial Basis Functions in scipy, but for the
> >> moment let's assume, to simplify things, that I am dealing only with
> >> one parameter (x). In 1000 simulations, this parameter x has 1000
> >> values, obviously. The problem is, the outcome of every single
> >> simulation is a vector of oil production over time (let's say 40
> >> values per simulation, one per year), and I would like to be able to
> >> interpolate my x parameter (1000 values) against all the simulations
> >> (1000x40) and get an approximating function that, given another x
> >> parameter (of size 1x1) will give me back an interpolated production
> >> profile (of size 1x40).
> >
> > [I posted the following earlier but forgot to change the subject - it
> > appears as a new thread called "NumPy-Discussion Digest, Vol 42, Issue
> > 85" - please ignore that thread]
> >
> > Andrea, may I suggest a different approach to RBF's.
> >
> > Realize that your vector of 40 values for each row in y are not
> > independent of each other (they will be correlated). First build a
> > principal component analysis (PCA) model on this 1000 x 40 matrix and
> > reduce it down to a 1000 x A matrix, called your scores matrix, where
> > A is the number of independent components. A is selected so that it
> > adequately summarizes Y without over-fitting and you will find A <<
> > 40, maybe A = 2 or 3. There are tools, such as cross-validation, that
> > will help select a reasonable value of A.
> >
> > Then you can relate your single column of X to these independent
> > columns in A using a tool such as least squares: one least squares
> > model per column in the scores matrix. This works because each column
> > in the score vector is independent (contains totally orthogonal
> > information) to the others. But I would be surprised if this works
> > well enough, unless A = 1.
> >
> > But it sounds like your don't just have a single column in your
> > X-variables (you hinted that the single column was just for
> > simplification). In that case, I would build a projection to latent
> > structures model (PLS) model that builds a single latent-variable
> > model that simultaneously models the X-matrix, the Y-matrix as well as
> > providing the maximal covariance between these two matrices.
> >
> > If you need some references and an outline of code, then I can readily
> > provide these.
> >
> > This is a standard problem with data from spectroscopic instruments
> > and with batch processes. They produce hundreds, sometimes 1000's of
> > samples per row. PCA and PLS are very effective at summarizing these
> > down to a much smaller number of independent columns, very often just
> > a handful, and relating them (i.e. building a predictive model) to
> > other data matrices.
> >
> > I also just saw the suggestions of others to center the data by
> > subtracting the mean from each column in Y and scaling (by dividing
> > through by the standard deviation). This is a standard data
> > preprocessing step, called autoscaling and makes sense for any data
> > analysis, as you already discovered.
>
> I have got some success by using time-based RBFs interpolations, but I
> am always open to other possible implementations (as the one I am
> using can easily fail for strange combinations of input parameters).
> Unfortunately, my understanding of your explanation is very very
> limited: I am not an expert at all, so it's a bit hard for me to
> translate the mathematical technical stuff in something I can
> understand. If you have an example code (even a very trivial one) for
> me to study so that I can understand what the code is actually doing,
> I would be more than grateful for your help
PCA can be calculated in several different ways. The code below works well
enough in many cases, but it can't handle missing data. If you need that
capability, let me know.
You can read more about PCA here (shameless plug for the course I teach on this
material):
http://stats4.eng.mcmaster.ca/wiki/Latent_variable_methods_and_applications
And here is some Python code for a fake data set and an actual data set. Hope
that gets you started, Kevin.
import numpy as np
use_fake_data = False
if use_fake_data:
N = 1000
K = 40
A = 3 # extract 3 principal components
# Replace this garbage data with your actual, raw data
X_raw = np.random.normal(loc=20.0, scale=3.242, size=(N, K))
else:
import urllib2
import StringIO
# Get a N=460 rows and K=650 column data set of NIR spectra
url = 'http://stats4.eng.mcmaster.ca/datasets/tablet-spectra.csv'
data_string = urllib2.urlopen(url).read()
X_raw = np.genfromtxt(StringIO.StringIO(data_string), delimiter=",")
N, K = X_raw.shape
A = 3 # extract 3 principal components
# Center and scale X (you should check for columns with
# zero variance first)
X = X_raw - X_raw.mean(axis=0)
X = X / X.std(axis=0)
# Verify the centering and scaling
print(X.mean(axis=0)) # Row of zeros
print(X.std(axis=0)) # Row of ones
# Use SVD to calculate the components (can't handle missing data!)
u, d, v = np.linalg.svd(X)
# These are the loadings (direction vectors for the A components)
P = v.T[:, range(0, A)]
# These are the scores, A scores for each observation
T = np.dot(X, P)
# How well did these A scores summarize the original data?
X_hat = np.dot(T, P.T)
residuals = X - X_hat
residual_ssq = np.sum(residuals * residuals)
initial_ssq = np.sum(X * X)
variance_captured = 1.0 - residual_ssq/initial_ssq
print variance_captured
# With the spectral data set, 3 components capture 94% of the
# variance in the original data. Now you work with only these 3
# columns, rather than the K=650 columns from your original data.
> Andrea.
>
> "Imagination Is The Only Weapon In The War Against Reality."
> http://xoomer.alice.it/infinity77/
>
> ==> Never *EVER* use RemovalGroup for your house removal. You'll
> regret it forever.
> http://thedoomedcity.blogspot.com/2010/03/removal-group-nightmare.html <==
>
More information about the NumPy-Discussion
mailing list