[Numpy-discussion] finding eigenvectors etc

Thu Feb 21 10:47:13 EST 2008

Hi all,

>> How are you using the values? How significant are the differences?
>>
>
> i am using these eigenvectors to do PCA on a set of images(of faces).I
> sort the eigenvectors in descending order of their eigenvalues and
> this is multiplied with the  (orig data of some images viz a matrix)to
> obtain a facespace.

I've dealt with similar issues a lot -- performing PCA on data where  
the dimensionality of the data is much greater than the number of  
data points. (Like images.)

In this case, the maximum number of non-trivial eigenvectors of the  
covariance matrix of the data is min(dimension_of_data,  
number_of_data_points), so one always runs into the zero-eigenvalue  
problem; the matrix is thus always ill-conditioned, but that's not a  
problem in these cases.

Nevertheless, if you've got (say) 100 images that are each 100x100  
pixels, to do PCA in the naive way you need to make a 10000x10000  
covariance matrix and then decompose it into 100000 eigenvectors and  
values just to get out the 100 non-trivial ones. That's a lot of  
computation wasted calculating noise! Fortunately, there are better  
ways. One is to perform the SVD on the 100x10000 data matrix. Let the  
centered (mean-subtracted) data matrix be D, then the SVD provides  
matrices U, S, and V'. IIRC, the eigenvalues of D'D (the covariance  
matrix of interest) are then packed along the first dimension of V',  
and the eigenvalues are the square of the values in S.

But! There's an even faster way (from my testing). The trick is that  
instead of calculating the 10000x10000 outer covariance matrix D'D,  
or doing the SVD on D, one can calculate the 100x100 "inner"  
covariance matrix DD' and perform the eigen-decomposition thereon and  
then trivially transform those eigenvalues and vectors to the ones of  
the D'D matrix. This computation is often substantially faster than  
the SVD.

Here's how it works:
Let D, our re-centered data matrix, be of shape (n, k) -- that is, n  
data points in k dimensions.
We know that D has a singular value decomposition D = USV' (no need  
to calculate the SVD though; just enough to know it exists).
 From this, we can rewrite the covariance matrices:
D'D = VS'SV'
DD' = USS'U'

Now, from the SVD, we know that S'S and SS' are diagonal matrices,  
and V and U (and V' and U') form orthogonal bases. One way to write  
the eigen-decomposition of a matrix is A = QLQ', where Q is  
orthogonal and L is diagonal. Since the eigen-decomposition is unique  
(up to a permutation of the columns of Q and L), we know that V must  
therefore contain the eigenvectors of D'D in its columns, and U must  
contain the eigenvectors of DD' in its columns. This is the origin of  
the SVD recipe for that I gave above.

Further, let S_hat, of shape (n, k) be the elementwise reciprocal of  
S (i.e. SS_hat = I of shape (m, n) and S_hatS = I of shape (n, m),  
where I is the identity matrix).
Then, we can solve for U or V in terms of the other:
V = D'US_hat'
U = DVS_hat

So, to get the eigenvectors and eigenvalues of D'D, we just calculate  
DD' and then apply the symmetric eigen-decomposition (symmetric  
version is faster, and DD' is symmetric) to get eigenvectors U and  
eigenvalues L. We know that L=SS', so S_hat = 1/sqrt(L) (where the  
sqrt is taken elementwise, of course). So, the eigenvectors we're  
looking for are:
V = D'US_hat
Then, the principal components (eigenvectors) are in the columns of V  
(packed along the second dimension of V).

Fortunately, I've packaged this all up into a python module for PCA  
that takes care of this all. It's attached.

Zach Pincus

Postdoctoral Fellow, Lab of Dr. Frank Slack
Molecular, Cellular and Developmental Biology
Yale University

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pca.py
Type: text/x-python-script
Size: 4350 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20080221/364296dc/attachment.bin>