[scikit-learn] How can I tell if I am getting the loading values from a PCA analysis using scikit-learn?

Thu May 4 07:52:00 EDT 2017

First, I apologize in advance if my questions are very basic. I am very new to coding in general, as well as to Principal Component Analysis and scikit-learn. I am trying to finish a project for an internship and have hit a wall, and I am hoping to get help solving this before my deadline.

I have a set of RNA sequences that are evaluated on several parameters. For the sake of simplicity, let's say there are three: the GC content of the RNA sequence's ribosome binding site (RBS), the estimated stability of the RNA sequence's secondary structure (MFE), and the ensemble defect of the RNA (i.e. the number of nucleotides that do not conform to a prescribed secondary structure). A series of functions calculates a value for each RNA sequence on each parameter, and the lower the calculated value, the better suited that RNA sequence is to our experimental purposes.

What I am now trying to do is build a composite score for each RNA sequence from its values on these parameters using PCA. Rather than use PCA for dimension reduction, I want to use it to determine the loading of each parameter on each component. I have been using the following code to calculate the loadings and then use them to create a composite score, which is the sum of the original values multiplied by their corresponding loadings (i.e. the loadings are used as weights on the original values).

from os import path

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# output_folder must already be defined at this point
green = path.join(output_folder, "SequenceScoring.csv")
df = pd.read_csv(green)
X = df.values
X = scale(X)  # standardize each column to zero mean and unit variance
pca = PCA(n_components=3)
X1 = pca.fit_transform(X)  # fits the PCA on X and returns the transformed X
df1 = pd.DataFrame(data=X1, index=range(len(df['MFE of Sequence Complex'])), columns=['Loading MFE of Sequence Complex', 'Loading Percentage of Ensemble Defect of RBS of Trigger-Switch Complex', 'Loading GC Content of RBS Region in Switch'])
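For comparison, here is a minimal sketch of the composite score described above, run on synthetic data rather than the real SequenceScoring.csv. It rests on assumptions that are not in my actual script: that "loadings" means the rows of pca.components_ scaled by the square root of pca.explained_variance_ (one common convention, and possibly not the right one for this project), and that only the first component's loadings are used as weights. The short column names and the random data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical stand-in for SequenceScoring.csv: 50 sequences, 3 parameters.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)),
                  columns=["MFE", "EnsembleDefect", "GC"])

X = scale(df.values)
pca = PCA(n_components=3)

# fit_transform returns one row per sequence (n_samples x n_components).
scores = pca.fit_transform(X)

# Assumed loading convention: eigenvectors scaled by sqrt(eigenvalue).
# Result has one row per parameter (n_features x n_components).
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
loadings_df = pd.DataFrame(loadings, index=df.columns,
                           columns=["PC1", "PC2", "PC3"])

# Composite score: scaled original values weighted by their PC1 loadings.
composite = X @ loadings[:, 0]
```

Note the shapes: the array returned by fit_transform has as many rows as there are sequences, whereas the loadings array has one row per parameter, so only the latter can be labelled by parameter name.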

Although this seems to be the correct procedure, I am not certain that I properly understand the output of X1=pca.fit_transform(X). One source I initially used seemed to clear the matter up (source: Sklearn PCA is pca.components_ the loadings?<https://stackoverflow.com/questions/36380183/sklearn-pca-is-pca-components-the-loadings>), but upon closer inspection, I realized I wasn't sure if I was getting the correct values, which were described as "the result of the projection for each sample into the vector space spanned by the components". Furthermore, loadings are also characterized by the property that their "sums of squares within each component are the eigenvalues (components' variances)" (source: https://stats.stackexchange.com/questions/92499/how-to-interpret-pca-loadings). I checked the eigenvalues of my parameters using:

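import numpy as np
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)
cov_mat = np.cov(X_std.T)  # covariance matrix of the standardized parameters

eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('\nEigenvalues \n%s' % eig_vals)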

I then squared and summed the values in each column produced by X1=pca.fit_transform(X), and found that the sums did not match the eigenvalues at all.
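One way to make this check concrete is sketched below on synthetic data. It assumes the same loading convention as the Cross Validated quote (eigenvectors scaled by the square root of the eigenvalues); the random data and the variable names ss_loadings and ss_scores are hypothetical. Under that assumption, the column sums of squares of the loadings should reproduce the eigenvalues, while the column sums of squares of the fit_transform output should instead equal the eigenvalues times (n_samples - 1).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical correlated data: 200 samples, 3 parameters.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
X = rng.normal(size=(200, 3)) @ mix
X_std = StandardScaler().fit_transform(X)
n_samples = X_std.shape[0]

# Eigenvalues of the covariance matrix, sorted largest first.
# eigvalsh is used because the covariance matrix is symmetric.
eig_vals = np.sort(np.linalg.eigvalsh(np.cov(X_std.T)))[::-1]

pca = PCA(n_components=3)
scores = pca.fit_transform(X_std)

# Assumed loading convention: eigenvectors scaled by sqrt(eigenvalue).
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

ss_loadings = (loadings ** 2).sum(axis=0)  # per-component sum of squares
ss_scores = (scores ** 2).sum(axis=0)      # same, for the fit_transform output

print("eigenvalues:        ", eig_vals)
print("explained_variance_:", pca.explained_variance_)
print("loadings SS:        ", ss_loadings)
print("scores SS / (n - 1):", ss_scores / (n_samples - 1))
```

If all four printed rows agree, that would suggest the mismatch I saw comes from squaring the fit_transform output directly rather than values derived from pca.components_.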

It is worth noting that I understood the term "loading values" as the distance between a certain value and an associated component (so that values that influenced the slope and variance captured by the component more strongly had higher loading values). Am I fundamentally misunderstanding the concept of loading values? Or am I not using the right function from scikit-learn? I have tried to look through the source code for scikit-learn's pca.fit_transform, but I don't have the level of mathematical or coding experience required to understand it.

Thanks so much,

David