Finding the PC that captures a specific variable
Hi, I have a question about PCA: how can we determine which factor (principal component) best captures a given variable X? For example, a variable may have a low weight in the first PC but a higher weight in the fifth PC. When I use PCA from scikit-learn, I have to work with the PCs manually, so I may miss that a variable that looks weak in a PC1-PC2 plot can be strong in a PC4-PC5 plot. Any comment on that? Regards, Mahmood
I am not really understanding the question, sorry. Are you looking for the `explained_variance_ratio_` attribute, which gives you the relative values of the eigenvalues associated with the eigenvectors?
-- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/
Hi Mahmood, I believe your question is answered here: https://stackoverflow.com/questions/22984335/recovering-features-names-of-ex...
Hi Mahmood,

There are different pieces of information you can get from PCA:

1. How important a given PC is for reconstructing the entire dataset. This is given by explained_variance_ratio_, as Guillaume suggested.

2. The contribution of each feature to each PC. Remember that a PC is a linear combination of all the features, i.e. PC_1 = alpha_11 * X_1 + alpha_12 * X_2 + ... + alpha_1m * X_m. The alpha_ij are what you're looking for, and they are given in the components_ matrix, which is an n_components x n_features matrix.

Nicolas
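A minimal sketch of reading components_ this way (the toy data and feature names are hypothetical, added for illustration):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = StandardScaler().fit_transform(rng.normal(size=(100, 5)))
    feature_names = ["f1", "f2", "f3", "f4", "f5"]  # hypothetical names

    pca = PCA().fit(X)

    # components_ has shape (n_components, n_features):
    # components_[k, j] is alpha_kj, the weight of feature j on PC k.
    best_pc = np.argmax(np.abs(pca.components_), axis=0)
    for name, k in zip(feature_names, best_pc):
        print(name, "has its largest absolute loading on PC", k + 1)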
Hi,

Thanks for the replies. I read about the available functions in the PCA section. Consider the following code:

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    x = StandardScaler().fit_transform(x)
    pca = PCA()
    principalComponents = pca.fit_transform(x)
    principalDf = pd.DataFrame(data=principalComponents)
    loadings = pca.components_  # shape: (n_components, n_features)
    finalDf = pd.concat([principalDf, pd.DataFrame(targets, columns=['kernel'])], axis=1)
    print("First and second observations\n", finalDf.loc[0:1])
    print("loadings[0:1]\n", loadings[0], loadings[1])
    print("explained_variance_ratio_\n", pca.explained_variance_ratio_)

The output looks like:

    First and second observations
              0         1         2         3         4  kernel
    0  2.959846 -0.184307 -0.100236  0.533735 -0.002227   ELEC1
    1  0.390313  1.805239  0.029688 -0.502359 -0.002350  ELECT2
    loadings[0:1]
    [0.21808984 0.49137412 0.46511098 0.49735819 0.49728754]
    [-0.94878375 -0.01257726 0.29718078 0.07493325 0.07562934]
    explained_variance_ratio_
    [7.80626876e-01 1.79854061e-01 2.50729844e-02 1.44436687e-02 2.40984767e-06]

As you can see, for the two kernels named ELEC1 and ELEC2 there are five PCs, numbered 0 to 4. Based on the numbers in the loadings, I expect that loadings[0], which is the first variable, is best shown on the PC1-PC2 plane (0.49137412, 0.46511098), while loadings[1], which is the second variable, is best shown on the PC0-PC2 plane (-0.94878375, 0.29718078). Is this understanding correct?

I also don't understand what explained_variance_ratio_ is trying to say here.

Regards, Mahmood
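On that last point: explained_variance_ratio_ is defined per component, not per variable, but the two views can be combined. For standardized data, the share of variable j's variance captured by PC k is explained_variance_[k] * components_[k, j]**2. A self-contained sketch (with hypothetical toy data):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = StandardScaler().fit_transform(rng.normal(size=(200, 5)))
    pca = PCA().fit(X)

    # share[k, j]: variance of variable j captured by PC k.
    share = pca.explained_variance_[:, None] * pca.components_ ** 2

    # Each column sums to ~1 (exactly n / (n - 1): PCA uses the n-1 variance
    # convention while StandardScaler divides by n).
    print(share.sum(axis=0))
    # Index of the PC that captures each variable best:
    print(np.argmax(share, axis=0))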
Hi Mahmood,

the information you need is the individual explained variance for each variable/feature. You can get that information from the hoggorm package (Python):

https://github.com/olivertomic/hoggorm
https://hoggorm.readthedocs.io/en/latest/index.html

Here is one of the PCA examples, provided as a Jupyter notebook:

https://github.com/olivertomic/hoggorm/blob/master/examples/PCA/PCA_on_cance...

When you do PCA, you get the information by calling, for example:

    cumCalExplVar_individualVariable = model.X_cumCalExplVar_indVar()

which gives you the cumulative calibrated explained variance for each variable (cell 21 in the notebook), and:

    cumValExplVar_individualVariable = model.X_cumValExplVar_indVar()

which gives you the cumulative validated explained variance for each variable (cell 30 in the notebook).

The component where you see the biggest jump for the variable of interest is the one you are looking for. You could also have a look at the correlation loadings to identify that component.

cheers
Oliver
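A minimal sketch of the hoggorm calls Oliver describes (the constructor arguments and toy data are assumptions based on hoggorm's documentation; see the linked notebook for real usage):

    import numpy as np
    import hoggorm as ho

    rng = np.random.RandomState(0)
    X = rng.normal(size=(50, 5))  # hypothetical data matrix

    # Xstand=True standardizes the columns; cvType=["loo"] requests
    # leave-one-out cross-validation for the validated variances.
    model = ho.nipalsPCA(arrX=X, Xstand=True, cvType=["loo"], numComp=4)

    print(model.X_cumCalExplVar_indVar())  # cumulative calibrated, per variable
    print(model.X_cumValExplVar_indVar())  # cumulative validated, per variable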
Hi Oliver,

Thanks for the suggestion. The package seems handy; I will try it.

Regards, Mahmood
participants (5)
- Guillaume Lemaître
- Julio Antonio Soto
- Mahmood Naderan
- Nicolas Hug
- Oliver Tomic