[scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

Mon Oct 16 14:41:26 EDT 2017

Your document says:

> This data has already been pre-processed so that each of the features
> x_1 and x_2 have about the same mean (zero) and variance.

In other words, that reference does assume centering: it is done before
the eigendecomposition.

Check the Wikipedia article
https://en.wikipedia.org/wiki/Principal_component_analysis - it says:

> To find the axes of the ellipsoid, we must first subtract the mean of
each variable from the dataset to center the data around the origin.
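
As a quick check of why the centering step matters, here is a minimal
sketch, assuming NumPy and scikit-learn are installed. The toy data and the
helper names top_directions and same_up_to_sign are made up for this
illustration. The principal directions from PCA should agree (up to sign)
with the right singular vectors of the centered data, and should not agree
with those of the raw data when the mean is far from zero.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (an assumption for illustration): most variance along axis 0,
# but a large non-zero mean along axis 1.
rng = np.random.RandomState(0)
X = rng.randn(200, 3) * np.array([3.0, 1.0, 0.5]) + np.array([0.0, 10.0, 0.0])

pca = PCA(n_components=2, svd_solver="full").fit(X)


def top_directions(A, k):
    """Return the first k right singular vectors of A as rows."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k]


def same_up_to_sign(A, B):
    """True if each row of A equals the matching row of B up to a sign flip."""
    # Rows are unit vectors, so |a . b| == 1 means the same direction.
    return bool(np.allclose(np.abs(np.sum(A * B, axis=1)), 1.0))


centered_match = same_up_to_sign(top_directions(X - X.mean(axis=0), 2),
                                 pca.components_)
raw_match = same_up_to_sign(top_directions(X, 2), pca.components_)

print("centered SVD matches PCA:", centered_match)
print("raw SVD matches PCA:     ", raw_match)
```

On the uncentered data the leading singular vector mostly points towards the
mean rather than along the direction of largest variance.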

More intuitively: PCA diagonalizes the empirical covariance matrix. The
covariance matrix is the matrix of centered second-order moments, so to
obtain it you have to center the data.
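
To make this concrete, here is a small sketch, again assuming NumPy and
scikit-learn are available; the random toy matrix and the variable names are
only illustrative. It builds the covariance matrix from centered data,
compares its eigenvalues with the variances reported by PCA, and checks that
the PCA components diagonalize it.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative toy data: 10 samples, 4 features.
X = np.random.RandomState(42).rand(10, 4)

# Empirical covariance: centered second-order moments.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (X.shape[0] - 1)

# Eigendecomposition of the symmetric covariance matrix.
# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA(svd_solver="full").fit(X)

# The eigenvalues are the variances along the principal directions.
variances_match = bool(np.allclose(eigvals, pca.explained_variance_))

# In the basis of the PCA components the covariance matrix is diagonal.
D = pca.components_ @ cov @ pca.components_.T
is_diagonal = bool(np.allclose(D, np.diag(pca.explained_variance_)))

print(variances_match, is_diagonal)
```

The manual covariance above is the same matrix that np.cov(X, rowvar=False)
returns, since both divide by n_samples - 1.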

Hope this helps.
Michael

On Mon, Oct 16, 2017 at 11:27 AM, Ismael Lemhadri <lemhadri at stanford.edu>
wrote:

> @Andreas Muller:
> My references do not assume centering, e.g.
> http://ufldl.stanford.edu/wiki/index.php/PCA
> any reference?
>
>
>
> On Mon, Oct 16, 2017 at 10:20 AM, <scikit-learn-request at python.org> wrote:
>
>> Send scikit-learn mailing list submissions to
>>         scikit-learn at python.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         https://mail.python.org/mailman/listinfo/scikit-learn
>> or, via email, send a message with subject or body 'help' to
>>         scikit-learn-request at python.org
>>
>> You can reach the person managing the list at
>>         scikit-learn-owner at python.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of scikit-learn digest..."
>>
>>
>> Today's Topics:
>>
>>    1. Re: unclear help file for sklearn.decomposition.pca
>>       (Andreas Mueller)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Mon, 16 Oct 2017 13:19:57 -0400
>> From: Andreas Mueller <t3kcit at gmail.com>
>> To: scikit-learn at python.org
>> Subject: Re: [scikit-learn] unclear help file for
>>         sklearn.decomposition.pca
>> Message-ID: <04fc445c-d8f3-a3a9-4ab2-0535826a2d03 at gmail.com>
>> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>>
>> The definition of PCA has a centering step, but no scaling step.
>>
>> On 10/16/2017 11:16 AM, Ismael Lemhadri wrote:
>> > Dear Roman,
>> > My concern is actually not about not mentioning the scaling but about
>> > not mentioning the centering.
>> > That is, the sklearn PCA removes the mean but it does not mention it
>> > in the help file.
>> > This was quite messy for me to debug, as I expected it to either 1/
>> > center and scale simultaneously or 2/ neither scale nor center.
>> > It would be beneficial to make this behavior explicit in the help file,
>> > in my opinion.
>> > Ismael
>> >
>> > On Mon, Oct 16, 2017 at 8:02 AM, <scikit-learn-request at python.org
>> > <mailto:scikit-learn-request at python.org>> wrote:
>> >
>> >     Today's Topics:
>> >
>> >        1. unclear help file for sklearn.decomposition.pca
>> >           (Ismael Lemhadri)
>> >        2. Re: unclear help file for sklearn.decomposition.pca
>> >           (Roman Yurchak)
>> >        3. Question about LDA's coef_ attribute (Serafeim Loukas)
>> >        4. Re: Question about LDA's coef_ attribute (Alexandre Gramfort)
>> >        5. Re: Question about LDA's coef_ attribute (Serafeim Loukas)
>> >
>> >
>> >     ----------------------------------------------------------------------
>> >
>> >     Message: 1
>> >     Date: Sun, 15 Oct 2017 18:42:56 -0700
>> >     From: Ismael Lemhadri <lemhadri at stanford.edu>
>> >     To: scikit-learn at python.org
>> >     Subject: [scikit-learn] unclear help file for
>> >             sklearn.decomposition.pca
>> >     Message-ID:
>> >     <CANpSPFTgv+Oz7f97dandmrBBayqf_o9w=18oKHCFN0u5DNzj+g at mail.gmail.com>
>> >     Content-Type: text/plain; charset="utf-8"
>> >
>> >     Dear all,
>> >     The help file for the PCA class is unclear about the preprocessing
>> >     performed on the data.
>> >     You can check on line 410 here:
>> >     https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/decomposition/pca.py#L410
>> >     that the matrix is centered but NOT scaled before performing the
>> >     singular value decomposition.
>> >     However, the help files do not make any mention of it.
>> >     This is unclear for someone who, like me, just wanted to check
>> >     that PCA and np.linalg.svd give the same results. In academic
>> >     settings, students are often asked to compare different methods and
>> >     to check that they yield the same results. I expect that many
>> >     students have run into this problem before...
>> >     Best,
>> >     Ismael Lemhadri
>> >
>> >     ------------------------------
>> >
>> >     Message: 2
>> >     Date: Mon, 16 Oct 2017 15:16:45 +0200
>> >     From: Roman Yurchak <rth.yurchak at gmail.com>
>> >     To: Scikit-learn mailing list <scikit-learn at python.org>
>> >     Subject: Re: [scikit-learn] unclear help file for
>> >             sklearn.decomposition.pca
>> >     Message-ID: <b2abdcfd-4736-929e-6304-b93832932043 at gmail.com>
>> >     Content-Type: text/plain; charset=utf-8; format=flowed
>> >
>> >     Ismael,
>> >
>> >     as far as I saw, the sklearn.decomposition.PCA documentation doesn't
>> >     mention scaling at all (except for the whiten parameter, which is
>> >     post-transformation scaling).
>> >
>> >     So since it doesn't mention it, it makes sense that it doesn't do any
>> >     scaling of the input. Same as np.linalg.svd.
>> >
>> >     You can verify that PCA and np.linalg.svd yield the same results with
>> >
>> >     ```
>> >     >>> import numpy as np
>> >     >>> from sklearn.decomposition import PCA
>> >     >>> import numpy.linalg
>> >     >>> X = np.random.RandomState(42).rand(10, 4)
>> >     >>> n_components = 2
>> >     >>> PCA(n_components, svd_solver='full').fit_transform(X)
>> >     ```
>> >
>> >     and
>> >
>> >     ```
>> >     >>> Xc = X - X.mean(axis=0)  # center each column, as PCA does
>> >     >>> U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
>> >     >>> Xc.dot(Vt[:n_components].T)  # may differ from PCA by a per-column sign
>> >     ```
>> >
>> >     --
>> >     Roman
>> >
>> >
>> >     ------------------------------
>> >
>> >     Message: 3
>> >     Date: Mon, 16 Oct 2017 15:27:48 +0200
>> >     From: Serafeim Loukas <seralouk at gmail.com>
>> >     To: scikit-learn at python.org
>> >     Subject: [scikit-learn] Question about LDA's coef_ attribute
>> >     Message-ID: <58C6D0DA-9DE5-4EF5-97C1-48159831F5A9 at gmail.com>
>> >     Content-Type: text/plain; charset="us-ascii"
>> >
>> >     Dear Scikit-learn community,
>> >
>> >     Since the documentation of the LDA
>> >     (http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)
>> >     is not so clear, I would like to ask if the lda.coef_ attribute
>> >     stores the eigenvectors from the SVD decomposition.
>> >
>> >     Thank you in advance,
>> >     Serafeim
>> >
>> >     ------------------------------
>> >
>> >     Message: 4
>> >     Date: Mon, 16 Oct 2017 16:57:52 +0200
>> >     From: Alexandre Gramfort <alexandre.gramfort at inria.fr>
>> >     To: Scikit-learn mailing list <scikit-learn at python.org>
>> >     Subject: Re: [scikit-learn] Question about LDA's coef_ attribute
>> >     Message-ID:
>> >     <CADeotZricOQhuHJMmW2Z14cqffEQyndYoxn-OgKAvTMQ7V0Y2g at mail.gmail.com>
>> >     Content-Type: text/plain; charset="UTF-8"
>> >
>> >     No, it stores the direction of the decision function, to match the
>> >     API of linear models.
>> >
>> >     HTH
>> >     Alex
>> >
>> >
>> >     ------------------------------
>> >
>> >     Message: 5
>> >     Date: Mon, 16 Oct 2017 17:02:46 +0200
>> >     From: Serafeim Loukas <seralouk at gmail.com>
>> >     To: Scikit-learn mailing list <scikit-learn at python.org>
>> >     Subject: Re: [scikit-learn] Question about LDA's coef_ attribute
>> >     Message-ID: <413210D2-56AE-41A4-873F-D171BB36539D at gmail.com>
>> >     Content-Type: text/plain; charset="us-ascii"
>> >
>> >     Dear Alex,
>> >
>> >     Thank you for the prompt response.
>> >
>> >     Are the eigenvectors stored in some variable?
>> >     Does the lda.scalings_ attribute contain the eigenvectors?
>> >
>> >     Best,
>> >     Serafeim
>> >
>> >     ------------------------------
>> >
>> >     Subject: Digest Footer
>> >
>> >     _______________________________________________
>> >     scikit-learn mailing list
>> >     scikit-learn at python.org
>> >     https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> >     ------------------------------
>> >
>> >     End of scikit-learn Digest, Vol 19, Issue 25
>> >     ********************************************
>> >
>> ------------------------------
>>
>>
>> End of scikit-learn Digest, Vol 19, Issue 28
>> ********************************************
>>
>