[scikit-learn] unclear help file for sklearn.decomposition.pca

Mon Oct 16 13:19:57 EDT 2017

The definition of PCA has a centering step, but no scaling step.
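
As a rough illustration (a NumPy-only sketch, not scikit-learn's actual
implementation; the data and variable names are made up for the example):
PCA is the eigendecomposition of the covariance matrix, and the covariance
subtracts the column means but does not divide by the standard deviations.
So the SVD of the centered data should reproduce the covariance
eigenvalues, while the SVD of the raw or standardized data should not.

```python
import numpy as np

rng = np.random.RandomState(42)
# Illustrative data: columns with a nonzero mean and different scales.
X = rng.rand(10, 4) * np.array([1.0, 5.0, 0.5, 2.0]) + 3.0
n = X.shape[0]

# Reference: eigenvalues of the covariance matrix, largest first.
# np.cov removes the column means but applies no per-column scaling.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

def svd_variances(A):
    """Squared singular values of A divided by (n - 1)."""
    return np.linalg.svd(A, compute_uv=False) ** 2 / (n - 1)

Xc = X - X.mean(axis=0)          # centered only
Xs = Xc / X.std(axis=0, ddof=1)  # centered and scaled to unit variance

centered_matches = np.allclose(svd_variances(Xc), eigvals)
raw_matches = np.allclose(svd_variances(X), eigvals)
scaled_matches = np.allclose(svd_variances(Xs), eigvals)
print(centered_matches, raw_matches, scaled_matches)
```

Only the centered variant should agree with the covariance eigenvalues,
which is the sense in which centering is part of the definition and
scaling is a separate preprocessing choice.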

On 10/16/2017 11:16 AM, Ismael Lemhadri wrote:
> Dear Roman,
> My concern is actually not about not mentioning the scaling but about 
> not mentioning the centering.
> That is, the sklearn PCA removes the mean but it does not mention it 
> in the help file.
> This was quite messy for me to debug, as I expected it to either 1/ 
> center and scale simultaneously, or 2/ neither scale nor center.
> In my opinion, it would be beneficial to make this behavior explicit in 
> the help file.
> Ismael
>
> On Mon, Oct 16, 2017 at 8:02 AM, <scikit-learn-request at python.org> wrote:
>
>
>     Today's Topics:
>
>        1. unclear help file for sklearn.decomposition.pca
>           (Ismael Lemhadri)
>        2. Re: unclear help file for sklearn.decomposition.pca
>           (Roman Yurchak)
>        3. Question about LDA's coef_ attribute (Serafeim Loukas)
>        4. Re: Question about LDA's coef_ attribute (Alexandre Gramfort)
>        5. Re: Question about LDA's coef_ attribute (Serafeim Loukas)
>
>
>     ----------------------------------------------------------------------
>
>     Message: 1
>     Date: Sun, 15 Oct 2017 18:42:56 -0700
>     From: Ismael Lemhadri <lemhadri at stanford.edu>
>     To: scikit-learn at python.org
>     Subject: [scikit-learn] unclear help file for
>             sklearn.decomposition.pca
>     Message-ID:
>     <CANpSPFTgv+Oz7f97dandmrBBayqf_o9w=18oKHCFN0u5DNzj+g at mail.gmail.com>
>     Content-Type: text/plain; charset="utf-8"
>
>     Dear all,
>     The help file for the PCA class is unclear about the preprocessing
>     performed on the data.
>     You can check on line 410 here:
>     https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/decomposition/pca.py#L410
>     that the matrix is centered but NOT scaled before performing the
>     singular value decomposition.
>     However, the help files do not make any mention of it.
>     This is unclear for someone who, like me, just wanted to check that
>     PCA and np.linalg.svd give the same results. In academic settings,
>     students are often asked to compare different methods and to check
>     that they yield the same results. I expect that many students have
>     run into this problem before...
>     Best,
>     Ismael Lemhadri
>
>     ------------------------------
>
>     Message: 2
>     Date: Mon, 16 Oct 2017 15:16:45 +0200
>     From: Roman Yurchak <rth.yurchak at gmail.com>
>     To: Scikit-learn mailing list <scikit-learn at python.org>
>     Subject: Re: [scikit-learn] unclear help file for
>             sklearn.decomposition.pca
>     Message-ID: <b2abdcfd-4736-929e-6304-b93832932043 at gmail.com>
>     Content-Type: text/plain; charset=utf-8; format=flowed
>
>     Ismael,
>
>     As far as I can see, the sklearn.decomposition.PCA documentation
>     doesn't mention scaling at all (except for the whiten parameter,
>     which is post-transformation scaling).
>
>     So since it doesn't mention it, it makes sense that it doesn't do any
>     scaling of the input. Same as np.linalg.svd.
>
>     You can verify that PCA and np.linalg.svd yield the same results, with
>
>     ```
>      >>> import numpy as np
>      >>> from sklearn.decomposition import PCA
>      >>> X = np.random.RandomState(42).rand(10, 4)
>      >>> n_components = 2
>      >>> # PCA centers X internally before computing the SVD
>      >>> PCA(n_components=n_components, svd_solver='full').fit_transform(X)
>     ```
>
>     and
>
>     ```
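>      >>> # np.linalg.svd does not center, so subtract the column means first
>      >>> U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
>      >>> # rows of Vt are the principal axes (signs may differ from PCA's)
>      >>> (X - X.mean(axis=0)).dot(Vt[:n_components].T)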
>     ```
>
>     --
>     Roman
>
>
>
>
>     ------------------------------
>
>     Message: 3
>     Date: Mon, 16 Oct 2017 15:27:48 +0200
>     From: Serafeim Loukas <seralouk at gmail.com>
>     To: scikit-learn at python.org
>     Subject: [scikit-learn] Question about LDA's coef_ attribute
>     Message-ID: <58C6D0DA-9DE5-4EF5-97C1-48159831F5A9 at gmail.com>
>     Content-Type: text/plain; charset="us-ascii"
>
>     Dear Scikit-learn community,
>
>     Since the documentation of the LDA
>     (http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)
>     is not so clear, I would like to ask if the lda.coef_ attribute
>     stores the eigenvectors from the SVD decomposition.
>
>     Thank you in advance,
>     Serafeim
>
>     ------------------------------
>
>     Message: 4
>     Date: Mon, 16 Oct 2017 16:57:52 +0200
>     From: Alexandre Gramfort <alexandre.gramfort at inria.fr>
>     To: Scikit-learn mailing list <scikit-learn at python.org>
>     Subject: Re: [scikit-learn] Question about LDA's coef_ attribute
>     Message-ID:
>     <CADeotZricOQhuHJMmW2Z14cqffEQyndYoxn-OgKAvTMQ7V0Y2g at mail.gmail.com>
>     Content-Type: text/plain; charset="UTF-8"
>
>     No, it stores the direction of the decision function, to match the
>     API of linear models.
>
>     HTH
>     Alex
>
>
>
>     ------------------------------
>
>     Message: 5
>     Date: Mon, 16 Oct 2017 17:02:46 +0200
>     From: Serafeim Loukas <seralouk at gmail.com>
>     To: Scikit-learn mailing list <scikit-learn at python.org>
>     Subject: Re: [scikit-learn] Question about LDA's coef_ attribute
>     Message-ID: <413210D2-56AE-41A4-873F-D171BB36539D at gmail.com>
>     Content-Type: text/plain; charset="us-ascii"
>
>     Dear Alex,
>
>     Thank you for the prompt response.
>
>     Are the eigenvectors stored in some variable?
>     Does the lda.scalings_ attribute contain the eigenvectors?
>
>     Best,
>     Serafeim
>
>
>
>     ------------------------------
>
>     Subject: Digest Footer
>
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>     ------------------------------
>
>     End of scikit-learn Digest, Vol 19, Issue 25
>     ********************************************
