[scikit-learn] unclear help file for sklearn.decomposition.pca

Roman Yurchak rth.yurchak at gmail.com
Mon Oct 16 09:16:45 EDT 2017


Ismael,

as far as I can see, sklearn.decomposition.PCA doesn't mention scaling at
all (except for the whiten parameter, which is post-transformation scaling).

Since the docstring doesn't mention scaling, it follows that PCA does no
scaling of the input, just like np.linalg.svd.
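If you do want unit-variance features before the decomposition, you can apply the scaling explicitly. A minimal sketch using StandardScaler (the pipeline and variable names here are just for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(10, 4)

# PCA alone only centers the data; StandardScaler additionally
# divides each feature by its standard deviation.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_scaled = scaled_pca.fit_transform(X)

# Without the scaler the projection is generally different.
X_plain = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_scaled), np.abs(X_plain)))
```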

You can verify that PCA and np.linalg.svd yield the same results (up to
the sign of each component) with:

```
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.random.RandomState(42).rand(10, 4)
>>> n_components = 2
>>> PCA(n_components, svd_solver='full').fit_transform(X)
```

and

```
>>> U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
>>> (X - X.mean(axis=0)).dot(Vt[:n_components].T)
```
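Putting the two side by side (note that a raw SVD may differ from PCA in the sign of each component, since scikit-learn applies a deterministic sign convention to the singular vectors):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(42).rand(10, 4)
n_components = 2

# PCA centers the input internally, but does not scale it.
X_pca = PCA(n_components, svd_solver='full').fit_transform(X)

# The same projection from a plain SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_svd = Xc.dot(Vt[:n_components].T)

# Equal up to the sign of each component.
print(np.allclose(np.abs(X_pca), np.abs(X_svd)))  # True
```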

-- 
Roman

On 16/10/17 03:42, Ismael Lemhadri wrote:
> Dear all,
> The help file for the PCA class is unclear about the preprocessing
> performed on the data.
> You can check on line 410 here:
> https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/decomposition/pca.py#L410
> that the matrix is centered but NOT scaled, before performing the
> singular value decomposition.
> However, the help files do not make any mention of it.
> This is confusing for someone who, like me, just wanted to check that
> PCA and np.linalg.svd give the same results. In academic settings,
> students are often asked to compare different methods and to check that
> they yield the same results. I expect that many students have confronted
> this problem before...
> Best,
> Ismael Lemhadri
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
