[scikit-learn] Update or downgrade PCA

Pamphile Roy roy.pamphile at gmail.com
Tue Jul 3 08:39:46 EDT 2018


So yes there is a difference between the two depending on the size of the
matrix.

Following is an output from ipython:

*With a matrix of shape (1000, 500)*
(batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %timeit pod._update(snapshot2.T)
491 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit ipca.partial_fit(snapshot2)
163 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

*With a matrix of shape (1000, 2000)*
(batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %timeit pod._update(snapshot2.T)
4.84 s ± 220 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit ipca.partial_fit(snapshot2)
5.85 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


*With a matrix of shape (1000, 20000)*
(batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %timeit pod._update(snapshot2.T)
3.39 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit ipca.partial_fit(snapshot2)
33.1 s ± 17.7 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

The conclusion is that the batman update seems faster for adding a single
sample when the number of features is larger than the number of samples.
But if you want to add a batch of samples, I found that sklearn seems a bit
faster (38.75 s vs 34.51 s to add 10 samples of 20000 features each).
Note that in this last case, sklearn takes about the same time, ~30 s,
whether it adds a single sample or 10. So depending on how many samples are
to be added, batching can help.

Cheers,

Pamphile

P.S. Following is the code I used (requires batman, available through
conda-forge):

import time

import numpy as np
from batman.pod import Pod
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 1000, 20000

snapshots = np.random.random_sample((n_samples, n_features))
snapshot2 = np.random.random_sample((1, n_features))

# Initial decomposition with batman's POD (snapshots stored column-wise)
pod = Pod([np.zeros(n_features), np.ones(n_features)], None, np.inf, 1, 999)
pod._decompose(snapshots.T)

# Same decomposition with sklearn, keeping 999 components
ipca = IncrementalPCA(999)
ipca.fit(snapshots)

# Check that both methods agree on the singular values
np.allclose(ipca.singular_values_, pod.S)

# Add a single new sample to each decomposition
pod._update(snapshot2.T)
ipca.partial_fit(snapshot2)

np.allclose(ipca.singular_values_[:999], pod.S[:999])

# Add 10 samples: one by one with batman, as a single batch with sklearn
snapshot3 = np.random.random_sample((10, n_features))

itime = time.time()
for snap in snapshot3:
    pod._update(snap[:, None])
print(time.time() - itime)

itime = time.time()
ipca.partial_fit(snapshot3)
print(time.time() - itime)
np.allclose(ipca.singular_values_[:999], pod.S[:999])




2018-07-03 11:06 GMT+02:00 Pamphile Roy <roy.pamphile at gmail.com>:

> I have no idea about the comparison with
> sklearn.decomposition.IncrementalPCA.
> I was not aware of it, but from the code it seems to be a different
> approach.
> I will try to come with some numbers.
>
> Pamphile
>