[scikit-learn] Scaling model selection on a cluster

Gael Varoquaux gael.varoquaux at normalesup.org
Mon Aug 8 01:24:20 EDT 2016


My guess is that your model evaluations are too fast, and that you are
not getting the benefits of distributed computing because the scheduling
overhead is hiding them.

Anyhow, I don't think that this is ready for prime-time usage. It
probably requires tweaking and understanding the tradeoffs.
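
One quick way to check is to time a single candidate evaluation and compare
it with the per-task cost of shipping work to the cluster. A rough sketch
(not from the thread; it reuses the digits/SVC setup quoted below, with
arbitrary C and gamma values):

    import time

    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer layouts
    from sklearn.datasets import load_digits
    from sklearn.svm import SVC

    digits = load_digits()
    model = SVC(kernel='rbf', C=1.0, gamma=0.001)

    # time one 3-fold evaluation, i.e. one of the 3000 tasks in the search below
    start = time.time()
    cross_val_score(model, digits.data, digits.target, cv=3)
    # if this is only a fraction of a second, sending 3000 such tasks over the
    # network mostly measures scheduling and serialization overhead
    print("one candidate, 3 folds: %.2f s" % (time.time() - start))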

G

On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote:
> I copy-pasted the example from the link you gave, only making the search take
> longer. I used dask-ssh to set up worker nodes and a scheduler, then
> connected to the scheduler in my code.

> Tweaking the n_jobs parameter for the randomized search does not give any
> performance benefit. The connection to the scheduler seems to work, but
> nothing gets assigned to the workers, so the code doesn't scale.

> I am using scikit-learn 0.18.dev0

> Any ideas?

> Code and results are below. Only the n_jobs value was changed between
> executions. I printed an Executor assigned to my scheduler, and it reported 240
> cores.

> import distributed.joblib
> from joblib import Parallel, parallel_backend
> from sklearn.datasets import load_digits
> from sklearn.grid_search import RandomizedSearchCV
> from sklearn.svm import SVC
> import numpy as np

> digits = load_digits()

> param_space = {
>     'C': np.logspace(-6, 6, 100),
>     'gamma': np.logspace(-8, 8, 100),
>     'tol': np.logspace(-4, -1, 100),
>     'class_weight': [None, 'balanced'],
> }

> model = SVC(kernel='rbf')
> search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, verbose=1,
>                             n_jobs=200)

> with parallel_backend('distributed', scheduler_host='my_scheduler:8786'):
>     search.fit(digits.data, digits.target)

> Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
> [Parallel(n_jobs=200)]: Done   4 tasks      | elapsed:    0.5s
> [Parallel(n_jobs=200)]: Done 292 tasks      | elapsed:    6.9s
> [Parallel(n_jobs=200)]: Done 800 tasks      | elapsed:   16.1s
> [Parallel(n_jobs=200)]: Done 1250 tasks      | elapsed:   24.8s
> [Parallel(n_jobs=200)]: Done 1800 tasks      | elapsed:   36.0s
> [Parallel(n_jobs=200)]: Done 2450 tasks      | elapsed:   49.0s
> [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed:  1.0min finished

> -------------------------------------

> Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
> [Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.5s
> [Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    3.7s
> [Parallel(n_jobs=20)]: Done 410 tasks      | elapsed:    8.6s
> [Parallel(n_jobs=20)]: Done 760 tasks      | elapsed:   16.2s
> [Parallel(n_jobs=20)]: Done 1210 tasks      | elapsed:   25.0s
> [Parallel(n_jobs=20)]: Done 1760 tasks      | elapsed:   36.2s
> [Parallel(n_jobs=20)]: Done 2410 tasks      | elapsed:   48.8s
> [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed:  1.0min finished



> On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux <gael.varoquaux at normalesup.org>
> wrote:

>     Parallel computing in scikit-learn is built upon joblib. In the
>     development version of scikit-learn, the included joblib can be extended
>     with a distributed backend:
>     http://distributed.readthedocs.io/en/latest/joblib.html
>     that can distribute code on a cluster.
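
>     A minimal sketch of that pattern (the scheduler address is a
>     placeholder; whether the search's internal joblib calls actually reach
>     the cluster depends on the scikit-learn and distributed versions):

>         import distributed.joblib                # registers the 'distributed' joblib backend
>         from joblib import parallel_backend
>         from sklearn.datasets import load_digits
>         from sklearn.grid_search import GridSearchCV
>         from sklearn.svm import SVC

>         digits = load_digits()
>         search = GridSearchCV(SVC(), {'C': [1, 10, 100]}, cv=3)

>         # inside this block, joblib.Parallel calls should be routed to the cluster
>         with parallel_backend('distributed', scheduler_host='my_scheduler:8786'):
>             search.fit(digits.data, digits.target)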

>     This is still bleeding edge, but this is probably a direction that will
>     see more development.



> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
    Gael Varoquaux
    Researcher, INRIA Parietal
    NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
    Phone:  ++ 33-1-69-08-79-68
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

