[scikit-learn] Scaling model selection on a cluster

Vlad Ionescu ionescu.vlad1 at gmail.com
Sun Aug 7 17:25:47 EDT 2016


I copy-pasted the example from the link you gave, only making the search
take longer. I used dask-ssh to set up the worker nodes and a scheduler,
then connected to the scheduler in my code.
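
Roughly, the setup looked like this; the hostnames below are
placeholders, not my real ones:

# Cluster launched beforehand with dask-ssh, along the lines of:
#   dask-ssh --scheduler my_scheduler worker1 worker2 worker3
from distributed import Executor

# Connect to the running scheduler from the client machine.
executor = Executor('my_scheduler:8786')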

Tweaking the n_jobs parameter for the randomized search gives no
performance benefit. The connection to the scheduler seems to work, but
nothing appears to get assigned to the workers, since the code doesn't
scale.

I am using scikit-learn 0.18.dev0.

Any ideas?

Code and results are below. Only the n_jobs value was changed between
executions. I printed an Executor connected to my scheduler, and it
reported 240 cores.
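
The core count check was something like the following, using the
executor from the sketch above (ncores() maps each worker address to
its core count):

# The repr of the executor lists the attached workers.
print(executor)
# Sum per-worker core counts for the cluster total (240 in my case).
print(sum(executor.ncores().values()))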

import distributed.joblib  # importing this registers the 'distributed' joblib backend
from joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.grid_search import RandomizedSearchCV
from sklearn.svm import SVC
import numpy as np

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 100),
    'gamma': np.logspace(-8, 8, 100),
    'tol': np.logspace(-4, -1, 100),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000,
                            verbose=1, n_jobs=200)

with parallel_backend('distributed', scheduler_host='my_scheduler:8786'):
    search.fit(digits.data, digits.target)

Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
[Parallel(n_jobs=200)]: Done   4 tasks      | elapsed:    0.5s
[Parallel(n_jobs=200)]: Done 292 tasks      | elapsed:    6.9s
[Parallel(n_jobs=200)]: Done 800 tasks      | elapsed:   16.1s
[Parallel(n_jobs=200)]: Done 1250 tasks      | elapsed:   24.8s
[Parallel(n_jobs=200)]: Done 1800 tasks      | elapsed:   36.0s
[Parallel(n_jobs=200)]: Done 2450 tasks      | elapsed:   49.0s
[Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed:  1.0min finished

-------------------------------------

Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.5s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    3.7s
[Parallel(n_jobs=20)]: Done 410 tasks      | elapsed:    8.6s
[Parallel(n_jobs=20)]: Done 760 tasks      | elapsed:   16.2s
[Parallel(n_jobs=20)]: Done 1210 tasks      | elapsed:   25.0s
[Parallel(n_jobs=20)]: Done 1760 tasks      | elapsed:   36.2s
[Parallel(n_jobs=20)]: Done 2410 tasks      | elapsed:   48.8s
[Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed:  1.0min finished




On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux <gael.varoquaux at normalesup.org>
wrote:

> Parallel computing in scikit-learn is built on joblib. In the
> development version of scikit-learn, the included joblib can be extended
> with a distributed backend:
> http://distributed.readthedocs.io/en/latest/joblib.html
> that can distribute code on a cluster.
>
> This is still bleeding edge, but this is probably a direction that will
> see more development.