[scikit-learn] Scikit Learn in a Cray computer

Olivier Grisel olivier.grisel at ensta.org
Sat Jun 29 02:43:37 EDT 2019


You have to use a dedicated framework to distribute the computation on a
cluster like your Cray system.

You can use MPI, or Dask with dask-jobqueue, but you also need to run
parallel algorithms that are efficient in a distributed setting where
communication between worker nodes is expensive.

I am not sure that the DBSCAN implementation in scikit-learn would benefit
much from naively running in distributed mode.
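
For instance, here is a minimal sketch of the dask-jobqueue route
(assuming a PBS scheduler; the queue name and resource figures are
placeholders), routing scikit-learn's joblib-based parallelism to the
Dask workers:

import numpy as np
from dask.distributed import Client
from dask_jobqueue import PBSCluster
from joblib import parallel_backend
from sklearn.cluster import DBSCAN

# Ask PBS for worker jobs (resource values are placeholders).
cluster = PBSCluster(cores=1, memory="4GB", queue="workq")
cluster.scale(jobs=4)  # launch 4 single-core worker jobs
client = Client(cluster)

X = np.random.rand(10000, 2)  # placeholder data

# Route the joblib-based parallelism inside scikit-learn through Dask.
# Only the joblib-parallel parts of DBSCAN (the neighbor queries)
# benefit, and communication costs can easily dominate any speed-up.
with parallel_backend("dask"):
    labels = DBSCAN(eps=0.05, min_samples=5, n_jobs=-1).fit_predict(X)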

On Fri, Jun 28, 2019 at 22:06, Mauricio Reis <reismc at ime.eb.br> wrote:

> Sorry, but just now I reread your answer more closely.
>
> It seems that the "n_jobs" parameter of the DBSCAN routine brings no
> performance benefit. If I want to improve the performance of the
> DBSCAN routine, I will have to redesign the solution to use MPI
> resources.
>
> Is that correct?
>
> ---
> Ats.,
> Mauricio Reis
>
> On 28/06/2019 16:47, Mauricio Reis wrote:
> > My laptop has an Intel i7 processor with 4 cores. When I run the
> > program on Windows 10, the "joblib.cpu_count()" routine returns "4".
> > Here, the same test I had run on the Cray computer showed a 10%
> > increase in the processing time of the DBSCAN routine when I used
> > "n_jobs = 4", compared to the processing time of that routine
> > without this parameter. Do you know what causes the longer
> > processing time when I use "n_jobs = 4" on my laptop?
> >
> > ---
> > Ats.,
> > Mauricio Reis
> >
> > On 28/06/2019 06:29, Brown J.B. via scikit-learn wrote:
> >>> where you can see "ncpus = 1" (I still do not know why 4 lines were
> >>> printed -
> >>>
> >>> (total of 40 nodes) and each node has 1 CPU and 1 GPU!
> >>
> >>> #PBS -l select=1:ncpus=8:mpiprocs=8
> >>> aprun -n 4 p.sh ./ncpus.py
> >>
> >> You can request 8 CPUs from a job scheduler, but if each node the
> >> script runs on contains only one virtual/physical core, then
> >> cpu_count() will return 1.
> >> If that CPU supports multi-threading, you would typically get 2.
> >>
> >> For example, on my workstation:
> >> `--> egrep "processor|model name|core id" /proc/cpuinfo
> >> processor : 0
> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
> >> core id : 0
> >> processor : 1
> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
> >> core id : 1
> >> processor : 2
> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
> >> core id : 0
> >> processor : 3
> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
> >> core id : 1
> >> `--> python3 -c "from sklearn.externals import joblib;
> >> print(joblib.cpu_count())"
> >> 4
> >>
> >> It seems that in this situation, if you want to parallelize
> >> *independent* sklearn calculations (e.g., changing the dataset or
> >> random seed), you'll request the MPI processes from PBS as you have,
> >> but you'll need to place the sklearn computations in a function and
> >> then take care of distributing that function call across the MPI
> >> processes, as in the sketch below.
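> >>
> >> For instance, a minimal sketch using mpi4py, with one independent
> >> run per rank (the data and the DBSCAN parameters are placeholders):
> >>
> >> from mpi4py import MPI
> >> import numpy as np
> >> from sklearn.cluster import DBSCAN
> >>
> >> comm = MPI.COMM_WORLD
> >> rank = comm.Get_rank()
> >>
> >> # Each MPI process performs one independent run (here: one seed).
> >> rng = np.random.RandomState(rank)
> >> X = rng.rand(10000, 2)  # placeholder for your per-run dataset
> >> labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)
> >>
> >> # Collect the per-process results on rank 0.
> >> results = comm.gather(labels, root=0)
> >> if rank == 0:
> >>     print([np.unique(r).size for r in results])
> >>
> >> Launched with something like "aprun -n 4 python run_dbscan.py"
> >> (script name hypothetical), each rank clusters its own dataset and
> >> only the final labels travel over the network.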
> >>
> >> Then again, if the runs are independent, it's a lot easier to write a
> >> for loop in a shell script that changes the dataset/seed and submits
> >> each run to the job scheduler, letting it take care of the parallel
> >> distribution.
> >> (I do this when performing 10+ independent runs of sklearn modeling,
> >> where models use multiple threads during calculations; in my case,
> >> SLURM then takes care of finding the available nodes to distribute the
> >> work to.)
> >>
> >> Hope this helps.
> >> J.B.