<div dir="ltr"><div dir="ltr"><div>Dear All,</div><div><br></div><div>Alex Lovell-Troy heads up innovation/cloud supercomputing at Cray (cc'd) and he is a great resource for all things. I thought he might find this thread useful.</div><div><br></div><div>Best, Alex<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jun 28, 2019 at 11:45 PM Olivier Grisel <<a href="mailto:olivier.grisel@ensta.org">olivier.grisel@ensta.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">You have to use a dedicated framework to distribute the computation on a cluster like your Cray system.<div dir="auto"><br></div><div dir="auto">You can use MPI, or Dask with dask-jobqueue, but you also need parallel algorithms that remain efficient in a distributed setting, where communication between worker nodes has a high cost.</div><div dir="auto"><br></div><div dir="auto">I am not sure that the DBSCAN implementation in scikit-learn would benefit much from naively running in distributed mode.</div>
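<div dir="auto"><br></div><div dir="auto">For instance, a minimal dask-jobqueue sketch (illustrative only; the queue name, resources, and worker count below are placeholders to adapt to your PBS setup):</div><div dir="auto"><pre>
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Describe what one PBS job looks like; the queue and
# resources here are hypothetical placeholders.
cluster = PBSCluster(
    queue="workq",
    cores=8,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)  # submit 4 such jobs as Dask workers

client = Client(cluster)
# Work submitted through this client (e.g. via
# joblib.parallel_backend("dask")) now runs on the cluster.
</pre></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jun 28, 2019 at 22:06, Mauricio Reis <<a href="mailto:reismc@ime.eb.br" target="_blank">reismc@ime.eb.br</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Sorry, but just now I reread your answer more closely.<br>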
<br>
It seems that the "n_jobs" parameter of the DBSCAN routine brings no<br>
performance benefit. If I want to improve the performance of the<br>
DBSCAN routine, I will have to redesign the solution to use MPI<br>
resources.<br>
<br>
Is that correct?<br>
<br>
---<br>
Regards,<br>
Mauricio Reis<br>
<br>
On 28/06/2019 16:47, Mauricio Reis wrote:<br>
> My laptop has an Intel i7 processor with 4 cores. When I run the program<br>
> on Windows 10, the "joblib.cpu_count()" routine returns "4". On this<br>
> machine, the same test I ran on the Cray computer showed a 10% increase<br>
> in the processing time of the DBSCAN routine when I used the "n_jobs =<br>
> 4" parameter, compared to the processing time of that routine without<br>
> this parameter. Do you know the cause of the longer processing time<br>
> when I use "n_jobs = 4" on my laptop?<br>
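> <br>
> (A minimal sketch of that comparison, with synthetic data standing in<br>
> for my dataset:)<br>
> <br>
> from time import perf_counter<br>
> from sklearn.cluster import DBSCAN<br>
> from sklearn.datasets import make_blobs<br>
> <br>
> X, _ = make_blobs(n_samples=20000, random_state=0)<br>
> <br>
> # Time the same fit with and without n_jobs=4.<br>
> for n_jobs in (None, 4):<br>
>     start = perf_counter()<br>
>     DBSCAN(eps=0.5, min_samples=5, n_jobs=n_jobs).fit(X)<br>
>     print(n_jobs, perf_counter() - start)<br>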
> <br>
> ---<br>
> Regards,<br>
> Mauricio Reis<br>
> <br>
> On 28/06/2019 06:29, Brown J.B. via scikit-learn wrote:<br>
>>> where you can see "ncpus = 1" (I still do not know why 4 lines were<br>
>>> printed -<br>
>>> <br>
>>> (total of 40 nodes) and each node has 1 CPU and 1 GPU!<br>
>> <br>
>>> #PBS -l select=1:ncpus=8:mpiprocs=8<br>
>>> aprun -n 4 p.sh ./ncpus.py<br>
>> <br>
>> You can request 8 CPUs from a job scheduler, but if each node the<br>
>> script runs on contains only one virtual/physical core, then<br>
>> cpu_count() will return 1.<br>
>> If that CPU supports multi-threading, you would typically get 2.<br>
>> <br>
>> For example, on my workstation:<br>
>> `--> egrep "processor|model name|core id" /proc/cpuinfo<br>
>> processor : 0<br>
>> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz<br>
>> core id : 0<br>
>> processor : 1<br>
>> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz<br>
>> core id : 1<br>
>> processor : 2<br>
>> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz<br>
>> core id : 0<br>
>> processor : 3<br>
>> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz<br>
>> core id : 1<br>
>> `--> python3 -c "import joblib; print(joblib.cpu_count())"<br>
>> 4<br>
>> <br>
>> It seems that in this situation, if you want to parallelize<br>
>> *independent* sklearn calculations (e.g., changing the dataset or<br>
>> random seed), you would request the MPI processes from PBS as you<br>
>> have, but you'll need to place the sklearn computations in a function<br>
>> and then take care of distributing that function call across the MPI<br>
>> processes, e.g., as sketched below.<br>
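>> <br>
>> For example, a rough mpi4py sketch of that pattern (illustrative<br>
>> only; here each rank just varies the random seed of a synthetic<br>
>> dataset):<br>
>> <br>
>> from mpi4py import MPI<br>
>> from sklearn.cluster import DBSCAN<br>
>> from sklearn.datasets import make_blobs<br>
>> <br>
>> comm = MPI.COMM_WORLD<br>
>> rank = comm.Get_rank()<br>
>> <br>
>> # Hypothetical per-rank work: each MPI process fits DBSCAN<br>
>> # on its own dataset / random seed.<br>
>> X, _ = make_blobs(n_samples=10000, random_state=rank)<br>
>> labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)<br>
>> <br>
>> # Gather a small per-rank summary on rank 0.<br>
>> n_clusters = len(set(labels)) - (1 if -1 in labels else 0)<br>
>> results = comm.gather((rank, n_clusters), root=0)<br>
>> if rank == 0:<br>
>>     print(results)<br>
>> <br>
>> Launched with something like "aprun -n 4 python script.py", each<br>
>> rank then runs one independent DBSCAN fit.<br>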
>> <br>
>> Then again, if the runs are independent, it's a lot easier to write a<br>
>> for loop in a shell script that changes the dataset/seed and submits<br>
>> each run to the job scheduler, letting the scheduler take care of the<br>
>> parallel distribution; a sketch of this idea follows below.<br>
>> (I do this when performing 10+ independent runs of sklearn modeling,<br>
>> where models use multiple threads during calculations; in my case,<br>
>> SLURM then takes care of finding the available nodes to distribute the<br>
>> work to.)<br>
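>> <br>
>> For instance, the same idea driven from Python instead of a shell<br>
>> script (sketch; "qsub" and the job-script name are placeholders for<br>
>> your scheduler and environment):<br>
>> <br>
>> import subprocess<br>
>> <br>
>> # Submit one scheduler job per independent run; the job script<br>
>> # reads SEED from its environment. All names are hypothetical.<br>
>> for seed in range(10):<br>
>>     subprocess.run(<br>
>>         ["qsub", "-v", f"SEED={seed}", "run_dbscan.pbs"],<br>
>>         check=True,<br>
>>     )<br>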
>> <br>
>> Hope this helps.<br>
>> J.B.<br>
</blockquote></div>
_______________________________________________<br>
scikit-learn mailing list<br>
<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><br></div><div>Alex Morrise, PhD</div><div>Co-Founder & CTO, <a href="http://StayOpen.com">StayOpen.com</a></div><div><div>Chief Science Officer, <a href="http://mediajel.com/" style="color:rgb(17,85,204)" target="_blank">MediaJel.com</a></div></div>Professional Bio: <a href="http://www.linkedin.com/in/amorrise" style="color:rgb(17,85,204)" target="_blank">Machine Learning Intelligence</a></div></div></div>