I'd like to understand how parallelism works in the DBSCAN routine in scikit-learn running on the Cray computer, and what I should do to improve the results I'm seeing. I adapted the existing example at [https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-...] to run with 100,000 points, so that each run takes long enough to allow a reasonable evaluation of the timings. I varied the parameter "n_jobs = x", with "x" ranging from 1 to 6, repeated each experiment several times, and averaged the processing times:

n_jobs   time
1        21.3
2        15.1
3        14.8
4        15.2
5        15.5
6        15.0

These are the times shown in the table above and in the attached image. As can be seen, there was an effective gain only at "n_jobs = 2", and no difference for larger values. And even then, the gain was less than 30%! Why were the gains so small? Why was there no further gain for larger values of the "n_jobs" parameter? Is it possible to improve the results I have obtained?

--
Regards,
Mauricio Reis
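A minimal sketch of the kind of benchmark described above (the dataset shape, eps, min_samples, and file name are assumptions for illustration, not the exact values from the experiment):

=== dbscan_bench.py (illustrative) ===
import time
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# 100,000 points, as in the experiment; the cluster layout is assumed
X, _ = make_blobs(n_samples=100_000, centers=3, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

for n_jobs in range(1, 7):
    t0 = time.perf_counter()
    DBSCAN(eps=0.3, min_samples=10, n_jobs=n_jobs).fit(X)
    print(n_jobs, time.perf_counter() - t0)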
I cannot access the Cray computer at this moment to run the suggested code. Once I have access, I'll let you know. But the documentation (provided by the teacher in charge of the Cray computer) shows:

- 10 blades
- 4 nodes per blade = 40 nodes
- each node: 1 CPU, 1 GPU, 32 GBytes

--
Regards,
Mauricio Reis

On 19/06/2019 17:44, Olivier Grisel wrote:
How many cores do you have on this machine?
joblib.cpu_count()
On Thu, Jun 20, 2019 at 8:16, Mauricio Reis <reismc@ime.eb.br> wrote:
But the documentation (provided by the teacher in charge of the Cray computer) shows:
- each node: 1 CPU, 1 GPU, 32 GBytes
If that's true, then it appears to me that any individual compute host (node) has 1 core / 2 threads, which would be why you don't see any more performance beyond n_jobs=2. For n_jobs=3/4/..., you're just asking the same amount of compute hardware to do the same calculations. As suggested, you'll need to execute joblib.cpu_count() to determine what your host environment actually provides.
Finally I was able to access the Cray computer and run the "cpu_count" routine. Below are the files and commands I used and the result found, where you can see "ncpus = 1" (I still do not know why 4 lines were printed; I only know that this number depends on the value given to the "aprun" command in the file "ncpus.pbs"). I do not know whether you know the Cray computer environment well enough to understand what I did! I use a Cray XK7, which has 10 blades, each blade has 4 nodes (40 nodes in total), and each node has 1 CPU and 1 GPU!

--
Regards,
Mauricio Reis

----------------------------------------------------------------------------------------------
=== p.sh ===
#!/bin/bash
/usr/local/python_3.7/bin/python3.7 $1

=== ncpus.py ===
from sklearn.externals import joblib
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
ncpus = joblib.cpu_count()
print("--- ncpus =", ncpus)

=== ncpus.pbs ===
#!/bin/bash
#PBS -l select=1:ncpus=8:mpiprocs=8
#PBS -j oe
#PBS -l walltime=00:00:10
date
echo "[$PBS_O_WORKDIR]"
cd $PBS_O_WORKDIR
aprun -n 4 p.sh ./ncpus.py

=== command ===
qsub ncpus.pbs

=== output ===
Thu Jun 27 05:22:35 BRT 2019
[/home/reismc]
The scikit-learn version is 0.20.3.
The scikit-learn version is 0.20.3.
The scikit-learn version is 0.20.3.
The scikit-learn version is 0.20.3.
--- ncpus = 1
--- ncpus = 1
--- ncpus = 1
--- ncpus = 1
Application 32826 resources: utime ~8s, stime ~1s, Rss ~43168, inblocks ~102981, outblocks ~0
----------------------------------------------------------------------------------------------

On 19/06/2019 17:44, Olivier Grisel wrote:
How many cores do you have on this machine?
joblib.cpu_count()
where you can see "ncpus = 1" (I still do not know why 4 lines were printed)
(total of 40 nodes) and each node has 1 CPU and 1 GPU!
#PBS -l select=1:ncpus=8:mpiprocs=8
aprun -n 4 p.sh ./ncpus.py
You can request 8 CPUs from a job scheduler, but if each node the script runs on contains only one virtual/physical core, then cpu_count() will return 1. If that CPU supports multi-threading, you would typically get 2.

For example, on my workstation:

`--> egrep "processor|model name|core id" /proc/cpuinfo
processor : 0
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 0
processor : 1
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 1
processor : 2
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 0
processor : 3
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 1

`--> python3 -c "from sklearn.externals import joblib; print(joblib.cpu_count())"
4

It seems that in this situation, if you want to parallelize *independent* sklearn calculations (e.g., changing the dataset or the random seed), you would request the MPI processes via PBS as you have done, but you would need to place the sklearn computation in a function and then take care of distributing that function call across the MPI processes.

Then again, if the runs are independent, it's a lot easier to write a for loop in a shell script that changes the dataset/seed and submits each run to the job scheduler, letting the scheduler take care of the parallel distribution. (I do this when performing 10+ independent runs of sklearn modeling, where each model uses multiple threads during its calculations; in my case, SLURM then takes care of finding the available nodes to distribute the work to.)

Hope this helps.
J.B.
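A minimal sketch of the MPI approach described above, assuming mpi4py is available on the compute nodes (the file name, seeds, and dataset parameters are placeholders, not values from the thread):

=== dbscan_mpi.py (illustrative) ===
from mpi4py import MPI
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

comm = MPI.COMM_WORLD
rank = comm.Get_rank()           # this process's index among the MPI processes
size = comm.Get_size()           # total number of MPI processes

seeds = list(range(8))           # one independent DBSCAN run per seed
for seed in seeds[rank::size]:   # round-robin assignment of runs to ranks
    X, _ = make_blobs(n_samples=100_000, random_state=seed)
    labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("rank", rank, "seed", seed, "clusters", n_clusters)

Such a script could then be launched with the same pattern as earlier in the thread, e.g. "aprun -n 4 p.sh ./dbscan_mpi.py", so that each of the 4 processes handles its own share of the independent runs.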
My laptop has an Intel i7 processor with 4 cores. When I run the program on Windows 10, the "joblib.cpu_count()" routine returns "4". There, the same test I did on the Cray computer showed a 10% increase in the processing time of the DBSCAN routine when I used "n_jobs = 4", compared to the processing time of that routine without this parameter. Do you know what causes the longer processing time when I use "n_jobs = 4" on my laptop?

--
Regards,
Mauricio Reis
Sorry, but only now have I reread your answer more closely. It seems that the "n_jobs" parameter of the DBSCAN routine brings no benefit to performance here. If I want to improve the performance of the DBSCAN routine, I will have to redesign the solution to use MPI resources. Is that correct?

--
Regards,
Mauricio Reis
You have to use a dedicated framework to distribute the computation on a cluster like your Cray system. You can use MPI, or Dask with dask-jobqueue, but then you also need to run parallel algorithms that are efficient in a distributed setting where communication between worker nodes is expensive. I am not sure that the DBSCAN implementation in scikit-learn would benefit much from naively running in distributed mode.
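For the dask-jobqueue route, a minimal sketch of farming out independent runs (the resource values mirror the node description earlier in the thread, but the file name, queue defaults, and dataset parameters are assumptions):

=== dbscan_dask.py (illustrative) ===
from dask.distributed import Client
from dask_jobqueue import PBSCluster
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def run_dbscan(seed):
    # one independent DBSCAN run; only the seed varies between runs
    X, _ = make_blobs(n_samples=100_000, random_state=seed)
    labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
    return len(set(labels)) - (1 if -1 in labels else 0)

cluster = PBSCluster(cores=1, memory="32GB", walltime="00:10:00")
cluster.scale(jobs=4)            # ask PBS for 4 single-core jobs
client = Client(cluster)
results = client.gather(client.map(run_dbscan, range(4)))
print(results)

Note that this only runs *independent* DBSCAN fits in parallel; it does not make a single DBSCAN fit distributed.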
Dear All,

Alex Lovell-Troy heads up innovation/cloud supercomputing at Cray (cc'd) and is a great resource for all things. I thought he might find this thread useful.

Best,
Alex
--
Alex Morrise, PhD
Co-Founder & CTO, StayOpen.com
Chief Science Officer, MediaJel.com <http://mediajel.com/>
Professional Bio: Machine Learning Intelligence <http://www.linkedin.com/in/amorrise>
Participants (4):
- Brown J.B.
- desitter.gravity@gmail.com
- Mauricio Reis
- Olivier Grisel