[scikit-learn] Scikit Learn in a Cray computer

Wed Jun 19 16:36:39 EDT 2019

I'd like to understand how parallelism works in the DBSCAN routine in 
scikit-learn when running on a Cray computer, and what I should do to 
improve the results I'm seeing.

I adapted the existing example at 
[https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py] 
to run with 100,000 points, so that the processing time is long enough 
to be measured meaningfully. I varied the parameter "n_jobs = x", with 
"x" ranging from 1 to 6. I repeated each experiment several times and 
calculated the average processing time.
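
A minimal sketch of this kind of timing experiment, assuming the data 
generation and DBSCAN settings of the linked example (make_blobs with 
three centers, eps=0.3, min_samples=10). The helper name time_dbscan 
and its parameters are illustrative only, not the exact script used:

```python
import time

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def time_dbscan(n_samples=100_000, n_jobs_values=range(1, 7), repeats=3):
    """Return {n_jobs: mean wall-clock seconds of DBSCAN.fit}."""
    # Same synthetic data as the linked example, scaled up to n_samples
    # points (assumption: only the sample count was changed).
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, _ = make_blobs(
        n_samples=n_samples, centers=centers, cluster_std=0.4, random_state=0
    )
    X = StandardScaler().fit_transform(X)

    results = {}
    for n_jobs in n_jobs_values:
        elapsed = []
        for _ in range(repeats):
            # Time only the clustering step, not the data generation.
            start = time.perf_counter()
            DBSCAN(eps=0.3, min_samples=10, n_jobs=n_jobs).fit(X)
            elapsed.append(time.perf_counter() - start)
        results[n_jobs] = sum(elapsed) / len(elapsed)
    return results
```

Calling time_dbscan() with the defaults runs the 100,000-point case for 
n_jobs from 1 to 6 and returns the average time for each value.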

n_jobs	time
1	21.3
2	15.1
3	14.8
4	15.2
5	15.5
6	15.0

The resulting times are shown in the table above and in the attached 
image. As can be seen, there was a real gain only at "n_jobs = 2", and 
no further improvement for larger values. Even then, the reduction in 
processing time was just under 30%.
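
To make these figures concrete, here is the arithmetic on the averages 
from the table. Speedup and parallel efficiency follow their usual 
definitions; the variable names are illustrative and nothing beyond the 
table values is measured here:

```python
# Average times from the table above, keyed by n_jobs (units as reported).
times = {1: 21.3, 2: 15.1, 3: 14.8, 4: 15.2, 5: 15.5, 6: 15.0}

baseline = times[1]

# Speedup relative to the single-job run.
speedup = {n: baseline / t for n, t in times.items()}

# Fraction of the single-job time that was saved.
reduction = {n: (baseline - t) / baseline for n, t in times.items()}

# Parallel efficiency: speedup divided by the number of jobs.
efficiency = {n: speedup[n] / n for n in times}

for n in times:
    print(
        f"n_jobs={n}: speedup={speedup[n]:.2f}, "
        f"reduction={reduction[n]:.1%}, efficiency={efficiency[n]:.2f}"
    )
```

For "n_jobs = 2" this gives a speedup of about 1.41 (a 29.1% reduction 
in time). The best observed case, "n_jobs = 3", reaches a speedup of 
about 1.44.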

Why were the gains so small? Why did larger values of the "n_jobs" 
parameter bring no further gain? Is it possible to improve the results I 
have obtained?

-- 
Ats.,
Mauricio Reis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Time_X_CPUs (Cray).jpg
Type: image/jpeg
Size: 23348 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190619/0f518db9/attachment-0001.jpg>