Increasing the number of CPU cores does not speed up the work

Dear yt developers,

I found that increasing the number of CPU cores used on a single node, which has 16 physical cores plus 16 additional hyper-threads, can dramatically degrade the speed-up, as figure [3] shows. You can reproduce figure [3] as follows: running shell script [1] automatically invokes yt script [2] and records the elapsed time in files named core_* [4].

[1] shell script: http://paste.yt-project.org/show/150/
[2] yt script: http://paste.yt-project.org/show/151/
[3] # of cores vs. elapsed time: http://i.imgur.com/mFk6lKu.png
[4] all scripts, data and log files: http://use.yt/upload/d8efa692

Thanks for your help.
Rocky Tseng

In reply to Tseng, Po-Hsun <zengbs@gmail.com> (Wed, Jul 17, 2019 at 10:13 AM):

Hi Rocky,

Great question. There are a few things to consider when measuring yt's parallel performance. First, yt operations tend to be very I/O heavy, so performance depends quite a bit on the read speed of the filesystem. Second, it looks like the total time to finish the script is on the order of seconds. Most yt functionality involves some fixed overhead, which may be a considerable portion of a runtime that short. If you can try a larger dataset, where the script takes a little longer, you should find better scaling, but I/O speed may still be a limitation.

Britton

_______________________________________________
yt-users mailing list -- yt-users@python.org
To unsubscribe send an email to yt-users-leave@python.org

Hi Britton,

Thanks for your prompt reply. Following your advice, I moved the I/O step outside the timed region [1] and made the total run time longer, which gives the scaling shown in [2]. In figure [2], the performance with 2 cores is only 124/112 ≈ 1.1 times better than with 1 core. Does this make sense? I hope I am missing something, because I have a huge amount of data to analyze. Thanks.

[1] http://paste.yt-project.org/show/152/
[2] http://i.imgur.com/3vALOL1.png

Rocky
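To put the 124/112 figure in context, the short sketch below computes the speed-up, the parallel efficiency, and the serial fraction implied by Amdahl's law. It assumes that 124 and 112 are the elapsed times for 1 and 2 cores in the same units; the function names are illustrative and not part of yt.

```python
def speedup(t_serial, t_parallel):
    """Ratio of the 1-core time to the N-core time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Speed-up divided by the number of processes (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_procs


def amdahl_serial_fraction(s, n_procs):
    """Solve Amdahl's law S = 1 / (f + (1 - f) / N) for the serial fraction f."""
    return (n_procs / s - 1.0) / (n_procs - 1.0)


# Assumed inputs: elapsed times read off the message above.
t1, t2, n = 124.0, 112.0, 2

s = speedup(t1, t2)
print(f"speed-up:        {s:.3f}")
print(f"efficiency:      {efficiency(t1, t2, n):.3f}")
print(f"serial fraction: {amdahl_serial_fraction(s, n):.3f}")
```

Under these assumptions the efficiency is about 55% and the implied serial fraction is about 0.81. This would be consistent with most of the timed region being work that does not speed up with more processes, such as I/O or fixed overhead, although two data points cannot distinguish between those causes.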

In reply to Tseng, Po-Hsun <zengbs@gmail.com> (Wed, Jul 17, 2019 at 9:43 AM):

yt.load() doesn't do any I/O beyond reading some simulation metadata, so there's likely still I/O contention happening inside the timed part of your script.

Nathan

In reply to Nathan <nathan.goldbaum@gmail.com> (Wed, Jul 17, 2019 at 2:52 PM):

Hi Rocky,

That's still not great, but it does confirm my suspicion that scaling improves with bigger data. Hopefully, that trend will continue to even larger datasets. Additionally, if you have more than one dataset to analyze, you can parallelize your loop over datasets with the parallel_objects and piter commands:

https://yt-project.org/docs/dev/analyzing/parallel_computation.html#parallel...

This may work better for you, since it requires significantly less communication between processes. Again, you'll still be limited by the performance of your filesystem. In practice, I find that 8-16 processes is the absolute most I can use efficiently in parallel on an average computing cluster.

Britton
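The suggestion to parallelize over datasets with parallel_objects could look roughly like the sketch below. It is a minimal illustration, not the script from this thread: the function name, the file names, and the choice of the ("gas", "density") extrema as the per-dataset quantity are all assumptions. yt is imported inside the function so that the definition itself does not require yt to be installed.

```python
import os


def density_extrema_per_dataset(filenames, njobs=0):
    """Process each dataset on a separate group of MPI processes.

    Returns a dict mapping each file's base name to the extrema of the
    ("gas", "density") field. The field choice is only an example.
    """
    import yt  # requires yt, plus mpi4py for an actual parallel run

    yt.enable_parallelism()

    storage = {}
    # Each iteration is handed to one job; njobs=0 is the default,
    # which dispatches one job per available processor.
    for sto, fn in yt.parallel_objects(filenames, njobs, storage=storage):
        ds = yt.load(fn)
        ad = ds.all_data()
        sto.result_id = os.path.basename(fn)
        sto.result = ad.quantities.extrema(("gas", "density"))

    # After the loop, the results from all jobs are collected in storage.
    return storage


# Hypothetical usage, launched with e.g. `mpirun -np 8 python my_script.py`:
#   results = density_extrema_per_dataset(["Data_000000", "Data_000001"])
```

Because each dataset is handled independently, the processes only communicate when the results are gathered at the end. Concurrent reads can still contend for the filesystem, so the process count is worth tuning against the measured run time.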
participants (3)
- Britton Smith
- Nathan
- Tseng, Po-Hsun