Performance issue with Kwant on CentOS with mpi4py
Hello,

I have recently installed Kwant using miniconda onto my own account (no root access). I installed Python and all its dependencies as well with your help. Thank you!!

When I run the code on a single processor, it works fine. When I add more CPUs, each calculation takes longer. The code builds a system with hopping, on-site repulsion and SOC, with leads, and then I calculate its scattering matrix. As a check I ran a non-Kwant function (a simple summation); that calculation takes almost the same time independent of the number of CPUs. I understand that Kwant itself is not a parallel program, but given the same task it should take the same time to finish regardless of how many times it has been called. The code I am using builds 2 systems and calculates 2 scattering matrices. I paste below the times it takes for each number of CPUs. The time increases faster than linearly. I am guessing this is an issue with sharing the library, but I am not certain. Any ideas?

Thanks,
Fatih

Single CPU
-----------------
hello world from process 0
done system 1 0 0.05028367042541504
done smatrix 1 0 0.23258066177368164
done system 2 0 3.244915723800659
done smatrix 2 0 13.960146427154541

Two CPUs
---------------
hello world from process 1
done system 1 1 0.04906630516052246
done smatrix 1 1 0.23961257934570312
done system 2 1 3.5525107383728027
done smatrix 2 1 16.66365075111389

hello world from process 0
done system 1 0 0.045102834701538086
done smatrix 1 0 0.20883417129516602
done system 2 0 2.842400550842285
done smatrix 2 0 14.466880559921265

Four CPUs
---------------
hello world from process 1
done system 1 1 0.09960174560546875
done smatrix 1 1 0.630927324295044
done system 2 1 3.514568328857422
done smatrix 2 1 119.35529565811157

hello world from process 2
done system 1 2 0.09219717979431152
done smatrix 1 2 0.8386905193328857
done system 2 2 4.74960732460022
done smatrix 2 2 119.06909513473511

hello world from process 3
done system 1 3 0.13100957870483398
done smatrix 1 3 0.860680341720581
done system 2 3 3.4969165325164795
done smatrix 2 3 120.44963669776917

hello world from process 0
done system 1 0 0.1307511329650879
done smatrix 1 0 0.8210744857788086
done system 2 0 4.6470348834991455
done smatrix 2 0 120.91561484336853

8 CPUs
--------------
hello world from process 3
done system 1 3 0.08914065361022949
done smatrix 1 3 1.9215054512023926
done system 2 3 12.615397214889526
done smatrix 2 3 264.25777077674866

hello world from process 1
done system 1 1 0.05163288116455078
done smatrix 1 1 1.5547888278961182
done system 2 1 7.847089767456055
done smatrix 2 1 264.47394609451294

hello world from process 4
done system 1 4 0.05323219299316406
done smatrix 1 4 1.8263118267059326
done system 2 4 8.452224254608154
done smatrix 2 4 258.97272992134094

hello world from process 6
done system 1 6 0.0951833724975586
done smatrix 1 6 1.7297015190124512
done system 2 6 9.489607334136963
done smatrix 2 6 258.52338123321533

hello world from process 7
done system 1 7 0.052448272705078125
done smatrix 1 7 1.3691565990447998
done system 2 7 8.31990361213684
done smatrix 2 7 260.8307328224182

hello world from process 2
done system 1 2 0.0588071346282959
done smatrix 1 2 1.5174744129180908
done system 2 2 5.244871377944946
done smatrix 2 2 242.011638879776

hello world from process 5
done system 1 5 0.05212259292602539
done smatrix 1 5 1.5607316493988037
done system 2 5 5.980914831161499
done smatrix 2 5 251.20327258110046

hello world from process 0
done system 1 0 0.05146503448486328
done smatrix 1 0 1.4909172058105469
done system 2 0 5.17970871925354
done smatrix 2 0 246.73624682426453
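The timings above look like per-rank wall-clock measurements printed by each MPI process. A minimal sketch of such a timing harness is shown below; the helper name `timed` and the summation workload are illustrative placeholders, since the original script was not posted.

```python
import time

try:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()  # each rank runs an independent copy of the script
except ImportError:
    rank = 0  # fall back to a single "rank" when mpi4py is unavailable

def timed(label, fn):
    """Run fn(), print its wall-clock time in the 'done <label> <rank> <seconds>' format."""
    t0 = time.time()
    result = fn()
    print("done", label, rank, time.time() - t0)
    return result

print("hello world from process", rank)
# Placeholder workload; in the real script this would be the Kwant
# system construction and kwant.smatrix calls.
total = timed("system 1", lambda: sum(range(10**6)))
```

Since every rank does the same amount of work independently, the per-call times should be roughly constant as ranks are added; the faster-than-linear growth of the smatrix times is what points to contention in a shared resource.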
Hi,
> When I run a code on a single processor, it works fine. When I add more
> CPUs, each calculation takes longer time. Code builds a system with
> hopping, on-site repulsion and SOC, with leads, then I calculate the
> scattering matrix of it.
Did you make sure to set the number of threads that BLAS uses to 1? By default BLAS will try and use all the cores on the machine, although this rarely seems to produce any speedup for Kwant's use case. You should try executing this before running your script:

    export OMP_NUM_THREADS=1

Bear in mind that if you are using a queuing system on this machine (e.g. PBS) then the above line will need to go into the submission script. Depending on what MPI implementation you are using, you may need to tell 'mpirun' to set this environment variable on all ranks. I know that for OpenMPI you would do something like:

    mpirun -n 4 -x OMP_NUM_THREADS=1 python my_kwant_script.py

Try that and let us know if it solves your problem.

Happy Kwanting,

Joe
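The same limit can also be applied from inside the Python script itself. A sketch, assuming the environment variables are set before NumPy (and hence Kwant and the BLAS library) is first imported; the exact variable that matters depends on which BLAS your conda environment ships:

```python
import os

# These must be set *before* NumPy/Kwant are imported, because the
# BLAS thread pool is sized when the library is first loaded.
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS-specific override
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL, common in conda installs

import numpy as np  # BLAS now starts with a single thread per MPI rank
```

This avoids having to pass `-x` flags through `mpirun` or edit the batch submission script, at the cost of hard-coding the setting in the script.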
participants (2)
-
Fatih Dogan
-
Joseph Weston