
Hi everyone,

I am writing to discuss the parallelization of Kwant. I installed Kwant with MUMPS from Ubuntu (https://launchpad.net/ubuntu/+source/mumps). When I run a Kwant script (shown at the end) in the normal way on a 4-core laptop (2 threads per core), the output of htop (screenshot: https://drive.google.com/open?id=1TwEuc21DMjRnVQ9yG3XpkVilhFiAKaeb) shows that all 8 CPU threads are busy and that there are 21 python threads.

However, according to the tutorial: "Kwant uses only the sequential, single core version of MUMPS. The advantages due to MUMPS as used by Kwant are thus independent of the number of CPU cores of the machine on which Kwant runs."

So I am confused: does Kwant 1.4.1 already employ parallelization, and why are there 21 python threads? Moreover, I find that using the concurrent.futures package slows down the calculation.

Thanks a lot for your time, and it would be greatly appreciated if anyone could share some news about Kwant parallelization development.

Best regards,
Jiaqi
UCLouvain
______________________

import kwant
import tbmodels
import numpy as np
import matplotlib.pyplot as plt

model = tbmodels.Model.from_wannier_files(
    hr_file='graphene_hr.dat',
    wsvec_file='graphene_wsvec.dat',
    xyz_file='graphene_centres.xyz',
    win_file='graphene.win',
    h_cutoff=0.01)

lattice = model.to_kwant_lattice()

wire = kwant.Builder()

def shape(p):
    x, y, z = p
    return -5 < x < 5 and -5 < y < 5 and -1 < z < 1

wire[lattice.shape(shape, (0, 0, 0))] = 0
model.add_hoppings_kwant(wire)

sym_lead_x = kwant.TranslationalSymmetry(lattice.vec((-2, 0, 0)))
lead_x = kwant.Builder(sym_lead_x)

def lead_shape(p):
    x, y, z = p
    return -5 <= x <= 5 and -5 < y < 5 and -1 < z < 1

lead_x[lattice.shape(lead_shape, (0, 0, 0))] = 0
model.add_hoppings_kwant(lead_x)

wire.attach_lead(lead_x)
wire.attach_lead(lead_x.reversed())

syst = wire.finalized()

def trans(energy):
    smatrix = kwant.smatrix(syst, energy)
    data = smatrix.transmission(1, 0)
    return data

def main():
    energies = np.linspace(0, 1, 100)
    tc = map(trans, energies)  # transmission coefficient
    te = zip(energies, tc)
    lte = list(te)
    print(lte)

if __name__ == '__main__':
    main()
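For the energy sweep in main(), one common pattern is a process pool from Python's standard concurrent.futures module. The following is only a minimal sketch: it reuses the module-level trans() function and the numpy import from the script above, the worker count of 4 is an arbitrary example, and it relies on the fork start method (the default on Linux) so that the workers inherit the finalized system.

import concurrent.futures

def main():
    energies = np.linspace(0, 1, 100)
    # The transmission at each energy is independent of the others, so the
    # sweep can be farmed out to separate worker processes.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
        tc = list(pool.map(trans, energies))  # transmission coefficients
    lte = list(zip(energies, tc))
    print(lte)

To actually gain from this, each worker should run with a single-threaded BLAS (see the replies below); otherwise the worker processes and the BLAS threads compete for the same cores.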

Hi Jiaqi,

Most likely you're seeing LAPACK-level parallelization. I'm not an expert on LAPACK's use of multiple CPU cores, but I haven't seen LAPACK parallelization result in any useful speedup. Try setting the following environment variables (they are used by different LAPACK implementations):

OPENBLAS_NUM_THREADS=1
OMP_NUM_THREADS=1
MKL_DYNAMIC=FALSE
MKL_NUM_THREADS=1

Best,
Anton
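The same settings can also be applied from inside the script; a minimal sketch, assuming the assignments happen before numpy, scipy, or kwant is first imported, since the BLAS/LAPACK libraries read these variables only once, when they are loaded:

import os

# Must come before the first import of numpy/scipy/kwant: the BLAS/LAPACK
# libraries read these variables only when they are first loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_DYNAMIC"] = "FALSE"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
import kwant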

Dear Anton,

Thank you very much for your timely reply. I find that OPENBLAS_NUM_THREADS=1 is quite effective: it forces the calculation to run in a single thread and improves the efficiency by a factor of about 3. Thanks for the advice!

One problem remains: when I run the script in the normal way, i.e. 'python kwant.py', it starts many threads and runs slowly. However, when I run 'OPENBLAS_NUM_THREADS=1 python kwant.py', it starts only one thread and runs fast. I am quite confused about why the normal invocation involves many threads and performs so poorly.

The procedure I used to install Kwant:

sudo apt-get install libmumps-scotch-dev  # for MUMPS
python setup.py build
python setup.py install

Thanks again for your help!

Sincerely,
Jiaqi

Hi Jiaqi,

Unfortunately, this is the default OpenBLAS configuration, and it is outside of our control, as far as I know.

Best,
Anton

Dear Anton,

Thanks for your kind reply; that's very helpful. Have a nice day!

Sincerely,
Jiaqi

Zhou Jiaqi wrote:
One problem remains: when I run the script in the normal way, i.e. 'python kwant.py', it starts many threads and runs slowly. However, when I run 'OPENBLAS_NUM_THREADS=1 python kwant.py', it starts only one thread and runs fast. I am quite confused about why the normal invocation involves many threads and performs so poorly.
These issues are known to the developers of OpenBLAS (see for example [1]). They are outside the scope of Kwant. You can make your choice of the number of threads permanent by setting the OPENBLAS_NUM_THREADS environment variable in a startup script, for example in .bashrc if you are using bash.

The default of OpenBLAS is to use as many BLAS/LAPACK threads as there are logical CPU cores available. This is often not such a bad default if all you want to do is perform a single calculation and you are ready to throw all the cores you have at it. The speed-up provided by this parallelization is often not very impressive, but better than nothing. Of course, if you have 16 cores, launch 16 copies of Kwant, and each of them launches 16 threads, you will suffer from severe CPU over-subscription. During a CPU-bound computation the system load should not rise above the number of cores; this can be verified with the “uptime” command.

This story is further complicated by the “hyperthreading” feature of many CPUs. For example, my laptop has *four* logical cores that are just a way to better occupy the pipelines of its *two* physical cores. On this machine, running four CPU-bound threads can be a good idea if they perform a mix of different kinds of work (integer, floating-point, etc.). However, when doing number-crunching, e.g. with BLAS/LAPACK, it is typically better to run only two threads: thanks to better memory locality this results in better performance. So, on a machine with a hyperthreading CPU, the OpenBLAS default of using as many threads as there are logical cores is, in my experience, never a good choice.

[1] https://github.com/xianyi/OpenBLAS/issues/1881
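The same checks can be made from within Python; a small sketch (os.getloadavg() exists on Unix-like systems only, and the psutil call in the comment is merely an assumed-installed third-party way to count physical cores):

import os

# Logical cores: the number OpenBLAS uses by default for its thread count.
print("logical cores:", os.cpu_count())

# 1-, 5- and 15-minute load averages, the same numbers that `uptime` prints;
# during a CPU-bound run they should stay at or below the core count.
print("load average:", os.getloadavg())

# Physical cores require a third-party package, for example psutil:
# import psutil
# print("physical cores:", psutil.cpu_count(logical=False))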

Dear Christoph,

Thanks for your detailed explanation. Since I use the output of Wannier90, the scale of my calculation is considerable. I have confirmed that OPENBLAS_NUM_THREADS=1 truly improves the efficiency, and I have added this setting to my .bashrc.

Here I share a data point for parallelization with the concurrent.futures package: if the single-CPU calculation takes a time t, the 12-CPU calculation takes about t/5.5, which is affordable on clusters.

Thanks to all of you; this discussion is important for DFT-based tight-binding calculations.

Sincerely,
Jiaqi
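An alternative to setting OPENBLAS_NUM_THREADS before the imports is to cap the BLAS thread count at run time with the third-party threadpoolctl package (an assumption here: it is not part of Kwant and must be installed separately). A sketch of a worker function that could replace trans() in a process-pool sweep like the one sketched earlier; syst is the finalized system from the original script:

from threadpoolctl import threadpool_limits  # third-party, assumed installed

def trans_single_threaded(energy):
    # Cap BLAS at one thread for this solve, so that many worker processes
    # (for example the 12 used above) do not oversubscribe the CPU cores.
    with threadpool_limits(limits=1, user_api="blas"):
        smatrix = kwant.smatrix(syst, energy)
        return smatrix.transmission(1, 0)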
participants (3)
- Anton Akhmerov
- Christoph Groth
- Zhou Jiaqi