Zhou Jiaqi wrote:
A problem still remains: when I run the script with the normal command, i.e. ‘python kwant.py’, it starts many threads and runs slowly. However, when I use ‘OPENBLAS_NUM_THREADS=1 python kwant.py’, it starts only one thread and runs fast. I am quite confused about why the normal command uses many threads and performs so poorly.
These issues are known to the developers of OpenBLAS (see for example [1]). They are outside the scope of Kwant. You can make your choice of the number of threads permanent by setting the OPENBLAS_NUM_THREADS environment variable in a startup script, for example in .bashrc if you are using bash.

By default, OpenBLAS uses as many BLAS/LAPACK threads as there are logical CPU cores available. This is often not such a bad default if all you want to do is perform a single calculation and you are ready to throw all the cores that you have at it. The speed-up provided by the parallelization is often not very impressive, but it is better than nothing.

Of course, if you have 16 cores and launch 16 copies of Kwant, and each of them launches 16 threads, you will suffer from severe CPU over-subscription. During a CPU-bound computation the system load should not rise above the number of cores. This can be verified with the “uptime” command.

This story is further complicated by the “hyperthreading” feature of many CPUs. For example, on my laptop I have *four* logical cores that are just a way to better occupy the pipelines of the *two* physical cores. On this machine, running four CPU-bound threads can be a good idea if these perform a mix of different kinds of work (integer, floating-point, etc.). However, when doing number-crunching, e.g. using BLAS/LAPACK, it’s typically better to run only two threads. Thanks to better memory locality this results in better performance.

So, on a machine with a hyperthreading CPU, the OpenBLAS default of using as many threads as there are logical cores is, in my experience, never a good choice.

[1] https://github.com/xianyi/OpenBLAS/issues/1881
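
As a rough illustration of the advice above, the thread limit can also be set from inside the Python script instead of on the command line. This is only a sketch: the helper names threads_per_process and limit_openblas_threads are made up for this example, and the core and process counts are assumptions taken from the 16-core scenario described above. OpenBLAS reads the variable when the library is loaded, so it has to be set before NumPy, SciPy or Kwant are imported.

```python
import os


def threads_per_process(physical_cores, n_processes):
    """Share the physical cores among the processes, at least one thread each.

    Hypothetical helper for this example; both arguments are supplied by
    the user, nothing is detected automatically.
    """
    return max(1, physical_cores // n_processes)


def limit_openblas_threads(n_threads):
    """Set OPENBLAS_NUM_THREADS for the current process.

    Call this before importing NumPy, SciPy or Kwant, because OpenBLAS
    reads the variable when the library is first loaded.
    """
    os.environ["OPENBLAS_NUM_THREADS"] = str(n_threads)


# Assumption: 16 physical cores shared by 16 simultaneous runs,
# which leaves a single BLAS/LAPACK thread per run.
limit_openblas_threads(threads_per_process(16, 16))

# Import the numerical libraries only after the variable has been set:
# import kwant
```

For a single calculation on a hyperthreading machine with two physical cores, threads_per_process(2, 1) would give two threads, in line with the recommendation above.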