kwant much faster in conda?
Hello everyone,

This might be a bit off topic. I've been using Kwant via Anaconda for a while because it was easier to install, but recently I've moved to a standard virtualenv and compiled Kwant on my own (including MUMPS and OpenBLAS). However, I've noticed that my code runs much faster with conda+kwant than with my virtualenv+kwant.

I'm still tracking down why there is such a huge difference. The code basically builds a system without leads, extracts the "hamiltonian_submatrix" for a given set of parameters, and calculates the eigenvalues. Then I loop over a set of parameters. Running the code in the conda environment takes 15 minutes, but in my own virtualenv with compiled Kwant it takes 2 hours! It's exactly the same code.

In conda, numpy runs with MKL, while in my virtualenv it runs with OpenBLAS. I've run a benchmark and MKL gives a nearly 2x speedup over OpenBLAS for SVD and eigenvalue decomposition. I've compiled Kwant with MUMPS and it seems OK. I've also tested with the Kwant binary from the Debian packages by setting "include-system-site-packages = false", and I get the same result (2 h running time).

Is there anything else I could be missing that would speed up the code? It seems that numpy is not the problem, since MKL vs OpenBLAS is at most a factor of 2. Maybe my compilation is not properly linking to MUMPS? How can I check whether my Kwant compilation is properly using OpenBLAS and MUMPS?
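In case it helps, the structure of the code is roughly the following. This is only a sketch with a toy square-lattice model and a placeholder parameter instead of my actual Hamiltonian, and it assumes Kwant >= 1.4 for the params keyword of hamiltonian_submatrix:

import numpy as np
import kwant

# Toy stand-in for the real model: a square lattice with one orbital per site
# and an onsite term that depends on a parameter U.
lat = kwant.lattice.square(a=1, norbs=1)

def onsite(site, U):
    return 4.0 + U

syst = kwant.Builder()
syst[(lat(x, y) for x in range(30) for y in range(30))] = onsite
syst[lat.neighbors()] = -1.0
fsyst = syst.finalized()

# Loop over a set of parameters, build the dense Hamiltonian, diagonalize.
for U in np.linspace(0.0, 1.0, 5):
    ham = fsyst.hamiltonian_submatrix(params=dict(U=U))
    evals = np.linalg.eigvalsh(ham)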
Gerson J. Ferreira wrote:
I'm still tracking down why there is such a huge difference. The code basically builds a system without leads, extracts the "hamiltonian_submatrix" for a given set of parameters, and calculates the eigenvalues. Then I loop over a set of parameters. Running the code in the conda environment takes 15 minutes, but in my own virtualenv with compiled Kwant it takes 2 hours! It's exactly the same code.
Often when there is a difference in performance between different deployments of Kwant, it is due to MUMPS being used or not. After all, the MUMPS transport solver is typically considerably faster than the one using scipy.sparse. However, in your case you seem not to be using MUMPS at all, so it should not matter whether it's available or not.

If you use Kwant to create a builder, finalize it, and run hamiltonian_submatrix, the code that takes up time is predominantly

• Python Kwant code,
• Cython Kwant code (for example hamiltonian_submatrix),
• and Tinyarray (implemented in C++).

There should not be a huge difference in the time it takes to execute such code, no matter whether conda is used or not. There should not be a huge difference even if different compilers were used, unless something unusual was done, like disabling compiler optimizations.

My guess is that the runtime difference could be due to the eigenvalue calculation. Perhaps the MKL routine that you are most likely using is significantly faster? You should be able to test this hypothesis by adding some simple timing to your script; time.time() should do the job. Or you can try to do some more serious profiling if you like.
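For example, something along these lines (just a sketch with placeholder names for your finalized system and parameter set; it assumes Kwant >= 1.4, where hamiltonian_submatrix takes a params keyword) would already show whether the time goes into Kwant or into the eigenvalue routine:

import time
import numpy as np

def time_one_step(fsyst, params):
    # Time the Hamiltonian construction and the eigenvalue step separately.
    t0 = time.time()
    ham = fsyst.hamiltonian_submatrix(params=params)
    t1 = time.time()
    evals = np.linalg.eigvalsh(ham)   # the Hamiltonian should be Hermitian
    t2 = time.time()
    print("hamiltonian_submatrix: %.3f s" % (t1 - t0))
    print("eigenvalues:           %.3f s" % (t2 - t1))
    return evals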
Is there anything else I could be missing that would speed up the code? It seems that numpy is not the problem, since MKL vs OpenBLAS is at most a factor of 2. Maybe my compilation is not properly linking to MUMPS? How can I check whether my Kwant compilation is properly using OpenBLAS and MUMPS?
If you only create a system and evaluate its Hamiltonian, Kwant should not do any significant linear-algebra computations. But perhaps I'm overlooking something - could you try to narrow down the problem and let us know what you find out? Ideally you could post a simple example script so that others can try to reproduce it.

Christoph
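P.S. Regarding your question about the build: one quick sanity check (just a sketch; it assumes a Kwant version that exposes its MUMPS bindings as kwant.linalg.mumps) is to see whether that module imports, and to ask numpy which BLAS/LAPACK it was built against:

import numpy as np

# Show which BLAS/LAPACK libraries numpy was built against
# (look for "mkl" or "openblas" in the output).
np.show_config()

# Recent Kwant versions only provide this module when compiled with MUMPS.
try:
    from kwant.linalg import mumps
    print("Kwant was built with MUMPS support.")
except ImportError:
    print("Kwant was built WITHOUT MUMPS support.")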
Hi Christoph,

I did some tests already. First, in other codes where I use Kwant it seems that MUMPS is working correctly. But overall it is also slower in my venv than in conda.

Regarding numpy with MKL or OpenBLAS, I found a benchmark script on GitHub and it gave me the results below. These show that MKL is about 2x faster for SVD and eigenvalues, which is the type of calls I'm making. While a factor of 2 cannot explain the huge time difference in my main code, I'm actually using even larger matrices (at least 8000x8000). So it could simply be MKL vs OpenBLAS, depending on how this scales with the matrix size. I'll check it soon.

numpy + conda + mkl
-------------------------
Dotted two 4096x4096 matrices in 0.52 s.
Dotted two vectors of length 524288 in 0.04 ms.
SVD of a 2048x1024 matrix in 0.26 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 3.07 s.

numpy + openblas
--------------------
Dotted two 4096x4096 matrices in 0.64 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.46 s.
Cholesky decomposition of a 2048x2048 matrix in 0.11 s.
Eigendecomposition of a 2048x2048 matrix in 5.70 s.

I'll try to run the code measuring time.time() in different parts to track where it's slowing down. Also, I'll try to trim the code into a simple example to see if I can reproduce it with something easier to understand. If I find something, I'll let you know. But since I'm currently trying to finish a paper, I'm sticking with conda for a little longer, until I understand this huge time difference.

Thanks for the attention. I'll reply here as soon as I have more relevant numbers.
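PS: To check how the two backends scale with matrix size, I plan to time the eigenvalue decomposition for increasing sizes, along these lines (just a sketch, to be run once under conda/MKL and once in the venv with OpenBLAS):

import time
import numpy as np

# Time the symmetric eigendecomposition for growing matrix sizes to see
# how the BLAS/LAPACK backend scales.
for n in (1000, 2000, 4000, 8000):
    a = np.random.rand(n, n)
    h = a + a.T          # symmetrize so that eigvalsh applies
    t0 = time.time()
    np.linalg.eigvalsh(h)
    print("n = %d: %.2f s" % (n, time.time() - t0))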