GPU speedup via PyCUDA
Hello, My name is Abdul, a research student, and I’ve been using kwant to simulate transmission in quantum devices. I’ve been thinking about speeding up the process by using GPU matrix operations under PyCUDA or something similar, and was wondering if this would speed things at all? Many systems that I’m studying take hours to calculate transmission, and It would be great if this can be reduced. I appreciate any help, especially with where sparse matrix operations take place in the source code. Kwant is a very useful piece of software, and making it faster would make my life much easier :) Best regards, Abdul Afandi
Hi Abdul, Neither me, nor the rest of the Kwant team have spent much time looking into the issue. The breakdown of how GPU can be used to speed up the calculation is as follows: solving a scattering problem has a step of computing modes in the leads, and solving the sparse linear system. The asymptotic scaling of both parts is the same, and the prefactors seem to be comparable as well, at least for 2D problems with few orbitals and low degree. The repetitive modes calculation can often be avoided by precalculating the modes and storing the result if the energy or the system parameters don't change. This calculation spends most time either in solving an eigenproblem or a more complex generalized Schur decomposition. As far as I can tell, the former is available in MAGMA, the latter hasn't been ported to GPUs yet. Finally speeding up the solution of the sparse problem also seems possible on GPU's, but it's a matter of ongoing research, quick googling yielded this for example: https://www.oerc.ox.ac.uk/sites/default/files/uploads/sparse_direct_gpu.pdf A person with sufficient time and technical knowledge could try and replace the linear algebra currently used in Kwant with the GPU-accelerated ones, and this likely would result in a speedup. Hope this helps, Anton On Sun, Jan 24, 2016 at 8:35 PM, Abdulkareem Afandi <abdulkareem.afandi@gmail.com> wrote:
Hello,
My name is Abdul, a research student, and I’ve been using kwant to simulate transmission in quantum devices. I’ve been thinking about speeding up the process by using GPU matrix operations under PyCUDA or something similar, and was wondering if this would speed things at all? Many systems that I’m studying take hours to calculate transmission, and It would be great if this can be reduced.
I appreciate any help, especially with where sparse matrix operations take place in the source code. Kwant is a very useful piece of software, and making it faster would make my life much easier :)
Best regards, Abdul Afandi
Hi,
The repetitive modes calculation can often be avoided by precalculating the modes and storing the result if the energy or the system parameters don't change. This calculation spends most time either in solving an eigenproblem or a more complex generalized Schur decomposition. As far as I can tell, the former is available in MAGMA, the latter hasn't been ported to GPUs yet.
Isn"t the QR algorithm basically a Schur decomposition? Then it would be in Magma as well: http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__zgeqp3__comp...
A person with sufficient time and technical knowledge could try and replace the linear algebra currently used in Kwant with the GPU-accelerated ones, and this likely would result in a speedup.
last year I started writing a Python binding for Magma (using cython), which I never completed beyond the 3 routines I needed at that time, but it could be a good starting point to include Magma into python. If anybody is interested I could share that code. Best, Max
Hi Max, Did you notice a difference in processing time? Which routines did you do this for? It would be great if you could share the code, as I don't know where to start. Best, Abdul On 25 Jan 2016 10:06 am, "Maximilian Trescher" < maximilian.trescher@fu-berlin.de> wrote:
Hi,
The repetitive modes calculation can often be avoided by precalculating the modes and storing the result if the energy or the system parameters don't change. This calculation spends most time either in solving an eigenproblem or a more complex generalized Schur decomposition. As far as I can tell, the former is available in MAGMA, the latter hasn't been ported to GPUs yet.
Isn"t the QR algorithm basically a Schur decomposition? Then it would be in Magma as well:
http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__zgeqp3__comp...
A person with sufficient time and technical knowledge could try and replace the linear algebra currently used in Kwant with the GPU-accelerated ones, and this likely would result in a speedup.
last year I started writing a Python binding for Magma (using cython), which I never completed beyond the 3 routines I needed at that time, but it could be a good starting point to include Magma into python. If anybody is interested I could share that code.
Best, Max
Hi Max,
Isn"t the QR algorithm basically a Schur decomposition? Then it would be in Magma as well: http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__zgeqp3__comp...
When the problem is well-conditioned, even a regular eigenproblem is used; however in the worst case scenario we make use of QZ factorization [1], which doesn't seem to be in MAGMA yet. Best, Anton [1]: https://en.wikipedia.org/wiki/Schur_decomposition#Generalized_Schur_decompos...
A person with sufficient time and technical knowledge could try and replace the linear algebra currently used in Kwant with the GPU-accelerated ones, and this likely would result in a speedup.
last year I started writing a Python binding for Magma (using cython), which I never completed beyond the 3 routines I needed at that time, but it could be a good starting point to include Magma into python. If anybody is interested I could share that code.
Best, Max
Anton Akhmerov wrote:
Finally speeding up the solution of the sparse problem also seems possible on GPU's, but it's a matter of ongoing research, quick googling yielded this for example: https://www.oerc.ox.ac.uk/sites/default/files/uploads/sparse_direct_gpu.pdf
Another approach (that is not ongoing research) is to use the classic recursive Green’s function algorithm (in the case of Kwant one would want to use the multi-terminal version as described in Part I of Michael Wimmer’s thesis [1]). We never worked on that because the MUMPS solver has been good enough for us so far, but if someone is motivated, we would be happy to lend a hand. Kwant actually includes Michael’s graph-slicing code. I should have a simple RGF solver for Kwant lying around somewhere. It should be not much work to port it to MAGMA. Christoph [1] http://epub.uni-regensburg.de/12142/
participants (4)
-
Abdulkareem Afandi
-
Anton Akhmerov
-
Christoph Groth
-
Maximilian Trescher