Hello,
I recently started looking at Kwant under the hood, and I have a question for you.
My current understanding is that the wavefunction is defined as a 1D array of complex values, with the orbitals of each site forming a contiguous chunk of the array that can be accessed through book-keeping.
What I was expecting was a simple compilable representation that uses, for example, an array of pointers to tinyarrays:
`ta.array[:] wavefunction` or maybe `std::vector<ta.array> wavefunction`
Accessing the number of orbitals in a given site would then be a simple call to a `size()` member function:
`wavefunction[i].size()`.
The benefit I see in this approach is shorter and more readable code from the developer's point of view, because we could use tinyarray's functions for expectation values of operators, for example for the Hamiltonian:
`ta.dot(ta.conjugate(ta.transpose(wavefunction[i])), ta.dot(syst.hamiltonian(i, j, params), wavefunction[j]))`
If the `*` operator is defined between tinyarrays, and with a member function `dagger`, we could even write:
`wavefunction[i].dagger() * syst.hamiltonian(i, j, params) * wavefunction[j]`
And if one really needs to access individual orbitals, only an additional bracket is needed:
`wavefunction[site][orb]`
Using C++ vectors or Cython arrays of `ta.array`s to represent wave functions would be as fast as the current representation, or am I wrong? What were your reasons against such an approach?
Have a nice day!
Adel
Hi Adel,
Your understanding of the way wave functions (and other observables) are stored is correct. You are also right that working with low-level observables is inconvenient and could be improved. I'll explain how the current situation came about, and offer a possible solution. As always, we're glad to hear comments or suggestions.
I still think that our original design choice to store everything in one flat array was a good one. The alternatives you mention would require much more RAM and would be also significantly slower due to decreased memory locality.
If we were to store a list of tinyarrays, that would require storing one Python object (the tinyarray) per site. For example, if we were to do this for a system with two orbitals per site, that would mean a memory cost of 8 + 56 = 64 bytes per site. With the current approach, we require two complex numbers, that is 2 × 16 = 32 bytes per site. What's more, accessing this memory is no longer sequential, which makes it even slower.
(You could object that finalized builders already store O(N) Python objects, namely the sites and their tags. But (1) that is not true for general Kwant low-level systems, and (2) we are working on solving this problem for finalized builders as well.)
The other solutions that you mention have similar problems. What we could have done is sort the sites according to their number of orbitals, and then for each block of sites store a 2d array of shape (num_sites, num_orbs). But back when the low-level format was established, the number of orbitals of a site was not fixed, and so this could not have been done. The reason why the number was not fixed was not so much a false sense of generality, but rather that we didn't see a good way to specify the number of orbitals per site. We only later had the idea to fix the number of orbitals per site family.
(The above solution would have the inconvenience that the wave function would be no longer a simple array, but rather a list of 2d arrays. This exposes more structure, but it also makes things like normalizing a wave function more difficult.)
Here is how you can efficiently realize the above solution with a couple of lines of code:
Check out the attribute 'site_ranges' of Kwant systems [1]. With this information, you can write a function 'unpack' that takes a system and a numpy array with one entry per orbital (for example a wave function), and returns a list of 2d arrays that, without copying, refer to the data in a way that simplifies working with sites.
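For concreteness, here is a minimal, untested sketch of such a function. It assumes the documented 'site_ranges' format: sorted (first_site, norbs, orb_offset) triples with a final (num_sites, 0, tot_norbs) sentinel entry.

import numpy as np

def unpack(syst, vec):
    # 'site_ranges' is a sorted sequence of (first_site, norbs, orb_offset)
    # triples, with a final (num_sites, 0, tot_norbs) sentinel entry.
    ranges = syst.site_ranges
    if ranges is None:
        raise ValueError("system does not have a fixed number of orbitals")
    vec = np.asarray(vec)
    blocks = []
    for (first, norbs, offset), (next_first, _, next_offset) in zip(
            ranges, ranges[1:]):
        # A (num_sites_in_range, norbs) view into 'vec': no copying.
        blocks.append(vec[offset:next_offset].reshape(next_first - first, norbs))
    return blocks

With this, unpack(syst, psi)[r][k] would be the orbital vector of the k-th site of range r, still referring to the memory of psi.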
If this sounds like a good idea and you write this function, be sure to share it. We might want to include it in Kwant as a method of 'System'.
Cheers,
Christoph
[1] https://kwant-project.org/doc/1/reference/generated/kwant.system.System#kwant.system.System
Hello Christoph,
Thank you for your detailed answer.
I convinced myself with a small C++ benchmark code that summing over the complex values of a vector<vector<complex<double>>> array is more than twice as slow as summing over a flat, dynamically allocated array of the same size (I attached the code to this mail). About the memory usage, I don't understand why it would take significantly more space; maybe it is because I first suggested an array of Python objects instead of compiled objects, which take more space?
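(The effect is easy to reproduce in Python too, where the per-object overhead dominates even more than cache locality; a rough, untested sketch, with timings that will vary by machine:)

import timeit
import numpy as np
import tinyarray as ta

n = 100_000
flat = np.arange(2 * n, dtype=complex)                   # flat layout
boxed = [ta.array(flat[2*i:2*i + 2]) for i in range(n)]  # one object per site

# Sum every orbital amplitude, both ways.
print(timeit.timeit(lambda: flat.sum(), number=100))
print(timeit.timeit(lambda: sum(sum(a) for a in boxed), number=100))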
I agree, and I am interested in coding the solution you are suggesting. I would like to suggest a further development to it: instead of returning a simple list of 2D arrays without copying anything, we use an "accessor" object that would enable matrix products:
- It gets initialised with the system "syst".
- Calling "accessor.wavefunction(bra, site, orbital)" (or some similar syntax) returns "wavefunction[orbital + offset(site)]" with the right offset.
- The matrix inner product is implemented: "accessor.wavefunction(bra, site) * accessor.hamiltonian(site) * accessor.wavefunction(ket, site)" would be a valid expression (or something similar, with the same idea of an apparent matrix-vector or matrix-matrix product).
The third bullet point raises some questions for me though:
- What do you think of it?
- I need to understand how the Hamiltonian is represented in order to think about how to implement this, but I can't grasp what the low-level representation of the Hamiltonian is when the system is finalised. Is it a flat array too? When I look at "_eval_hamiltonian()" in the class "_LocalOperator", I see that it takes the return value of "syst.hamiltonian()", converts it into a Python tinyarray.matrix object, and then converts it once again into a BlockSparseMatrix.
Thank you,
Adel
Hello Christoph,
I have a few additions to make to my previous mail; I hope I am not wasting your time with trivial or useless suggestions.
I have found a C++ library, Armadillo (http://arma.sourceforge.net/), that can do the trick for implementing a fast accessor, with built-in matrix-vector products.
Let's assume we have the wavefunction and the Hamiltonian in flat contiguous arrays. This library provides Row, Column, Matrix and SparseMatrix implementations, with member functions like Hermitian conjugate and transpose, and with operations between them like addition and multiplication. The good thing about it is that one can initialise such objects with pointers to existing memory, without copying; here are the constructors one could use:
For vectors (http://arma.sourceforge.net/docs.html#Col):
vec(ptr_aux_mem, number_of_elements, copy_aux_mem = true, strict = false)
For matrices (http://arma.sourceforge.net/docs.html#Mat):
mat(ptr_aux_mem, n_rows, n_cols, copy_aux_mem = true, strict = false)
So one can wrap such objects in Cython, stack-allocate them with the correct memory pointers and sizes, and then use them to write the desired operator expression in a compact way. In terms of speed, I expect it to be nearly as fast as a "by-hand" implementation, since the expression will be cythonised/compiled, ending up with similar nested for-loops after compilation.
If I understand correctly, matrices such as the Hamiltonian don't really have a low-level representation, since "params" needs to be provided first to create a flat matrix of plain complex values; that would be the point of "_eval_hamiltonian()" in the "_LocalOperator" class: it returns a BlockSparseMatrix, which is basically a flat array containing the matrices of all the needed hoppings, given in the "where" list. I have a question about that:
Would it be just as fast to compute the Hamiltonian little by little while calculating the operator's value on each hopping of the "where" list? To show what I mean exactly, I have a code example at https://gist.github.com/AdelKS/e1dc8ab625c96138ee63e24d0a0ad4f2 (I couldn't test it yet, unfortunately) using the "tinyarray" Python package. The idea would then be to use Armadillo objects instead of tinyarray, and have the code compilable. This approach would have the benefit of not using "_eval_hamiltonian()" or the "BlockSparseMatrix" class, but would need the extra Armadillo wrapping code, plus mapping the return value of "syst.hamiltonian" to a Matrix object.
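(To spell the idea out without Armadillo, here is a rough, untested numpy sketch; 'site_slice' and 'hopping_terms' are hypothetical names, and physical prefactors are left out:)

import numpy as np

def site_slice(syst, site):
    # Hypothetical helper: the slice of flat orbital indices of 'site',
    # found by scanning the (first_site, norbs, orb_offset) site_ranges.
    for (first, norbs, offset), (next_first, _, _) in zip(
            syst.site_ranges, syst.site_ranges[1:]):
        if first <= site < next_first:
            start = offset + (site - first) * norbs
            return slice(start, start + norbs)
    raise ValueError("site out of range")

def hopping_terms(syst, bra, ket, where, params):
    # One term per hopping: bra_i^dagger H_ij ket_j, evaluating each
    # Hamiltonian block on the fly (no BlockSparseMatrix built up front).
    out = np.empty(len(where), complex)
    for k, (i, j) in enumerate(where):
        H_ij = np.atleast_2d(np.asarray(syst.hamiltonian(i, j, params=params)))
        out[k] = bra[site_slice(syst, i)].conjugate() @ H_ij @ ket[site_slice(syst, j)]
    return out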
Thank you,
Adel KARA SLIMANE
Hi Adel,
I read through this thread and I'm still not really sure what is being proposed. In the first email in the thread it seems as if you are proposing a wrapper for wavefunctions so that they can be indexed by site. It seems that you want to be able to write:
psi(i).dagger() * syst.hamiltonian(i, j, params) * psi(j)
This is a fine proposal. I could imagine that a simple implementation for such a wrapper would not need to use any exotic storage formats, C-level accessors or anything: finalized systems already contain an attribute 'site_ranges' that allows for constant-time access to the orbital offset of a site. I actually proposed such a wrapper some time ago, but my proposal was rejected [1] because it did not give vectorized access to the wavefunction elements. It is not obvious to me how to give vectorized access without copying.
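For illustration, here is a minimal, untested sketch of such a wrapper (the names are hypothetical; it assumes the documented 'site_ranges' format of sorted (first_site, norbs, orb_offset) triples):

import bisect
import numpy as np

class SiteIndexed:
    """Wrap a flat wave function so that psi(i) is the orbital block of
    site i, as a view into the flat array (no copying)."""

    def __init__(self, syst, psi):
        self.psi = np.asarray(psi)
        # Sorted (first_site, norbs, orb_offset) triples.
        self.ranges = [tuple(r) for r in syst.site_ranges]

    def __call__(self, site):
        # Rightmost range whose first_site is <= site, found in O(log n).
        i = bisect.bisect(self.ranges, (site + 1,)) - 1
        first_site, norbs, offset = self.ranges[i]
        start = offset + (site - first_site) * norbs
        return self.psi[start:start + norbs]

With numpy one would then write, e.g., psi(i).conjugate() @ np.asarray(syst.hamiltonian(i, j, params=params)) @ psi(j); numpy has no 'dagger' method, but for a 1d block, conjugation plus '@' does the job.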
Then after this proposal the thread seems to start discussing _LocalOperator (which is an internal detail that was never meant to be extended or user-facing) and lots of details about implementation strategies in C++ for *something*, but it is not clear to me what the end goal is. Perhaps you want to be able to more easily express arbitrary operators? Or are you proposing a refactoring of the kwant operators module to make it more readable? A clarification would be very useful for me.
Thanks,
Joe
[1]: https://gitlab.kwant-project.org/kwant/kwant/merge_requests/47
Hello Joseph,
I apologise for not being clear enough; I will try to remedy this in this email. I understand it is very plausible that everything I suggest has been thought of before, and perhaps set aside for good reasons.
I read through this thread and I'm still not really sure what is being proposed. In the first email in the thread it seems as if you are proposing a wrapper for wavefunctions so that they can be indexed by site. It seems that you want to be able to write:
psi(i).dagger() * syst.hamiltonian(i, j, params) * psi(j)
Yes, that is what I want. My end goal is to make writing (additional) Kwant/tkwant operators easier by writing formulas closer to the mathematical expression: my group is working on implementing additional operators in tkwant for energy and heat transport.
To implement what I am suggesting, three things are required:
- Have a wrapper that makes the wave functions accessible by site, and implement the "dagger()" method.
- Have the Hamiltonian accessible by site; this is already the case with "syst.hamiltonian()", but is it slow? It seems to be used indirectly by the "Current" class for example, so using "syst.hamiltonian()" directly shouldn't be any slower.
- Operator overloading: make Cython understand what the "*" operator means between "psi(i).dagger()" and "syst.hamiltonian(i, j, params)".
I understood that point 1 was settled and could be done: no Kwant code is changed, and only an additional wrapper is implemented, to be used when it comes in handy, for example when writing Kwant operators. Is that what was rejected and what you tried implementing? @Groth seemed to be okay with it in his first reply (on 10 May, last paragraph), or I may have misunderstood him...
It is in trying to come up with a concrete proposal for implementing point 3, while staying as fast as the current implementation of, for example, "_operate" in the "Current" class (nested for-loops over contiguous arrays, everything cythonised/compiled; the only slow/Python part seems to be evaluating the Hamiltonian with "params"), that I came up with the suggestion to use the C++ library Armadillo, with its ready-made matrix and vector classes and matrix-vector products. I can see three good points about it:
- Its vector and matrix classes can be instantiated with pointers to already existing data, i.e. the existing wave function and Hamiltonian in Kwant. So no memory copy.
- I assume that writing "psi(i).dagger() * syst.hamiltonian(i, j, params) * psi(j)" with Armadillo objects in a Cython file will get compiled, so I expect performance similar to the current approach with nested for-loops.
- It may be possible to take advantage of the library's optimisations.
One, maybe huge, negative point is that it would add an additional dependency to Kwant for something that can be implemented by hand in a relatively short time (matrix-vector products). But maybe we can use its optimisations and not bother with reimplementing them: apparently, writing "trace(A * B * C * D)" in this library is optimised so that it does not naively compute the matrix product first and then take the trace of the result, which keeps "trace(A * B * C * D)" as fast as a nested-for-loops implementation.
I hope this makes my thinking clearer.
Thank you,
Adel
Adel Kara Slimane wrote:
Joseph Weston wrote:
I read through this thread and I'm still not really sure what is being proposed. In the first email in the thread it seems as if you are proposing a wrapper for wavefunctions so that they can be indexed by site. It seems that you want to be able to write:
psi(i).dagger() * syst.hamiltonian(i, j, params) * psi(j)
Yes, that is what I want. My end goal is to make writing (additional) Kwant/tkwant operators easier by writing formulas closer to the mathematical expression: my group is working on implementing additional operators in tkwant for energy and heat transport.
I have the impression that we first need to talk about what you would like to achieve and not get lost in the details.
(Specifically, it seems premature to talk about anything like the Armadillo library. This is one of many C++ libraries for linear algebra. You seem to be interested in users of Kwant, that is, people who write Python scripts and perhaps Cython scripts. How would a C++ library help here?)
I have the impression that you'd like to be able to write code like the above in Python (or Cython?). You'd like the code to run fast and be easy to read (and write). Is this correct? Can you please describe your concrete problem, perhaps even with code examples?
I am suggesting a possible improvement (at least from my current point of view) to how 'operators' are written in Kwant; by 'operators' I mean the classes "Current", "Density" and "Source" in the file "kwant/operator.pyx". So this proposal will not affect end-users of Kwant, only Kwant developers and code contributors. This is of interest to my group because we are currently working on implementing additional operators for tkwant, similar to "Current", "Density" and "Source".
My suggestion is to be able to write the "Current" class as in the attached file "suggestion.pyx" instead of what's currently done (an extract is in "current-implementation.pyx"), while still being as fast.
The same files are available here: https://gist.github.com/AdelKS/e1dc8ab625c96138ee63e24d0a0ad4f2 (with code highlighting)
Adel.
Adel Kara Slimane wrote:
About the memory usage, I don't understand why it would take significantly more space; maybe it is because I first suggested an array of Python objects instead of compiled objects, which take more space?
A Python list of 1000 tinyarrays of two numbers each takes up much more space than an array (numpy or tinyarray or array from the Python stdlib) because it involves 1000 Python objects that each store three pointer-sized data fields. Moreover, these 1000 chunks of memory are allocated separately and this requires storing additional data and may lead to additional waste due to holes.
The situation for a vector<vector<complex<double>>> in C++ is not fundamentally different. Each vector instance stores its size and a pointer to the allocated memory. And the memory allocation also requires memory for metadata and likely causes waste.
The slowness is due to the fact that more memory needs to be read, and that this memory is not contiguous. Because of the way L1 and L2 caches work, computers are fastest when reading memory contiguously.
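(A small sketch to make the overhead visible; the exact numbers are platform- and version-dependent, so treat them as illustrative only:)

import sys
import numpy as np
import tinyarray as ta

n = 1000
flat = np.zeros(2 * n, complex)                 # two complex values per site
boxed = [ta.array([0j, 0j]) for _ in range(n)]  # one Python object per site

print(flat.nbytes / n)                          # 32.0 bytes of payload per site
# Per site: one pointer-sized list slot plus a separately allocated object.
print((sys.getsizeof(boxed) + sum(sys.getsizeof(a) for a in boxed)) / n)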
I agree and I am interested in coding the solution you are suggesting. I would like to suggest a further development to it: instead of returning a simple list of 2D arrays without copying anything, we use an "accessor" object that would enable matrix products:
1. It gets initialised with the system "syst".
2. Calling "accessor.wavefunction(bra, site, orbital)" (or some similar syntax) returns "wavefunction[orbital + offset(site)]" with the right offset.
3. The matrix inner product is implemented: "accessor.wavefunction(bra, site) * accessor.hamiltonian(site) * accessor.wavefunction(ket, site)" would be a valid expression (or something similar, with the same idea of an apparent matrix-vector or matrix-matrix product).
So you suggest introducing a separate object, the "accessor", that is initialized with a system?
First of all, why a separate object? After all, the system itself, which already knows its 'site_ranges', could take over this work.
I will reply in another message for the rest.
Christoph
Thank you for your clarifications.
Also, I obtained an interesting result while trying to come up with a minimal benchmark example for representing the wavefunction: summing over the values of a vector<vector<complex<double>>> takes a similar amount of time as summing over its flattened counterpart, if the sum over the flat array needs to be aware of the orbitals of each site, i.e. doing something like this:
vector<complex<double>> wavefunction;
vector<int> offsets;  // offsets[site] = index of the site's first orbital,
                      // plus one final entry holding the total orbital count

// ...
// fill the wavefunction and offsets
// ...

int num_sites = offsets.size() - 1;
complex<double> result = 0;
for (int site = 0; site < num_sites; site++)
    for (int orbital = 0; orbital < offsets[site + 1] - offsets[site]; orbital++)
        result += wavefunction[offsets[site] + orbital];
I attached the code I wrote that gave me this result. It may be due to under-the-hood compiler optimisations.
I will answer your question about the "accesser" in my answer to your second email.
Adel