Hello Christoph,

I have a few additions to make to my previous mail; I hope I am not wasting your time with trivial or useless suggestions.

I have found a C++ library, Armadillo, that can do the trick for implementing a fast accessor with built-in matrix-vector products.

Let's assume we have the wavefunction and the Hamiltonian in flat contiguous arrays. This library provides Row, Col, Mat and SpMat classes (row vectors, column vectors, dense and sparse matrices), with member functions like Hermitian conjugate and transpose, and with operations between them like addition and multiplication. The good thing about it is that one can initialise such objects with pointers to existing memory, without copying. Here are the constructors that one could use:

For vectors:
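    // Armadillo's documented "advanced" constructor for (complex) vectors:
    cx_vec(ptr_aux_mem, number_of_elements, copy_aux_mem = true, strict = false)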
For matrices:
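    // ... and for matrices; with copy_aux_mem = false the object becomes a
    // view on the given memory, and strict = true pins its size to it:
    cx_mat(ptr_aux_mem, n_rows, n_cols, copy_aux_mem = true, strict = false)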
So one can wrap such objects in Cython, stack-allocate them with the correct memory pointers and sizes, and then use them to write the wanted operator expression in a compact way. In terms of speed, I expect it to be nearly as fast as a "by-hand" implementation, since the expression will be cythonised/compiled, ending up as similar nested for-loops after compilation.
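For concreteness, here is an untested C++ sketch of the kind of compact expression this would enable (the function and variable names are made up; the pointers would come from Kwant's flat arrays):

    #include <armadillo>
    #include <complex>

    // Untested sketch: wrap existing flat buffers without copying and
    // evaluate <bra| H |ket> for one hopping as one compact expression.
    std::complex<double> local_term(std::complex<double>* bra,
                                    std::complex<double>* h,
                                    std::complex<double>* ket,
                                    arma::uword norbs)
    {
        arma::cx_vec b(bra, norbs, false, true);       // views, no copies
        arma::cx_mat H(h, norbs, norbs, false, true);
        arma::cx_vec k(ket, norbs, false, true);
        return arma::as_scalar(b.t() * H * k);         // .t() = Hermitian conjugate
    }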

If I understand well, matrices, for example the Hamiltonian, don't really have a low-level representation, since "params" needs to be provided first to create a flat matrix of only complex values. That would be the point of "_eval_hamiltonian()" in the "_LocalOperator" class: it returns a BlockSparseMatrix, which is basically a flat array containing all the matrices of all the needed hoppings given in the "where" list. I have a question about that:

Would it be as fast to compute the Hamiltonian little by little while calculating the operator's value on each hopping of the "where" list? To show what I mean exactly, I have a code example (I couldn't test it yet, unfortunately) using the "tinyarray" Python package; a rough sketch of the idea follows below. The idea would then be to use Armadillo objects instead of tinyarray, and have the code compilable. This approach would have the benefit of not using "_eval_hamiltonian()" nor the "BlockSparseMatrix" class, but it would need the extra Armadillo wrapping code, plus mapping the return value of "syst.hamiltonian" to an Armadillo Mat object.
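Here is an untested numpy sketch of that idea (tinyarray or wrapped Armadillo objects would slot in the same way; the "site_slice" helper is hypothetical and built on "syst.site_ranges"):

    import numpy as np
    from bisect import bisect

    def site_slice(syst, site):
        # Hypothetical helper: map a site index to its orbital slice using
        # syst.site_ranges, whose entries are (first_site, norbs,
        # first_orbital) triples.
        ranges = syst.site_ranges
        first_site, norbs, first_orb = ranges[bisect(ranges, (site, np.inf, np.inf)) - 1]
        offset = first_orb + (site - first_site) * norbs
        return slice(offset, offset + norbs)

    def expectation(syst, bra, ket, where, params=None):
        # Untested sketch: accumulate <bra_i| H_ij |ket_j> hopping by
        # hopping, calling syst.hamiltonian directly instead of first
        # building a BlockSparseMatrix with _eval_hamiltonian().
        acc = 0
        for i, j in where:
            h_ij = np.asarray(syst.hamiltonian(i, j, params=params))
            acc += np.vdot(bra[site_slice(syst, i)], h_ij @ ket[site_slice(syst, j)])
        return acc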


Thank you,

Adel KARA SLIMANE

On Mon, 13 May, 2019 at 4:38 PM, Adel Kara Slimane <adel.kara-slimane@cea.fr> wrote:
Hello Christoph,

Thank you for your detailed answer.

I convinced myself with a small C++ benchmark code that summing over the complex values of a vector<vector<complex<double>>> array is more than twice as slow as summing over a flat, dynamically allocated array of the same size (I attached the code to this mail). About the memory usage, I don't understand why it would take significantly more space; maybe it is because I first suggested an array of Python objects instead of compiled objects, which take more space?
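Roughly, the comparison looks like this (an untested sketch, not the attached code):

    #include <chrono>
    #include <complex>
    #include <iostream>
    #include <vector>

    // Sum the same number of complex values stored in a nested
    // vector<vector<...>> and in one flat contiguous vector.
    int main()
    {
        const std::size_t n = 1 << 12, m = 1 << 8;
        std::vector<std::vector<std::complex<double>>> nested(
            n, std::vector<std::complex<double>>(m, {1.0, 0.0}));
        std::vector<std::complex<double>> flat(n * m, {1.0, 0.0});

        auto t0 = std::chrono::steady_clock::now();
        std::complex<double> s1 = 0;
        for (const auto& row : nested)
            for (const auto& z : row) s1 += z;
        auto t1 = std::chrono::steady_clock::now();

        std::complex<double> s2 = 0;
        for (const auto& z : flat) s2 += z;
        auto t2 = std::chrono::steady_clock::now();

        std::cout << "nested: " << std::chrono::duration<double>(t1 - t0).count()
                  << " s (" << s1 << ")\n"
                  << "flat:   " << std::chrono::duration<double>(t2 - t1).count()
                  << " s (" << s2 << ")\n";
    }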

I agree, and I am interested in coding the solution you are suggesting. I would like to suggest a further development to it: instead of returning a simple list of 2D arrays without copying anything, we use an "accessor" object that would enable matrix products:
  1. It gets initialised with the system "syst"
  2. Calling "accessor.wavefunction(bra, site, orbital)" (or some similar syntax) returns "wavefunction[orbital + offset(site)]" with the right offset.
  3. The matrix inner product is implemented: "accessor.wavefunction(bra, site) * accessor.hamiltonian(site) * accessor.wavefunction(ket, site)" would be a valid expression (or something similar, with the same idea of having an apparent matrix-vector or matrix-matrix product).
The third bullet point raises some questions for me, though:
  • What do you think of it?
  • I need to understand how the Hamiltonian is represented to be able to think of possibilities to implement that, but I can't grasp the low-level representation of the Hamiltonian once the system is finalised. Is it a flat array too? When I look at "_eval_hamiltonian()" in the "_LocalOperator" class, I see that it uses the return value of "syst.hamiltonian()", converts it into a Python tinyarray.matrix object, and then converts it once again into a BlockSparseMatrix.

Thank you,

Adel



On Fri, 10 May, 2019 at 6:15 PM, Christoph Groth <christoph.groth@cea.fr> wrote:
Hi Adel,

Your understanding of the way wave functions (and other observables) are stored is correct. You are also right that working with low-level observables is inconvenient and could be improved. I'll explain how the current situation came about, and offer a possible solution. As always, we're glad to hear comments or suggestions.

I still think that our original design choice to store everything in one flat array was a good one. The alternatives you mention would require much more RAM and would also be significantly slower due to decreased memory locality. If we were to store a list of tinyarrays, that would require storing one Python object (the tinyarray) per site. For example, if we were to do this for a system with two orbitals per site, that would mean a memory cost of 8 + 56 = 64 bytes per site. With the current approach, we require two complex numbers, that is 32 bytes per site. What's more, accessing this memory is no longer sequential, which makes it even slower.

(You could object that finalized builders already store O(N) Python objects, namely the sites and their tags. But (1) that is not true for general Kwant low-level systems, and (2) we are working on solving this problem for finalized builders as well.)

The other solutions that you mention have similar problems.

What we could have done is sort the sites according to their number of orbitals, and then for each block of sites store a 2d array of shape (num_sites, num_orbs). But back when the low-level format was established, the number of orbitals of a site was not fixed, and so this could not have been done. The reason why the number was not fixed was not so much a false sense of generality, but rather that we didn't see a good way to specify the number of orbitals per site. We only later had the idea to fix the number of orbitals per site family.

(The above solution would have the inconvenience that the wave function would no longer be a simple array, but rather a list of 2d arrays. This exposes more structure, but it also makes things like normalizing a wave function more difficult.)

Here is how you can efficiently realize the above solution with a couple of lines of code: check out the attribute 'site_ranges' of Kwant systems [1]. With this information, you can write a function 'unpack' that takes a system and a numpy array with one entry per orbital (for example a wave function), and returns a list of 2d arrays that, without copying, refer to the data in a way that simplifies working with sites.

If this sounds like a good idea and you write this function, be sure to share it. We might want to include it into Kwant as a method of 'System'.

Cheers,
Christoph

[1] https://kwant-project.org/doc/1/reference/generated/kwant.system.System#kwant.system.System
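For reference, a minimal sketch of the 'unpack' function described above (untested; it assumes, per the site_ranges documentation, that the entries are (first_site, norbs, first_orbital) triples terminated by a (num_sites, 0, tot_norbs) sentinel):

    import numpy as np

    def unpack(syst, array):
        # Split a flat per-orbital array into one 2d view per site range,
        # of shape (number of sites in the range, norbs), without copying.
        ranges = list(syst.site_ranges)
        return [np.asarray(array)[orb:next_orb].reshape(next_site - site, norbs)
                for (site, norbs, orb), (next_site, _, next_orb)
                in zip(ranges, ranges[1:])]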