Interpolation is slow
Hello,

We're dealing with a 5000 x 5000 site system, and the time to interpolate the densities and currents using kwant.plotter.interpolate_density and kwant.plotter.interpolate_current is by far the rate-limiting step in the calculation:

System assembly: 15 mins
Computing scattering states: 3 hrs
Interpolating density and current (for use in plotting): 6 hrs!

Is there any way to speed up the interpolation routines? I see that the present implementation in plotter.py involves a for loop (in python) over each site/hopping. Perhaps a vectorized implementation may help?

Thanks,
Mani
Mani Chandra wrote:
We're dealing with a 5000 x 5000 site system, and the time to interpolate the densities and currents using kwant.plotter.interpolate_density and kwant.plotter.interpolate_current is by far the rate-limiting step in the calculation:
You must be applying Kwant’s interpolator to the biggest system anyone has ever tried! Why are you interpolating such a huge system? What are you trying to achieve? What is the width of the interpolation “bumps” that you are using and what is the size of your output array? (Or in other words: how are you calling the interpolator?)
Is there any way to speed up the interpolation routines? I see that the present implementation in plotter.py involves a for loop (in python) over each site/hopping.
Are you using streamplot or just interpolating currents/densities? Kwant’s streamplot routine piggybacks on matplotlib’s streamplot and in a typical use case (e.g. the attached script) matplotlib’s streamplot will take half of the time that Kwant’s interpolation routine takes. So even if the interpolator produced its results instantly, current plotting would become only three times faster in such applications.

The interpolator could be optimized further in several ways - as always it’s a compromise between development effort and speed. But the application that it was created for (i.e. figures for publications) tends to involve systems of size 100^2 or so and then its speed is quite satisfactory (see attached example).

Creating figures from much larger systems only makes sense when the interpolation is smoothing things out a lot (i.e. the “bumps” are much larger than a hopping), otherwise there would be too much detail in the figure. In this regime the current implementation will be inefficient because its vectorization (see further below) will be over small subarrays.
Perhaps a vectorized implementation may help?
Actually, it is already vectorized - otherwise it would be much slower! The main loop inside _interpolate_field() goes indeed sequentially over all hoppings/densities, but for each such hopping/density the output field in its vicinity is updated using vectorized numpy operations.

A straightforward brute-force way to speed up the interpolation would be to split up the work of that loop over multiple cores - each would have its own output array, and in the end they would be all summed together. That should be quite easy to do [1].

A complementary approach would be to rewrite the inner loop in a fast compiled language. This would help especially in the case where the interpolation bump does not cover many points of the output array and as such the existing vectorization is not very good (the involved subarrays are small).

Yet another approach would be to create a special implementation for the common case of a rectangular lattice that is commensurate with the output array of the interpolator.

All three approaches could be combined. To go further, someone who is really motivated could even implement the interpolation algorithm on GPUs... If you or someone else would like to work on this I would be happy to guide you.

Cheers
Christoph

[1] https://medium.com/analytics-vidhya/using-numpy-efficiently-between-processe...
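[Editor's note: as an illustration of the multi-core idea, here is a minimal sketch with made-up helper names, not Kwant's actual code. Each worker process accumulates its share of the hoppings into a private output array, and the partial fields are summed at the end.]

import numpy as np
from multiprocessing import Pool

def partial_field(hoppings, shape):
    # Stand-in for the per-hopping work inside _interpolate_field():
    # accumulate this worker's chunk of hoppings into a private array.
    out = np.zeros(shape)
    for (i, j), value in hoppings:
        out[i, j] += value  # the real loop adds a whole kernel-sized patch
    return out

def interpolate_parallel(hoppings, shape, nworkers=4):
    chunks = [hoppings[k::nworkers] for k in range(nworkers)]
    with Pool(nworkers) as pool:
        parts = pool.starmap(partial_field, [(c, shape) for c in chunks])
    return np.sum(parts, axis=0)  # sum the per-worker output arrays

(On platforms that spawn fresh processes, the call to interpolate_parallel would go under an if __name__ == '__main__': guard.)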
Hi Christoph, Thank you for your detailed reply!
Why are you interpolating such a huge system? What are you trying to achieve? What is the width of the interpolation “bumps” that you are using and what is the size of your output array? (Or in other words: how are you calling the interpolator?)
We want to examine quantities on scales much larger than the lattice scale, and average over lattice scale features. See attached plot of the density in our simulation of transverse magnetic focussing. The left is the raw density data and the right is the interpolated data with the relwidth parameter set to 0.05. Evidently, the focussing peaks really stand out in the interpolated plot. I'm using the interpolator as follows:

density = sum(density_operator(mode) for mode in modes)
density_interpolated, boundaries = kwant.plotter.interpolate_density(sys, density, relwidth=0.05)
Is there any way to speed up the interpolation routines? I see that the present implementation in plotter.py involves a for loop (in python) over each site/hopping.
Are you using streamplot or just interpolating currents/densities? Kwant’s streamplot routine piggybacks on matplotlib’s streamplot and in a typical use case (e.g. the attached script) matplotlib’s streamplot will take half of the time that Kwant’s interpolation routine takes. So even if the interpolator produced its results instantly, current plotting would become only three times faster in such applications.
Just interpolating currents and densities and then plotting at a later stage.
The interpolator could be optimized further in several ways - as always it’s a compromise between development effort and speed. But the application that it was created for (i.e. figures for publications) tends to involve systems of size 100^2 or so and then its speed is quite satisfactory (see attached example).
Creating figures from much larger systems only makes sense when the interpolation is smoothing things out a lot (i.e. the “bumps” are much larger than a hopping), otherwise there would be too much detail in the figure.
In this regime the current implementation will be inefficient because its vectorization (see further below) will be over small subarrays.

I think you mean something like the plot I attached?
Perhaps a vectorized implementation may help?
Actually, it is already vectorized - otherwise it would be much slower! The main loop inside _interpolate_field() goes indeed sequentially over all hoppings/densities, but for each such hopping/density the output field in its vicinity is updated using vectorized numpy operations.
A straightforward brute-force way to speed up the interpolation would be to split up the work of that loop over multiple cores - each would have its own output array, and in the end they would be all summed together. That should be quite easy to do [1].
A complementary approach would be to rewrite the inner loop in a fast compiled language. This would help especially in the case where the interpolation bump does not cover many points of the output array and as such the existing vectorization is not very good (the involved subarrays are small).
Do you think this can be achieved using Cython?
Yet another approach would be to create a special implementation for the common case of a rectangular lattice that is commensurate with the output array of the interpolator.
We're indeed using a rectangular lattice!
All three approaches could be combined. To go further, someone who is really motivated could even implement the interpolation algorithm on GPUs...
That would be a bit much. I think a more optimized CPU implementation will already solve the problem.
If you or someone else would like to work on this I would be happy to guide you.
Thanks! I can try Cythonizing or vectorizing the loop. More generally, what do you think about simply using a filtering method (say gaussian filter) in scipy.ndimage ( https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.ndimage...) or one of the many in scipy.signal ( https://docs.scipy.org/doc/scipy/reference/reference/signal.html)? Are you using a special kind of filter? Cheers, Mani
Mani Chandra wrote:
We want to examine quantities on scales much larger than the lattice scale, and average over lattice scale features. See attached plot of the density in our simulation of transverse magnetic focussing. The left is the raw density data and the right is the interpolated data with the relwidth parameter set to 0.05. Evidently, the focussing peaks really stand out in the interpolated plot. I'm using the interpolator as follows:
density = sum(density_operator(mode) for mode in modes)
density_interpolated, boundaries = kwant.plotter.interpolate_density(sys, density, relwidth=0.05)
Smoothing out fine detail is a legitimate application of the interpolation algorithm. However, you use it at quite a massive scale. Your input array being 5000^2 in size and relwidth=0.05 means that the convolution kernel is 250^2 in size.

The way the interpolation algorithm works means that for each hopping the 250^2 elements of the convolution kernel f(r) = (1-r^2)^2 Θ(1-r^2) are computed. (Remember that it’s a generic algorithm and works equally well for any amorphous “lattice”.) You seem to have 5000^2 horizontal hoppings and as many vertical ones. As a rough estimate (neglecting any border effects) this boils down to 2 * 5000^2 * 250^2 = 3.1e12 evaluations of the kernel! Each involves multiple mathematical operations and memory accesses, but they’re vectorized with NumPy in batches of 250^2.

You said that this computation takes 6 h. Let’s assume that this is on a 3 GHz CPU core. Then each evaluation takes some 20 CPU clock cycles, which is not that bad actually - given that this is pure Python. The NumPy vectorization is apparently doing its job. An optimal CPU implementation of exactly the same generic algorithm would still take a few CPU cycles per evaluation, so it won’t help you much. A GPU would be well suited for such regular work, but we both agree that it would be overkill, and as we’ll see later there’s probably a more intelligent way to do long-range smoothing.
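[Editor's note: the back-of-the-envelope numbers above can be reproduced in a few lines, assuming, as in the text, a 3 GHz core and the 6 h runtime.]

evaluations = 2 * 5000**2 * 250**2  # hoppings times kernel elements, ~3.1e12
cycles = 6 * 3600 * 3e9             # 6 h at 3 GHz
print(cycles / evaluations)         # ~21 CPU clock cycles per evaluation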
A complementary approach would be to rewrite the inner loop in a fast compiled language. This would help especially in the case where the interpolation bump does not cover many points of the output array and as such the existing vectorization is not very good (the involved subarrays are small).
Do you think this can be achieved using Cython?
It can, but your application won’t benefit from it. Cython would help for use cases where the convolution kernel is quite small. In your case of a huge kernel, the CPU, thanks to NumPy, already spends most of its time in highly optimized BLAS.
Yet another approach would be to create a special implementation for the common case of a rectangular lattice that is commensurate with the output array of the interpolator.
We're indeed using a rectangular lattice!
Yes, so this might really help in your case. One would split up all the hoppings into classes where the hoppings are geometrically equivalent modulo the output lattice. In your case there would be only two such classes: one for horizontal hoppings and one for vertical ones. Then, for each hopping class, one would precompute the kernel and apply it to all the hoppings in the class. I guess that this would provide a significant speedup (hopefully 5 times or more) without even abandoning pure Python for the implementation. And this could be combined with parallelization among CPU cores.
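[Editor's note: to make the idea concrete, here is a minimal sketch with hypothetical names, not Kwant code, of the inner loop once the kernel of one hopping class has been precomputed. The per-hopping work reduces to adding a scaled copy of the same patch.]

import numpy as np

def add_hopping_class(out, kernel, offsets, values):
    # out: output field; kernel: precomputed bump, identical for the whole
    # class; offsets: integer grid position of each hopping's patch;
    # values: the density/current carried by each hopping.
    kh, kw = kernel.shape
    for (i, j), v in zip(offsets, values):
        out[i:i + kh, j:j + kw] += v * kernel  # no kernel re-evaluation
    return out

The expensive evaluation of the bump function then happens once per class instead of once per hopping.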
More generally, what do you think about simply using a filtering method (say gaussian filter) in scipy.ndimage ( https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.ndimage...) or one of the many in scipy.signal ( https://docs.scipy.org/doc/scipy/reference/reference/signal.html)? Are you using a special kind of filter?
Kwant’s interpolation scheme was designed such that

• it works for arbitrary irregular hopping graphs,
• and that it is physically correct (current is conserved).

Visually, the interpolation scheme can be imagined as follows: Initially, a current field is established that is only non-zero along straight lines that correspond to hoppings. These lines are infinitely thin (they have a delta-function-like cross section). Then, in a second step, the current field is smoothed by convolving it with a polynomial (for efficiency) bump function.

However, performing convolution in real space is inefficient, and as we have seen it gets especially bad when smoothing is done over long distances. I think that a hybrid approach could help speed up the algorithm a lot while keeping it fully generic:

• In a first step one would run the current routine but with a kernel that is only a few times larger than the longest hopping. The result would be a regular interpolated current field that is smoothed out only a little (only enough not to lose any local currents).
• In a second step, the output current field of the first step (which lives on a regular rectangular grid) is smoothed further by using an efficient convolution routine that uses FFT or something like that.
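[Editor's note: a sketch of what that second step could look like, assuming scipy as a dependency; the field and bump radius below are made up. Since the intermediate field lives on a regular grid, the remaining long-range smoothing is an ordinary convolution that scipy.signal.fftconvolve evaluates via FFT.]

import numpy as np
from scipy.signal import fftconvolve

def bump_kernel(radius):
    # The polynomial bump f(r) = (1 - r^2)^2 for r < 1 from the text,
    # sampled on a (2*radius + 1)^2 grid and normalized so that the
    # total density/current is preserved.
    x = np.linspace(-1, 1, 2 * radius + 1)
    r2 = x[:, None]**2 + x[None, :]**2
    k = np.where(r2 < 1, (1 - r2)**2, 0.0)
    return k / k.sum()

field = np.random.rand(1000, 1000)  # stand-in for the output of step one
smoothed = fftconvolve(field, bump_kernel(125), mode='same')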
If you or someone else would like to work on this I would be happy to guide you.
Thanks! I can try Cythonizing or vectorizing the loop.
I suggest the following approach:

• You test whether the hybrid approach sketched above works for you. This could already speed up things quite a lot.
• However, the vectorization will be less efficient now due to much smaller smoothing kernels. It will be interesting to see how much slower it gets per kernel evaluation compared to the current “20 clock cycles”. We could then try to speed up the current routine by using the classification approach and/or some Cython.

Hope this helps - please do keep us updated about your findings

Christoph
Hello again,

I noticed two problems with my reply.

• I was speaking about currents while you interpolate densities. (This does not matter much.)
• I was assuming that your output field is 5000^2 while it’s rather (20*9)^2 because you keep the n parameter to interpolate_density at its default value of 9. Can you verify this?

The second point modifies my analysis quite a lot in that the kernel size is only 9*9 and not 250*250. Now it’s something like 32000 CPU cycles per kernel evaluation, which seems horrible.

Could you send me a minimal script that reproduces your problem, i.e. that interpolates some (whatever) density with the same parameters that you use? Then I can check things directly instead of guessing.

Thanks
Christoph
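[Editor's note: with these corrected assumptions the earlier arithmetic changes as follows, still taking the 3 GHz core and 6 h runtime from before, and one 9x9 kernel per density value.]

evaluations = 5000**2 * 9**2  # one small kernel per density value, ~2e9
cycles = 6 * 3600 * 3e9       # 6 h at 3 GHz
print(cycles / evaluations)   # ~32000 CPU cycles per evaluation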
Hi Christoph,

- We need both interpolated densities and currents. I just showed you the densities as an example.
- Yes, indeed n=9. Now that you mention it, I'm confused about the parameters relwidth and n. What's the difference between the two?
- I added you to the code repo on github:

Run using (in the following sequence):

time python dump_system.py
time python solve.py
time python densities_and_currents.py

The parameters for the run are set in params.py. The numbers currently set will produce a 5000x5000 site model. The first script will dump the system to disk, the second will compute the scattering states, and the third will load the system and the scattering states and compute the densities and the currents. Code blocks to produce the plots are in PostProcessingTMF.ipynb (slightly disorganized - please let me know if you need help here).

Cheers,
Mani
Mani Chandra wrote:
- I added you to the code repo on github:
It would be preferable if you could prepare a small self-contained example Python script (rather not a notebook) that you could post to the list. Notably, this would allow others to follow along (now and in the future).

I imagine that that script could be just 20 lines or so: Make up a system with a similar graph (the values of the Hamiltonian do not matter), then make up some makeshift current/density (just a random array for example), and use this to run the interpolator.

Christoph
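[Editor's note: such a script could look roughly like this - a sketch only, where the system width, the random density, and the timing scaffold are made up, while relwidth=0.05 matches the value quoted earlier in the thread.]

import time
import numpy as np
import kwant

W = 500  # scale towards 5000 to approach the reported problem size

lat = kwant.lattice.square(norbs=1)
syst = kwant.Builder()
syst[(lat(x, y) for x in range(W) for y in range(W))] = 0  # values don't matter
syst[lat.neighbors()] = -1
fsyst = syst.finalized()

density = np.random.rand(len(fsyst.sites))  # makeshift density

start = time.time()
field, boundaries = kwant.plotter.interpolate_density(fsyst, density, relwidth=0.05)
print(f'interpolation took {time.time() - start:.1f} s for a {W}x{W} system')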
Hi Christoph,

Attached are the relevant files. The total code is ~100 lines. Copying the instructions from my previous email:

time python dump_system.py
time python solve.py
time python densities_and_currents.py

The parameters for the run are set in params.py. The numbers currently set will produce a 5000x5000 site model. The first script will dump the system to disk, the second will compute the scattering states, and the third will load the system and the scattering states and compute the densities and the currents.

To load the data, do:

data = np.load(f'densities_and_currents_d_c_{params.d_c:.4f}.npz')
rho_left = data['rho_left']
rho_right = data['rho_right']

To plot the raw density:

rho = (rho_right - rho_left).reshape([params.length, params.height])
norm = matplotlib.colors.TwoSlopeNorm(vmin=rho.min(), vcenter=0, vmax=rho.max())
plt.contourf(rho.T, 1000, cmap='bwr', norm=norm)
plt.gca().set_aspect('equal')
plt.colorbar()

To plot the interpolated density:

rho_interpolated = rho_right_interpolated - rho_left_interpolated
rho_interpolated = rho_interpolated/rho_interpolated.max()
norm = matplotlib.colors.TwoSlopeNorm(vmin=rho_interpolated.min(), vcenter=0, vmax=rho_interpolated.max())
plt.contourf(x_rho, y_rho, rho_interpolated[:, :, 0], 1000, cmap='bwr', norm=norm)
plt.gca().set_aspect('equal')
plt.colorbar()

Finally, check this out:

from scipy.ndimage import gaussian_filter

rho = (rho_right - rho_left).reshape([params.length, params.height])
rho_filtered = gaussian_filter(rho, sigma=50)
norm = matplotlib.colors.TwoSlopeNorm(vmin=rho_filtered.min(), vcenter=0, vmax=rho_filtered.max())
plt.contourf(rho_filtered.T, 1000, cmap='bwr', norm=norm)
plt.gca().set_aspect('equal')

The above code gives a plot (attached) quite similar to the interpolated kwant density, but scipy's gaussian filter takes only 10 secs to blur out the 5000x5000 raw density data. Granted, it will probably only work with rectangular lattices. I wish I could do the same with currents, but I'm not sure how to get the currents into a format like length x height (as I did for the density above) because of their definition on the hoppings. Is there a function to obtain the current in a convenient matrix-like format, like the above?

Cheers,
Mani
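[Editor's note: on the last question, nothing in this thread establishes a built-in Kwant helper for that, but for a square lattice one can imagine doing the bookkeeping by hand - scattering the per-hopping currents into two length x height arrays (one per hopping direction) that could then each be passed to gaussian_filter. A sketch with made-up inputs:]

import numpy as np

def currents_on_grid(hoppings, values, shape):
    # hoppings: sequence of ((x1, y1), (x2, y2)) integer site tags;
    # values: the current on each hopping, in the same order.
    jx = np.zeros(shape)  # currents on horizontal hoppings
    jy = np.zeros(shape)  # currents on vertical hoppings
    for ((x1, y1), (x2, y2)), v in zip(hoppings, values):
        if y1 == y2:                    # horizontal hopping
            jx[min(x1, x2), y1] += v
        else:                           # vertical hopping
            jy[x1, min(y1, y2)] += v
    return jx, jy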
Oops, forgot to attach dump_system.py. Sorry about that.