Differences performance Julia / PyPy on very similar codes
Hi,

I'm posting to this list a message written in the PyPy issue tracker (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about some experiments I did on writing efficient implementations of the N-body problem (https://github.com/paugier/nbabel), potentially to respond to this article: https://arxiv.org/pdf/2009.11295.pdf.

I got from a PR an [interesting optimized implementation in Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl). It is very fast (even slightly faster than Pythran). One idea is to store the 3 floats of a 3D physical vector, (x, y, z), in a struct `Point4D` containing 4 floats, to make better use of SIMD instructions.

I added a pure Python implementation inspired by this new Julia implementation (but with a simple `Point3D` with 3 floats, because with PyPy the `Point4D` does not make the code faster) and, good news, with PyPy it is a bit faster than our previous PyPy implementations (only 3 times slower than the old C++ implementation).

However, it is much slower than with Julia (while the code is very similar). I coded a simplified version in Julia with nearly nothing beyond what can be written in pure Python (in particular, no `@inbounds` and `@simd` macros). It seems to me that the comparison of these 2 versions could be interesting. So I again simplified these 2 versions to keep only what is important for performance, which gives:

- https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
- https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl

The results are summarized in https://github.com/paugier/nbabel/blob/master/py/microbench.md

An important point is that with `Point3D` (a mutable class in Python and an immutable struct in Julia), Julia is 3.6 times faster than PyPy. It is the same code and nothing really fancy on the Julia side, so I guess that PyPy might be missing some optimization opportunities. At least it would be interesting to understand what is slower in PyPy (and why). I have to admit that I don't know how to get useful information on timing and on what is happening in the PyPy JIT in a particular case. I only used cProfile, and it's of course clearly not enough. I can run vmprof but I'm not able to visualize the data because the website http://vmprof.com/ is down. I don't know if I can trust values given by IPython `%timeit` for particular instructions, since I don't know if the PyPy JIT does the same thing in `%timeit` as in the function `compute_accelerations`.

I also feel that I really miss in pure Python an efficient fixed-size homogeneous mutable sequence (a "Vector" in Julia words) that can contain basic numerical types (like Python's `array.array`) but also instances of user-defined classes and instances of Vectors. The Python code uses a [pure Python implementation based on a list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I think it would be reasonable to have a good implementation, highly compatible with PyPy (and potentially other Python implementations), in a package on PyPI. It would really help for writing PyPy-compatible numerical codes. What would be the right tool to implement such a package? HPy? I wonder whether we could get some speedup compared to the pure Python version with lists. For very simple classes like `Point3d` and `Point4d`, I wonder if the data could be stored contiguously in memory and if some operations could be done without boxing/unboxing.

However, I really don't know what is slower in PyPy / faster in Julia.
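To make the `Point4D` idea above concrete, here is a rough sketch of the layout in Python (illustrative code only, not the actual code from the linked files):

```python
class Point4D:
    """A 3D vector padded with a 4th float so that the 4 values
    exactly fill one SIMD register (sketch of the idea)."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        self.w = 0.0  # padding, always zero

    def norm_square(self):
        # the w term is always zero so the result is unchanged, but a
        # vectorizing compiler can do the 4 multiplications in one go
        return self.x**2 + self.y**2 + self.z**2 + self.w**2
```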
I would be very interested to get the points of view of people who know PyPy well.

Pierre
Does Julia, which is based on LLVM, auto-vectorize the code? I assume yes, because you specifically mention the SIMD design of the data structure.

Have you tried NumPyPy? Development on NumPyPy has not continued, but it would probably give a better picture of what PyPy with auto-vectorization could accomplish compared with Julia.

Thanks, David
----- Original Message -----
From: "David Edelsohn" <dje.gcc@gmail.com>
To: "PIERRE AUGIER" <pierre.augier@univ-grenoble-alpes.fr>
Cc: "pypy-dev" <pypy-dev@python.org>
Sent: Friday, December 18, 2020 21:00:42
Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
> Does Julia, which is based on LLVM, auto-vectorize the code? I assume yes, because you specifically mention the SIMD design of the data structure.
Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some cases?
> Have you tried NumPyPy? Development on NumPyPy has not continued, but it would probably give a better picture of what PyPy with auto-vectorization could accomplish compared with Julia.
I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.

Anyway, for this experiment, my intent was to stay in pure Python and to compare with what is done in pure Julia. I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor of 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.

Moreover, I would really be interested to know if an extension compatible with PyPy (better, not only compatible with PyPy) could be written to make such code faster (code involving an array of instances of a very simple class). Could we gain anything compared to using a Python list?

Are there some tools to understand what is done by PyPy to speed up some code? Or to learn more about the data structures used under the hood by PyPy?

For example:

```python
class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2
```

I guess it would be good for efficiency to store the 3 floats as native floats aligned in memory and to vectorize the power computations. How can one know what is done by PyPy for a particular code?

Pierre
On Mon, Dec 21, 2020 at 11:19 PM PIERRE AUGIER <pierre.augier@univ-grenoble-alpes.fr> wrote:
> ```python
> class Point3D:
>     def __init__(self, x, y, z):
>         self.x = x
>         self.y = y
>         self.z = z
>
>     def norm_square(self):
>         return self.x**2 + self.y**2 + self.z**2
> ```
You could try to store x, y and z inside a list instead of 3 different attributes: PyPy will use the specialized list implementation which stores them unboxed, which might help the subsequent code. You can even use @property to expose them as .x, .y and .z, since the JIT should happily optimize the abstraction away.
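For instance, a minimal, untested sketch of what I mean (illustrative code, not taken from the repository):

```python
class Point3D:
    def __init__(self, x, y, z):
        # a plain list of floats lets PyPy use its unboxed-float list strategy
        self._data = [x, y, z]

    @property
    def x(self):
        return self._data[0]

    @property
    def y(self):
        return self._data[1]

    @property
    def z(self):
        return self._data[2]

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2
```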
You did not state exactly what system you are conducting the experiment on, but "a factor of 4" seems very close to the auto-vectorization speedup of a vector of floats.
> I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor of 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.
If the performance gap is caused by auto-vectorization, I would recommend that you consider NumPy with the Numba LLVM-based JIT. Or, for a "pure" Python solution, you can experiment with an older release of PyPy and NumPyPy. If the problem is the abstraction penalty, then the suggestion from Anto should help.

But, for the question of why, you can examine the code for the inner loop generated by Julia and the code for the inner loop generated by PyPy, and analyze the reason for the performance gap. It should be evident whether the difference is abstraction or SIMD.

Thanks, David
----- Original Message -----
From: "David Edelsohn" <dje.gcc@gmail.com>
To: "PIERRE AUGIER" <pierre.augier@univ-grenoble-alpes.fr>
Cc: "pypy-dev" <pypy-dev@python.org>
Sent: Monday, December 21, 2020 23:47:22
Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
> You did not state exactly what system you are conducting the experiment on, but "a factor of 4" seems very close to the auto-vectorization speedup of a vector of floats.
The problem is described in detail in the repository https://github.com/paugier/nbabel and in the related issue https://foss.heptapod.net/pypy/pypy/-/issues/3349
>> I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor of 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.
> If the performance gap is caused by auto-vectorization, I would recommend that you consider NumPy with the Numba LLVM-based JIT. Or, for a "pure" Python solution, you can experiment with an older release of PyPy and NumPyPy.
There is already an implementation based on Numba (which is slower and, from my point of view, less elegant than what can be done with Transonic-Pythran). Here, it is really about what can be done with PyPy, nowadays and in the future. About NumPyPy, I'm sorry about this story, but I'm not interested in playing with an unsupported project.
> If the problem is the abstraction penalty, then the suggestion from Anto should help.
I tried to use a list to store the data but unfortunately it's slower (1.5 times slower than with attributes and 6 times slower than Julia on my slow laptop).

Measurements with Julia (https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl):

```
pierre@voyage ~/Dev/nbabel/py master $ julia microbench_ju4.jl
Main.NB.MutablePoint3D  17.833 ms (1048576 allocations: 32.00 MiB)
Main.NB.Point3D          5.737 ms (0 allocations: 0 bytes)
Main.NB.Point4D          4.984 ms (0 allocations: 0 bytes)
```

Measurements with PyPy, objects with x, y, z attributes (like Julia, https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py):

```
pierre@voyage ~/Dev/nbabel/py master $ pypy microbench_pypy4.py
Point3D: 22.503 ms
Point4D: 45.127 ms
```

Measurements with PyPy, lists and @property (https://github.com/paugier/nbabel/blob/master/py/microbench_pypy_list.py):

```
pierre@voyage ~/Dev/nbabel/py master $ pypy microbench_pypy_list.py
Point3D: 34.115 ms
Point4D: 59.646 ms
```
> But, for the question of why, you can examine the code for the inner loop generated by Julia and the code for the inner loop generated by PyPy, and analyze the reason for the performance gap. It should be evident whether the difference is abstraction or SIMD.
Sorry for this naive question, but how can I examine the code for the inner loop generated by PyPy?

Pierre
On 22.12.20 16:34, PIERRE AUGIER wrote:
> Here, it is really about what can be done with PyPy, nowadays and in the future.
Hi Pierre,

A few somewhat random comments from me.

First note is that you shouldn't run two different implementations that you are comparing (Point3D and Point4D in this case) within the same process, since they can influence each other. If I run them in the same process I get this:

```
Point3D: 11.426 ms
Point4D: 21.572 ms
```

In separate processes the latter speeds up:

```
Point4D: 13.136 ms
```

(But it doesn't become faster than Point3D, indeed because we don't have any real SIMD support in the JIT.)

Next: some information about how to look at the generated code with PyPy. What I do is look at the JIT IR (which is very close to machine code, but one abstraction level above it). You get it like this:

```
PYPYLOG=jit-log-opt,jit-summary,jit-backend-counts:out pypy3 microbench_pypy4.py
```

This produces a file called 'out' with different sections. I usually start by looking at the bottom, which shows how often each trace is entered. This way, you can find the hottest trace:

```
[26f0c8566379] {jit-backend-counts
...
TargetToken(140179837690368):43692970
TargetToken(140179837690448):74923530
...
[26f0c8567905] jit-backend-counts}
```

Now I search for the address of the hottest trace to find its IR. The IR shows traced Python bytecodes interspersed with IR instructions (it takes a bit of time to get used to reading it, but it's not super hard).

Looking through that, it's my opinion that the trace looks quite good. There are many small inefficiencies (a bit too much pointer chasing, a bit too much type checking everywhere, a few allocations that aren't necessary), but no single missed optimization that could immediately give a 5x speedup.

This also follows my expectations of how I suspect a shootout between Julia and PyPy to end up: PyPy is much faster than CPython for algorithmic pure Python code (~150x on my laptop! that's really good :-)). But it can't really beat a "serious" ahead-of-time compiler for a statically typed language that specifically targets numerical code. That is for several reasons, the most important ones being that 1) PyPy has a lot less time to produce code, given that it does it at runtime, and 2) PyPy has to support the full dynamically typed language Python, where really random things can be done at runtime, and PyPy must still always observe the Python semantics.

That said, I can understand that 5x slower is still a somewhat disappointing result and I suspect given enough effort we could maybe get it down to around 3x slower.

Cheers,

Carl Friedrich
On Tue, 22 Dec 2020, Carl Friedrich Bolz-Tereick wrote:
> That said, I can understand that 5x slower is still a somewhat disappointing result and I suspect given enough effort we could maybe get it down to around 3x slower.
Just to clarify, if I understand you correctly, you mean that by investing some serious effort into optimising those "small" inefficiencies one could improve the situation from 5x to 3x.

However, I wonder if anything could be done on the SIMD front in a rather generic way with a reasonable investment of time, but without going the full NumPyPy way, e.g. by doing something special for tight loops performing math on objects with a special layout (lists, arrays)...

--
Sincerely yours, Yury V. Zaytsev
----- Original Message -----
From: "David Edelsohn" <dje.gcc@gmail.com>
To: "PIERRE AUGIER" <pierre.augier@univ-grenoble-alpes.fr>
Cc: "pypy-dev" <pypy-dev@python.org>
Sent: Monday, December 21, 2020 23:47:22
Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
> You did not state exactly what system you are conducting the experiment on, but "a factor of 4" seems very close to the auto-vectorization speedup of a vector of floats.
I wrote another very simple benchmark that should not depend on auto-vectorization. The bench function is:

```python
def sum_x(positions):
    result = 0.0
    for i in range(len(positions)):
        result += positions[i].x
    return result
```

The scripts are:

- https://github.com/paugier/nbabel/blob/master/py/microbench_sum_x.py
- https://github.com/paugier/nbabel/blob/master/py/microbench_sum_x.jl

Even in this case, Julia is again notably (~2.7 times) faster:

```
$ julia microbench_sum_x.jl
  1.208 μs (1 allocation: 16 bytes)

In [1]: run microbench_sum_x.py
sum_x(positions)
3.29 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sum_x(positions_list)
14.5 µs ± 291 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

For `positions_list`, each `point` contains a list to store the 3 floats.

How can I analyze these performance differences? How can I get more information on what happens for this code with PyPy?
On 23.12.20 14:42, PIERRE AUGIER wrote:
> I wrote another very simple benchmark that should not depend on auto-vectorization. The bench function is:
>
> ```python
> def sum_x(positions):
>     result = 0.0
>     for i in range(len(positions)):
>         result += positions[i].x
>     return result
> ```
This benchmark probably really shows the crux of the problem. In Python, the various Point instances (whether with lists, or with direct attributes) are vastly more complex beasts than the structs in Julia. There, you can declare a struct with a certain number of Float64 fields and be done. Thus, reading .x from such a struct is just a pointer dereference.

In Python, due to dynamic typing, the ability to add more fields later, and even the ability to change the class of an instance, the actual memory layout of a Point3D type is much more complex, with various indirections and boxing. Reading .x out of such a thing is done in several steps:

1) check that positions[i] is an instance
2) check that it's an instance of Point3D
3) read its x field
4) check that the field is a float
5) read the float's value

All of these steps involve a pointer read.

Improving this situation is probably possible (there's even a paper on how to get rid of steps 1 and 2: https://www.csl.cornell.edu/~cbatten/pdfs/cheng-type-freezing-cgo2020.pdf but the work wasn't merged). But there are problems:

- basically every single one of these steps needs to be addressed, and every one is its own optimization
- it's extremely delicate to get the balance and the trade-offs right, because the object system is so central in getting good performance for Python code across a wide variety of areas (not just numerical algorithms).

Another approach would indeed be (as you say in the other mail) to add support for telling PyPy explicitly that some list can contain only instances of a specific class and (more importantly) that a class is not to be considered "dynamic", meaning that its fields are fixed and of specific types. So far, we have not really gone in such directions, because that is language design and we leave that to the CPython devs ;-).

Note that some of your other benchmarks are not measuring what you hope! E.g. I suspect that get_objects, get_xs and loop_over_list_of_objects from your other mail get completely removed by the Julia compiler, since they don't have side effects and don't compute anything. PyPy isn't actually able to remove empty loops. So you are comparing empty loops in PyPy with no code at all in Julia.
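To make the "dynamic" part concrete, all of the following is legal Python at runtime, which is why every attribute read needs the guards listed above (an illustration, not code from the benchmarks):

```python
class Point3D:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class Particle:
    pass

p = Point3D(1.0, 2.0, 3.0)
p.w = 4.0                  # a new field can appear after construction
p.x = "no longer a float"  # a field can be rebound to another type
p.__class__ = Particle     # even the class of the instance can change
```

Cheers,

Carl Friedrich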
Hi Carl,

On 2020-12-24 07:06:43 +0100, Carl Friedrich Bolz-Tereick wrote:
> On 23.12.20 14:42, PIERRE AUGIER wrote:
> > I wrote another very simple benchmark that should not depend on auto-vectorization. The bench function is:
> >
> > ```python
> > def sum_x(positions):
> >     result = 0.0
> >     for i in range(len(positions)):
> >         result += positions[i].x
> >     return result
> > ```
>
> This benchmark probably really shows the crux of the problem. In Python, the various Point instances (whether with lists, or with direct attributes) are vastly more complex beasts than the structs in Julia. There, you can declare a struct with a certain number of Float64 fields and be done. Thus, reading .x from such a struct is just a pointer dereference.
>
> In Python, due to dynamic typing, the ability to add more fields later, and even the ability to change the class of an instance, the actual memory layout of a Point3D type is much more complex, with various indirections and boxing. Reading .x out of such a thing is done in several steps:
>
> 1) check that positions[i] is an instance
> 2) check that it's an instance of Point3D
> 3) read its x field
> 4) check that the field is a float
> 5) read the float's value
>
> All of these steps involve a pointer read.
Could crafted asserts tell the compiler that dynamic stuff is not happening? Is the compiler listening? :)
> Improving this situation is probably possible (there's even a paper on how to get rid of steps 1 and 2: https://www.csl.cornell.edu/~cbatten/pdfs/cheng-type-freezing-cgo2020.pdf but the work wasn't merged). But there are problems:
>
> - basically every single one of these steps needs to be addressed, and every one is its own optimization
> - it's extremely delicate to get the balance and the trade-offs right, because the object system is so central in getting good performance for Python code across a wide variety of areas (not just numerical algorithms).
>
> Another approach would indeed be (as you say in the other mail) to add support for telling PyPy explicitly that some list can contain only instances of a specific class and (more importantly) that a class is not to be considered "dynamic", meaning that its fields are fixed and of specific types. So far, we have not really gone in such directions, because that is language design and we leave that to the CPython devs ;-).
assert is already there :)
> Note that some of your other benchmarks are not measuring what you hope! E.g. I suspect that get_objects, get_xs and loop_over_list_of_objects from your other mail get completely removed by the Julia compiler, since they don't have side effects and don't compute anything. PyPy isn't actually able to remove empty loops. So you are comparing empty loops in PyPy with no code at all in Julia.
And benchmarks that just average, without segregating outliers or clusters, are terrible. Scattergrams are much more informative :) E.g., in a scattergram you can discover that a CPU shortcutting multiplication by zero affects a subset of your computations.
> Cheers,
> Carl Friedrich
Happy Holidays!

--
Regards,
Bengt Richter
On Thu, 24 Dec 2020, Carl Friedrich Bolz-Tereick wrote:
> Another approach would indeed be (as you say in the other mail) to add support for telling PyPy explicitly that some list can contain only instances of a specific class and (more importantly) that a class is not to be considered "dynamic", meaning that its fields are fixed and of specific types. So far, we have not really gone in such directions, because that is language design and we leave that to the CPython devs ;-).
Hmmm, how about dataclasses ;-) Maybe those can be used as optimization targets under some reasonable assumptions:

```python
from dataclasses import dataclass

@dataclass
class Point3D:
    x: float
    y: float
    z: float
```

Feels quite pythonic to me...

--
Sincerely yours, Yury V. Zaytsev
----- Original Message -----
From: "Carl Friedrich Bolz-Tereick" <cfbolz@gmx.de>
To: "PIERRE AUGIER" <pierre.augier@univ-grenoble-alpes.fr>, "pypy-dev" <pypy-dev@python.org>
Sent: Thursday, December 24, 2020 07:06:43
Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
Thanks a lot Carl for your very interesting answers.

I'm wondering if it could be possible to write an extension that would improve the situation for such numerical codes. I wrote a first description here: https://github.com/paugier/nbabel/blob/master/py/vector.md (more about the Python API).

I think that if something like this extension could exist and be very efficient with PyPy, it would greatly help for writing very efficient numerical codes in "pure Python style". For the case of the NBabel problem, the code would be very nice and it seems to me that we could reach very good performance compared to Julia and other compiled languages.

I would be very interested to get some feedback on this proposition. Do you think that HPy could be used to implement such an extension? Could such an extension be fully compatible with the PyPy JIT without modification in PyPy?
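To give a rough idea of the kind of API I have in mind, here is a toy sketch (hypothetical names, not taken from vector.md; a real implementation would store the elements unboxed and contiguously in memory):

```python
class Point3D:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class Vector(list):
    """Toy pure-Python stand-in for the proposed fixed-size,
    homogeneous container."""

    def __init__(self, cls, size):
        super().__init__([None] * size)
        self._cls = cls

    def __setitem__(self, index, value):
        # homogeneity is part of the contract
        assert isinstance(value, self._cls)
        super().__setitem__(index, value)

positions = Vector(Point3D, size=1024)
positions[0] = Point3D(0.0, 0.0, 0.0)
x0 = positions[0].x
```

Cheers,

Pierre Augier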
I think I understood that what is very slow compared to Julia is looping over a list of Python objects:

```python
def loop_over_list_of_objects(l):
    for o in l:
        o

loop_over_list_of_objects([object() for _ in range(1000)])
```

See https://github.com/paugier/nbabel/blob/master/py/microbench_sum_x.py

Is there a better way to store Python objects (homogeneous in type) to be able to loop over them more efficiently? Would it be possible to store them in a contiguous array (if it makes sense for Python objects)?
De: "David Edelsohn" <dje.gcc@gmail.com> À: "PIERRE AUGIER" <pierre.augier@univ-grenoble-alpes.fr> Cc: "pypy-dev" <pypy-dev@python.org> Envoyé: Lundi 21 Décembre 2020 23:47:22 Objet: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
You did not state on exactly what system you are conducting the experiment, but "a factor of 4" seems very close to the auto-vectorization speedup of a vector of floats.
I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.
If the performance gap is caused by auto-vectorization, I would recommend that you use consider Numpy with Numba LLVM-based JIT. Or, for a "pure" Python solution, you can experiment with an older release of PyPy and NumPyPy.
If the problem is the abstraction penalty, then the suggestion from Anto should help.
But, for the question of why, you can examine the code for the inner loop generated by Julia and the code for the inner loop generate by PyPy and analyze the reason for the performance gap. It should be evident if the difference is abstraction or SIMD.
Thanks, David
On Mon, Dec 21, 2020 at 5:20 PM PIERRE AUGIER <pierre.augier@univ-grenoble-alpes.fr> wrote:
----- Mail original -----
De: "David Edelsohn" <dje.gcc@gmail.com> À: "PIERRE AUGIER" <pierre.augier@univ-grenoble-alpes.fr> Cc: "pypy-dev" <pypy-dev@python.org> Envoyé: Vendredi 18 Décembre 2020 21:00:42 Objet: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
Does Julia based on LLVM auto-vectorize the code? I assume yes because you specifically mention SIMD design of the data structure.
Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some case?
Have you tried NumPyPy? Development on NumPyPy has not continued, but it probably would be a better comparison of what PyPy with auto-vectorization could accomplish to compare with Julia.
I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.
Anyway, for this experiment, my attempt was to stay in pure Python and to compare with what is done in pure Julia.
I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.
Moreover, I would really be interested to know if an extension compatible with PyPy (better, not only compatible with PyPy) could be written to make such code faster (a code involving an array of instances of a very simple class). Could we gain anything compare to using a Python list?
Are there some tools to understand what is done by PyPy to speedup some code? Or to know more on the data structures used under the hood by PyPy?
For example,
class Point3D: def __init__(self, x, y, z): self.x = x self.y = y self.z = z
def norm_square(self): return self.x**2 + self.y**2 + self.z**2
I guess it would be good for efficiency to store the 3 floats as native floats aligned in memory and to vectorized the power computation. How can one know what is done by PyPy for a particular code?
Pierre
Thanks, David
On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER <pierre.augier@univ-grenoble-alpes.fr> wrote:
Hi,
I post on this list a message written in PyPy issue tracker (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about some experiments I did on writing efficient implementations of the NBody problem https://github.com/paugier/nbabel to potentially answer to this article https://arxiv.org/pdf/2009.11295.pdf.
I get from a PR an [interesting optimized implementation in Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl). It is very fast (even slightly faster than in Pythran). One idea is to store the 3 floats of a 3d physical vector, (x, y, z), in a struct `Point4D` containing 4 floats to better use simd instructions.
I added a pure Python implementation inspired by this new Julia implementation (but with a simple `Point3D` with 3 floats because with PyPy, the `Point4D` does not make the code faster) and good news it is with PyPy a bit faster than our previous PyPy implementations (only 3 times slower than the old C++ implementation).
However, it is much slower than with Julia (while the code is very similar). I coded a simplified version in Julia with nearly nothing else that what can be written in pure Python (in particular, no `@inbounds` and `@simd` macros). It seems to me that the comparison of these 2 versions could be interesting. So I again simplified these 2 versions to keep only what is important for performance, which gives
- https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl
The results are summarized in https://github.com/paugier/nbabel/blob/master/py/microbench.md
An important point is that with `Point3D` (a mutable class in Python and an immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same code and nothing really fancy in Julia so I guess that PyPy might be missing some optimization opportunities. At least it would be interesting to understand what is slower in PyPy (and why). I have to admit that I don't know how to get interesting information on timing and what is happening with PyPy JIT in a particular case. I only used cProfile and it's of course clearly not enough. I can run vmprof but I'm not able to visualize the data because the website http://vmprof.com/ is down. I don't know if I can trust values given by IPython `%timeit` for particular instructions since I don't know if PyPy JIT does the same thing in `%timeit` and in the function `compute_accelerations`.
I also feel that I really miss in pure Python an efficient fixed size homogeneous mutable sequence (a "Vector" in Julia words) that can contain basic numerical types (as Python `array.array`) but also instances of user-defined classes and instances of Vectors. The Python code uses a [pure Python implementation using a list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I think it would be reasonable to have a good implementation highly compatible with PyPy (and potentially other Python implementations) in a package on PyPI. It would really help to write PyPy compatible numerical codes. What would be the good tool to implement such package? HPy? I wonder whether we can get some speedup compared to the pure Python version with lists. For very simple classes like `Point3d` and `Point4d`, I wonder if the data could be saved continuously in memory and if some operations could be done without boxing/unboxing.
However, I really don't know what is slower in PyPy / faster in Julia.
I would be very interested to get the points of view of people knowing well PyPy.
Pierre _______________________________________________ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev
_______________________________________________ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev
participants (6)
- Antonio Cuni
- Bengt Richter
- Carl Friedrich Bolz-Tereick
- David Edelsohn
- PIERRE AUGIER
- Yury V. Zaytsev