[pypy-dev] Differences performance Julia / PyPy on very similar codes

PIERRE AUGIER pierre.augier at univ-grenoble-alpes.fr
Tue Dec 22 10:34:23 EST 2020


----- Original Message -----
> From: "David Edelsohn" <dje.gcc at gmail.com>
> To: "PIERRE AUGIER" <pierre.augier at univ-grenoble-alpes.fr>
> Cc: "pypy-dev" <pypy-dev at python.org>
> Sent: Monday, December 21, 2020 23:47:22
> Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes

> You did not state on exactly what system you are conducting the
> experiment, but "a factor of 4" seems very close to the
> auto-vectorization speedup of a vector of floats.

The problem is described in detail in the repository https://github.com/paugier/nbabel and in the related issue https://foss.heptapod.net/pypy/pypy/-/issues/3349

>> I think it would be very interesting to understand why PyPy is much slower than
>> Julia in this case (a factor 4 slower than very simple Julia). I'm wondering if
>> it is an issue of the language or a limitation of the implementation.
> 
> If the performance gap is caused by auto-vectorization, I would
> recommend that you consider NumPy with the Numba LLVM-based JIT.  Or,
> for a "pure" Python solution, you can experiment with an older release
> of PyPy and NumPyPy.

There is already an implementation based on Numba (which is slower and, from my point of view, less elegant than what can be done with Transonic-Pythran).

Here, it is really about what can be done with PyPy, now and in the future.

About NumPyPy, I'm sorry about this story, but I'm not interested in playing with an unsupported project.

> If the problem is the abstraction penalty, then the suggestion from
> Anto should help.

I tried to use a list to store the data but unfortunately it's slower (1.5 times slower than with attributes and 6 times slower than Julia on my slow laptop):

Measurements with Julia (https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl):

pierre at voyage ~/Dev/nbabel/py master $ julia microbench_ju4.jl                  
Main.NB.MutablePoint3D  17.833 ms (1048576 allocations: 32.00 MiB)
Main.NB.Point3D  5.737 ms (0 allocations: 0 bytes)
Main.NB.Point4D  4.984 ms (0 allocations: 0 bytes)

Measurements with PyPy objects with x, y, z attributes (like Julia, https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py):

pierre at voyage ~/Dev/nbabel/py master $ pypy microbench_pypy4.py                 
Point3D: 22.503 ms
Point4D: 45.127 ms

Measurements with PyPy, lists and @property (https://github.com/paugier/nbabel/blob/master/py/microbench_pypy_list.py):

pierre at voyage ~/Dev/nbabel/py master $ pypy microbench_pypy_list.py             
Point3D: 34.115 ms
Point4D: 59.646 ms
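For concreteness, the two pure-Python strategies measured above (plain attributes vs a list exposed through `@property`) can be sketched like this. This is only a minimal illustration, not the actual code from the linked microbench scripts: the class names `Point3DAttrs` and `Point3DList` and the `bench` helper are made up here, and the timing loop is a simplified stand-in.

```python
from timeit import timeit


class Point3DAttrs:
    """Coordinates stored as plain instance attributes."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2


class Point3DList:
    """Coordinates stored in a list, accessed through @property."""

    def __init__(self, x, y, z):
        self.data = [x, y, z]

    @property
    def x(self):
        return self.data[0]

    @property
    def y(self):
        return self.data[1]

    @property
    def z(self):
        return self.data[2]

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2


def bench(cls, n=1000, repeat=100):
    """Sum norm_square over n points, repeat times; return ms per repeat."""
    points = [cls(float(i), float(i), float(i)) for i in range(n)]

    def run():
        s = 0.0
        for p in points:
            s += p.norm_square()
        return s

    return timeit(run, number=repeat) / repeat * 1000


for cls in (Point3DAttrs, Point3DList):
    print(f"{cls.__name__}: {bench(cls):.3f} ms")
```

On PyPy, the attribute version lets the JIT specialize instances through its "maps" mechanism, whereas the list version adds an indirection on every coordinate access, which may explain part of the slowdown observed above.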

> But, for the question of why, you can examine the code for the inner
> loop generated by Julia and the code for the inner loop generated by
> PyPy and analyze the reason for the performance gap.  It should be
> evident if the difference is abstraction or SIMD.

Sorry for this naive question, but how can I examine the code for the inner loop generated by PyPy?

Pierre

> 
> On Mon, Dec 21, 2020 at 5:20 PM PIERRE AUGIER
> <pierre.augier at univ-grenoble-alpes.fr> wrote:
>>
>>
>> ----- Original Message -----
>> > From: "David Edelsohn" <dje.gcc at gmail.com>
>> > To: "PIERRE AUGIER" <pierre.augier at univ-grenoble-alpes.fr>
>> > Cc: "pypy-dev" <pypy-dev at python.org>
>> > Sent: Friday, December 18, 2020 21:00:42
>> > Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
>>
>> > Does Julia based on LLVM auto-vectorize the code?  I assume yes
>> > because you specifically mention SIMD design of the data structure.
>>
>> Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some cases?
>>
>> > Have you tried NumPyPy?  Development on NumPyPy has not continued, but
>> > it probably would be a better comparison of what PyPy with
>> > auto-vectorization could accomplish to compare with Julia.
>>
>> I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.
>>
>> Anyway, for this experiment, my attempt was to stay in pure Python and to
>> compare with what is done in pure Julia.
>>
>> I think it would be very interesting to understand why PyPy is much slower than
>> Julia in this case (a factor 4 slower than very simple Julia). I'm wondering if
>> it is an issue of the language or a limitation of the implementation.
>>
>> Moreover, I would really be interested to know if an extension compatible with
>> PyPy (better, not only compatible with PyPy) could be written to make such code
>> faster (a code involving an array of instances of a very simple class). Could
>> we gain anything compared to using a Python list?
>>
>> Are there some tools to understand what is done by PyPy to speed up some code? Or
>> to know more about the data structures used under the hood by PyPy?
>>
>> For example,
>>
>> class Point3D:
>>     def __init__(self, x, y, z):
>>         self.x = x
>>         self.y = y
>>         self.z = z
>>
>>     def norm_square(self):
>>         return self.x**2 + self.y**2 + self.z**2
>>
>> I guess it would be good for efficiency to store the 3 floats as native floats
>> aligned in memory and to vectorize the power computation. How can one know
>> what is done by PyPy for a particular code?
>>
>> Pierre
>>
>> >
>> > Thanks, David
>> >
>> > On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER
>> > <pierre.augier at univ-grenoble-alpes.fr> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I post on this list a message written in PyPy issue tracker
>> >> (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about
>> >> some experiments I did on writing efficient implementations of the NBody
>> >> problem https://github.com/paugier/nbabel to potentially answer to this article
>> >> https://arxiv.org/pdf/2009.11295.pdf.
>> >>
>> >> I get from a PR an [interesting optimized implementation in
>> >> Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl).
>> >> It is very fast (even slightly faster than in Pythran). One idea is to store
>> >> the 3 floats of a 3d physical vector, (x, y, z), in a struct `Point4D`
>> >> containing 4 floats to better use SIMD instructions.
>> >>
>> >> I added a pure Python implementation inspired by this new Julia implementation
>> >> (but with a simple `Point3D` with 3 floats because with PyPy, the `Point4D`
>> >> does not make the code faster) and, good news, with PyPy it is a bit faster
>> >> than our previous PyPy implementations (only 3 times slower than the old C++
>> >> implementation).
>> >>
>> >> However, it is much slower than with Julia (while the code is very similar). I
>> >> coded a simplified version in Julia with nearly nothing else than what can be
>> >> written in pure Python (in particular, no `@inbounds` and `@simd` macros). It
>> >> seems to me that the comparison of these 2 versions could be interesting. So I
>> >> again simplified these 2 versions to keep only what is important for
>> >> performance, which gives
>> >>
>> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
>> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl
>> >>
>> >> The results are summarized in
>> >> https://github.com/paugier/nbabel/blob/master/py/microbench.md
>> >>
>> >> An important point is that with `Point3D` (a mutable class in Python and an
>> >> immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same code and
>> >> nothing really fancy in Julia so I guess that PyPy might be missing some
>> >> optimization opportunities. At least it would be interesting to understand what
>> >> is slower in PyPy (and why). I have to admit that I don't know how to get
>> >> interesting information on timing and on what is happening in the PyPy JIT in a
>> >> particular case. I only used cProfile and it's of course clearly not enough. I
>> >> can run vmprof but I'm not able to visualize the data because the website
>> >> http://vmprof.com/ is down. I don't know if I can trust values given by IPython
>> >> `%timeit` for particular instructions since I don't know if the PyPy JIT does the
>> >> same thing in `%timeit` and in the function `compute_accelerations`.
>> >>
>> >> I also feel that in pure Python I really miss an efficient fixed-size
>> >> homogeneous mutable sequence (a "Vector" in Julia words) that can contain basic
>> >> numerical types (as Python `array.array`) but also instances of user-defined
>> >> classes and instances of Vectors. The Python code uses a [pure Python
>> >> implementation using a
>> >> list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I think it
>> >> would be reasonable to have a good implementation highly compatible with PyPy
>> >> (and potentially other Python implementations) in a package on PyPI. It would
>> >> really help to write PyPy compatible numerical codes. What would be the good
>> >> tool to implement such a package? HPy? I wonder whether we can get some speedup
>> >> compared to the pure Python version with lists. For very simple classes like
>> >> `Point3d` and `Point4d`, I wonder if the data could be stored contiguously in
>> >> memory and if some operations could be done without boxing/unboxing.
>> >>
>> >> However, I really don't know what is slower in PyPy / faster in Julia.
>> >>
>> >> I would be very interested to get the points of view of people who know PyPy
>> >> well.
>> >>
>> >> Pierre
>> >> _______________________________________________
>> >> pypy-dev mailing list
>> >> pypy-dev at python.org
>> > > https://mail.python.org/mailman/listinfo/pypy-dev
