[pypy-dev] Differences performance Julia / PyPy on very similar codes

Fri Dec 18 14:48:27 EST 2020

Hi, 

I post on this list a message written in PyPy issue tracker (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about some experiments I did on writing efficient implementations of the NBody problem https://github.com/paugier/nbabel to potentially answer to this article https://arxiv.org/pdf/2009.11295.pdf. 

I get from a PR an [interesting optimized implementation in Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl). It is very fast (even slightly faster than in Pythran). One idea is to store the 3 floats of a 3d physical vector, (x, y, z), in a struct `Point4D` containing 4 floats to better use simd instructions.

I added a pure Python implementation inspired by this new Julia implementation (but with a simple `Point3D` with 3 floats because with PyPy, the `Point4D` does not make the code faster) and good news it is with PyPy a bit faster than our previous PyPy implementations (only 3 times slower than the old C++ implementation).

However, it is much slower than with Julia (while the code is very similar). I coded a simplified version in Julia with nearly nothing else that what can be written in pure Python (in particular, no `@inbounds` and `@simd` macros). It seems to me that the comparison of these 2 versions could be interesting. So I again simplified these 2 versions to keep only what is important for performance, which gives 

- https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
- https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl

The results are summarized in https://github.com/paugier/nbabel/blob/master/py/microbench.md

An important point is that with `Point3D` (a mutable class in Python and an immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same code and nothing really fancy in Julia so I guess that PyPy might be missing some optimization opportunities. At least it would be interesting to understand what is slower in PyPy (and why). I have to admit that I don't know how to get interesting information on timing and what is happening with PyPy JIT in a particular case. I only used cProfile and it's of course clearly not enough. I can run vmprof but I'm not able to visualize the data because the website http://vmprof.com/ is down. I don't know if I can trust values given by IPython `%timeit` for particular instructions since I don't know if PyPy JIT does the same thing in `%timeit` and in the function `compute_accelerations`.

I also feel that I really miss in pure Python an efficient fixed size homogeneous mutable sequence (a "Vector" in Julia words) that can contain basic numerical types (as Python `array.array`) but also instances of user-defined classes and instances of Vectors. The Python code uses a [pure Python implementation using a list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I think it would be reasonable to have a good implementation highly compatible with PyPy (and potentially other Python implementations) in a package on PyPI. It would really help to write PyPy compatible numerical codes. What would be the good tool to implement such package? HPy? I wonder whether we can get some speedup compared to the pure Python version with lists. For very simple classes like `Point3d` and `Point4d`, I wonder if the data could be saved continuously in memory and if some operations could be done without boxing/unboxing.

However, I really don't know what is slower in PyPy / faster in Julia.

I would be very interested to get the points of view of people knowing well PyPy.

Pierre