[Numpy-discussion] Looking for a difference between Numpy 1.19.5 and 1.20 explaining a perf regression with Pythran

Sebastian Berg
Fri Mar 12 16:50:24 EST 2021

On Fri, 2021-03-12 at 21:36 +0100, PIERRE AUGIER wrote:
> Hi,
> I'm looking for a difference between Numpy 1.19.5 and 1.20 which
> could explain a performance regression (~15 %) with Pythran.
> I observe this regression with the script 
> https://github.com/paugier/nbabel/blob/master/py/bench.py
> Pythran reimplements Numpy so it is not about Numpy code for
> computation. However, Pythran of course uses the native array
> contained in a Numpy array. I'm quite sure that something has changed
> between Numpy 1.19.5 and 1.20 (or between the corresponding wheels?)
> since I don't get the same performance with Numpy 1.20. I checked
> that the values in the arrays are the same and that the flags
> characterizing the arrays are also the same.
> Good news, I'm now able to reproduce the performance difference with
> Numpy 1.19.5 alone. In this code, I load the data with Pandas and need
> to prepare contiguous Numpy arrays to give them to Pythran. With
> Numpy 1.19.5, if I use np.copy I get better performance than with
> np.ascontiguousarray. With Numpy 1.20, both functions create arrays
> giving the same performance with Pythran (again, worse than with
> Numpy 1.19.5).
> Note that this code is very efficient (more than 100 times faster
> than using Numpy), so I guess that things like alignment or memory
> location can lead to such a difference.
> More details in this issue 
> https://github.com/serge-sans-paille/pythran/issues/1735
> Any help to understand what has changed would be greatly appreciated!

If you really want to dig into this, it would be good to do some
profiling to find out where the differences are.

Without that, I don't have much appetite to investigate personally. The
reason is that fluctuations of ~30% (or even much more) when running
the NumPy benchmarks are very common.
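
As a rough way to put a number on that kind of noise, one can time the
same function several times and look at the spread between repetitions.
The following is only a sketch: `relative_spread` and `kernel` are
hypothetical names, and `kernel` is a small NumPy stand-in, not the
nbabel benchmark or a Pythran-compiled function.

```python
import timeit

import numpy as np


def relative_spread(func, repeat=7, number=20):
    """Time func `repeat` times and return (best, spread).

    best is the fastest time per call in seconds; spread is
    (slowest - fastest) / fastest over the repetitions.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    best = min(per_call)
    return best, (max(per_call) - best) / best


rng = np.random.default_rng(0)
data = rng.random((1000, 3))


def kernel():
    # Stand-in workload (assumption); replace with the function to measure.
    return (data * data).sum()


best, spread = relative_spread(kernel)
print(f"best: {best * 1e6:.1f} us per call, spread: {spread:.1%}")
```

If the spread measured this way is comparable to the regression being
chased, the two cannot be told apart without a more controlled setup.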

I am not aware of an immediate change in NumPy, especially since you
are talking about Pythran, where only the memory space or the interface
code should matter.
As to the interface code... I would expect it to be quite a bit faster,
not slower.
There was no change around data allocation, so at best what you are
seeing is a different pattern in how the "small array cache" ends up
being used.
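
To check whether the arrays handed to the compiled code really differ
in memory layout, one can print the flags, the strides and the
alignment of the data pointer for both ways of preparing them. This is
a sketch under assumptions: `layout` is a hypothetical helper, `source`
is a strided stand-in for the values coming out of Pandas, and the
64-byte boundary is just one plausible alignment to look at.

```python
import numpy as np


def layout(arr):
    """Collect layout facts that compiled code could be sensitive to."""
    address = arr.ctypes.data  # address of the first element
    return {
        "c_contiguous": bool(arr.flags["C_CONTIGUOUS"]),
        "aligned": bool(arr.flags["ALIGNED"]),
        "owndata": bool(arr.flags["OWNDATA"]),
        "strides": arr.strides,
        "address_mod_64": address % 64,  # 0: starts on a 64-byte boundary
    }


# Stand-in for the Pandas data (assumption): a (1000, 3) float64 view
# that is not C-contiguous.
source = np.arange(3000, dtype=np.float64).reshape(3, 1000).T

# np.copy always allocates; its default order="K" follows the layout
# of the input instead of forcing C order.
copied = np.copy(source)

# np.ascontiguousarray copies only if the input is not C-contiguous.
contiguous = np.ascontiguousarray(source)

for name, arr in [("np.copy", copied), ("np.ascontiguousarray", contiguous)]:
    print(name, layout(arr))
```

Note that np.ascontiguousarray returns its input without copying when
that input is already C-contiguous, so in that case the resulting
array shares its buffer with whatever produced the data.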

Unfortunately, getting stable benchmarks that reflect code changes
exactly is tough... There is a nice blog post from Victor Stinner where
he had to go as far as using "profile guided compilation" to avoid
such fluctuations.

I somewhat hope that this is also the reason for the huge fluctuations
we see in the NumPy benchmarks due to absolutely unrelated code
changes. But I did not have the energy to try it (and a probably fixed
bug in gcc makes it a bit harder right now).



> Cheers,
> Pierre
