
Hi,

This is Florin Papa from the Dynamic Scripting Languages Optimizations Team at Intel Corporation. I have been working with NumPyPy to evaluate its performance, and it seems significantly slower than CPython NumPy or even PyPy NumPy (installed with pip). The results were gathered by running microbenchmarks inspired by http://www.labri.fr/perso/nrougier/teaching/numpy.100/

These benchmarks perform basic tasks, such as matrix multiplication, Cauchy matrix generation, Gaussian array generation, finding the min and max of a matrix, in-place conversion of a float array to an integer array, and summing all elements of an array.

Please see below the results containing run time, normalized to the CPython NumPy results (baseline):

Benchmark      CPython NumPy   PyPy NumPy    PyPy NumPyPy
cauchy         1               5.838852812   4.866947551
pointbypoint   1               4.922654347   0.981008211
numrand        1               2.478997019   1.082185897
rowmean        1               2.512893263   1.062233015
dsums          1               33.58240465   1.013388981
vectsum        1               1.738446611   0.771660704
cauchy         1               2.168377906   0.887388291
polarcoords    1               1.030962402   0.500905427
vectsort       1               2.214586698   0.973727924
arange         1               2.045342386   0.69941044
vectoradd      1               5.447667037   1.513217941
extractint     1               1.655717606   2.671712185
float2int      1               3.1688        0.905406988
insertzeros    1               2.375043445   1.037504453

Is there an official benchmark suite for NumPy, or a more relevant workload to compare against CPython? What is NumPyPy's maturity / adoption rate, to your knowledge? The benchmarks used to collect the results are attached to this mail.

Regards,
Florin
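For reference, a microbenchmark kernel of the kind described above might look like the following. This is a hypothetical sketch in the spirit of the numpy.100 exercises, not the attached code; the function name and offsets are made up for illustration:

```python
import numpy as np

def make_cauchy(n):
    # Build an n x n Cauchy matrix C[i, j] = 1 / (x[i] - y[j])
    # using broadcasting, as in the numpy.100 exercises.
    x = np.arange(n, dtype=np.float64)
    y = x + 0.5  # offset so x[i] - y[j] is never zero
    return 1.0 / (x[:, None] - y[None, :])

C = make_cauchy(4)
```

Timing such a kernel exercises almost exclusively numpy's own loops, which is relevant to how these results should be interpreted (see Matti's reply below).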

Hi, On Wed, 27 Jul 2016, Papa, Florin wrote:
After having a brief look at your table, I'm very confused by this assessment. To me, it seems that PyPy NumPyPy is equal to or significantly faster than CPython NumPy on most benchmarks, and substantially slower on just a few of them. PyPy NumPy is slower than CPython NumPy on all benchmarks, with some being not that bad and some pretty bad, but this is absolutely to be expected, and in fact still very impressive, considering that it runs via CPyExt... Am I completely misinterpreting your numbers?!

--
Sincerely yours,
Yury V. Zaytsev

Hi Yury,

The table contains run time values, normalized to the CPython NumPy results. This means that a value of 1 is equal to the CPython NumPy result, less than 1 means faster than CPython NumPy, and more than 1 means slower than CPython NumPy. Let's consider the following line in the table:

Benchmark   CPython NumPy   PyPy NumPy    PyPy NumPyPy
cauchy      1               5.838852812   4.866947551

Here, PyPy NumPy is 5.83 times slower than CPython NumPy, and PyPy NumPyPy is 4.86 times slower than CPython NumPy. I hope this makes the results table clearer.

Regards,
Florin
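The normalization described above can be sketched in a couple of lines; the raw timings here are made up for illustration, not taken from the attached benchmarks:

```python
# Hypothetical raw run times in seconds for a single benchmark.
raw = {"CPython NumPy": 0.80, "PyPy NumPy": 4.67, "PyPy NumPyPy": 3.89}

# Normalize every run time against the CPython NumPy baseline.
baseline = raw["CPython NumPy"]
normalized = {impl: t / baseline for impl, t in raw.items()}
# The baseline is always 1.0; values above 1 are slower than
# CPython NumPy, and values below 1 are faster.
```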

Hi Florin, On Wed, 27 Jul 2016, Papa, Florin wrote:
Thank you for the explanation! I think this supports my assessment, though, as I can't see how your conclusion can be justified on the basis of this table: "NumPyPy performance seems to be significantly slower compared to CPython NumPy or even PyPy NumPy". In fact, NumPyPy performance seems to be significantly *faster* compared to CPython NumPy and, in any case, PyPy NumPy (with the exception of a few benchmarks, such as "cauchy", which should be investigated). I'd also be very curious whether you have tried the vectorizer already, or whether these results were obtained without it.

--
Sincerely yours,
Yury V. Zaytsev

I am sorry, I mistakenly switched the header of the table; the middle column is actually the result for PyPy NumPyPy. The correct table is this:

Benchmark      CPython NumPy   PyPy NumPyPy   PyPy NumPy
cauchy         1               5.838852812    4.866947551
pointbypoint   1               4.922654347    0.981008211
numrand        1               2.478997019    1.082185897
rowmean        1               2.512893263    1.062233015
dsums          1               33.58240465    1.013388981
vectsum        1               1.738446611    0.771660704
cauchy         1               2.168377906    0.887388291
polarcoords    1               1.030962402    0.500905427
vectsort       1               2.214586698    0.973727924
arange         1               2.045342386    0.69941044
vectoradd      1               5.447667037    1.513217941
extractint     1               1.655717606    2.671712185
float2int      1               3.1688         0.905406988
insertzeros    1               2.375043445    1.037504453

The results were gathered without vectorization; I will provide the results with vectorization as soon as I have them. Sorry again for the mistake.

Regards,
Florin

On 27/07/2016 3:35 AM, Papa, Florin wrote:
There is no official numpy benchmark, since there is really no "typical" numpy workload. Numpy is used as a common container for data processing, and each field has its own cases that interest it. For instance, a workload done by CAFFE for neural network processing is much different from one done by OpenCV for image processing, which in turn is different from the natural language processing done in NLTK, even though for the most part all three of these use numpy.

There are a few numpy benchmarks available:

https://github.com/serge-sans-paille/numpy-benchmarks (needs to be adapted to pypy's slow warmup time)
http://yarikoptic.github.io/numpy-vbench (also, AFAICT, never run on PyPy)
https://bitbucket.org/mikefc/numpy-benchmark.git

I would expect numpypy to shine in cases where there is heavy use of python together with numpy. Your benchmarks are at the other extreme; they demonstrate that our reimplementation of the numpy looping ufuncs is slower than C, but do not test the python-numpy interaction, nor how well the JIT can optimize python code using numpy. For your tests, Richard's suggestion of turning on vectorization may show a large improvement, as it brings numpypy's optimizations closer to the ones done by a good C compiler. But even so, it is impressive that without vectorization we are only 2-4 times slower than the heavily vectorized C implementation, and that the cpyext emulation layer seems not to matter that much in your benchmarks.

In general, timeit does a bad job for pypy benchmarks, since it does not allow for warmup time and is geared to measure a minimum. Your data demonstrates some of the pitfalls of benchmarking; note that you show two very different results for your "cauchy" benchmark. You may want to check out the perf module (http://perf.readthedocs.io) for a more sophisticated way of running benchmarks, or read https://arxiv.org/abs/1602.00602, which summarizes the pitfalls of benchmarking.
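The warmup problem described above can be sketched as follows. This is a minimal hand-rolled harness under assumed defaults, not the perf module's actual API:

```python
import time

def bench(fn, warmup=5, repeats=20):
    # Discard the first iterations so a JIT has a chance to warm up
    # and compile the hot loops before measurement starts.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    # Report the mean and the spread rather than just the minimum,
    # which timeit would encourage.
    mean = sum(samples) / len(samples)
    return mean, min(samples), max(samples)

mean, lo, hi = bench(lambda: sum(range(10000)))
```

On a JIT the minimum over many repeats mostly measures the best-case compiled path, while the spread between `lo` and `hi` exposes warmup and run-to-run variation.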
In order to continue this discussion, could you create a repository with these benchmarks and a set of instructions on how to reproduce them? You do not say what platform you used, what machine you ran the tests on, whether you used MKL/BLAS, what versions of pypy and cpython you used, ... Once we have a conveniently reproducible way to have this conversation, we may be able to make progress toward reaching some operative conclusions, but I'm not sure a mailing list is the best place these days.

Matti

Hi Matti,

Thank you for your reply and for indicating additional numpy benchmarks. Please see below the results obtained with vectorization turned on. It seems that vectorization significantly improves run time for some benchmarks (matrixmul, vectoradd, float2int).

Benchmark      CPython NumPy   PyPy NumPyPy   PyPy NumPy    PyPy NumPyPy vectorized
matrixmul      1               5.838852812    4.866947551   3.332052386
pointbypoint   1               4.922654347    0.981008211   4.917323386
numrand        1               2.478997019    1.082185897   2.486596082
rowmean        1               2.512893263    1.062233015   2.531627012
dsums          1               33.58240465    1.013388981   33.73959105
vectsum        1               1.738446611    0.771660704   1.651790546
cauchy         1               2.168377906    0.887388291   1.789566808
polarcoords    1               1.030962402    0.500905427   1.031192576
vectsort       1               2.214586698    0.973727924   2.205043894
arange         1               2.045342386    0.69941044    2.064583705
vectoradd      1               5.447667037    1.513217941   4.838760016
extractint     1               1.655717606    2.671712185   1.633729987
float2int      1               3.1688         0.905406988   2.764488512
insertzeros    1               2.375043445    1.037504453   2.145735211
The benchmarks I wrote do indeed stress numpy alone, leaving out python and the cpyext emulation layer. I realize that this is not a realistic scenario for real life workloads, which is why I am interested in more representative workloads that have high visibility and can emphasize the advantages of PyPy. I will look at the benchmarking links you indicated to find more suitable workloads and a better benchmarking methodology. Also, I corrected the "cauchy" issue; the first row was actually matrix multiplication.
Creating a public repository with the benchmarks can be a time consuming procedure due to internal methodologies, but please find attached the benchmarks and the Python script used to run them (num_perf.py, similar to perf.py in CPython's benchmark suite). In order to obtain a csv file with the benchmark results, please follow these steps:

unzip numpy_benchmark.zip
cd numpy_benchmark
python num_perf.py -b all /path/to/python1 /path/to/python2

I do not seem to have MKL on my system, but I do have lapack/blas runtimes installed, as recommended here [1]. I am running Ubuntu 16.04 LTS on an 18-core Intel(R) Xeon(R) (Haswell) CPU E5-2699 v3 @ 2.30GHz.

BIOS settings:
Intel Turbo Boost Technology: false
Hyper-Threading: false

OS configuration:
CPU frequency fixed at 2.3GHz by:
echo 2300000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
echo 2300000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
Address Space Layout Randomization (ASLR) disabled (to reduce run-to-run variation) by:
echo 0 > /proc/sys/kernel/randomize_va_space

CPython version is 2.7.11+, the default that comes with Ubuntu 16.04 LTS. PyPy version is 5.3.1; I downloaded an already compiled binary (7e8df3df9641, Jun 14 2016, 13:58:02). We can continue this discussion any place you consider suitable (if the mailing list is not the place for this).

[1] https://bitbucket.org/pypy/numpy/overview

Regards,
Florin

On 28/07/2016 8:05 AM, Papa, Florin wrote:
The zip file is not a very iteration-friendly format for improving the benchmarks; how can I contribute to your work?

- There should be some kind of shell script that downloads and installs the packages from a known source, so anyone else can reproduce the results
- You should add np.__config__.show() to the scripts, so the output reflects the external libraries used
- Examine other suites to find how they display the basic python/computer/environment variables in use when you run the benchmarks
- How many cores are in use? How much memory?
- You should check the results; for instance, AFAICT the dsums benchmark does not run to completion on numpypy, since bincount is not implemented

Matti

We usually hang out on IRC, you can find me there most evenings European time.
Thank you for your feedback. I started the process of making the benchmarks open source, so that we can easily collaborate. Until then, I will make the modifications you suggested. Regards, Florin

Hi, On 27 July 2016 at 10:35, Papa, Florin <florin.papa@intel.com> wrote:
I am sorry, I mistakenly switched the header of the table, the middle column is actually the result for PyPy NumPyPy.
The resulting table makes sense to me: it shows that PyPy NumPy (with cpyext) is, in most cases, running at the same speed as CPython NumPy, and the rare exceptions can be guessed to occur because those benchmarks happen to invoke a much larger number of CPython C API calls than all the others. The table also shows that PyPy NumPyPy is really slower, even with vectorization enabled. It seems that the current focus of our work, on continuing to improve cpyext instead of numpypy, is a good idea.

A bientôt,
Armin.

Hi, On 1 August 2016 at 10:24, Maciej Fijalkowski <fijall@gmail.com> wrote:
Yes, and your benchmarks reinforce the impression that numpy-via-cpyext is faster in a lot of cases. Moreover it is more compatible with CPython's numpy, because supporting it fully is "only" a matter of us improving the general cpyext compatibility layer. Some benchmarks like "extractint" show cases where numpy-via-cpyext suffers from high levels of crossing the cpyext boundary. As fijal says we want to ultimately add some things from numpypy into numpy-via-cpyext, maybe by patching or special-casing some methods like ndarray.__getitem__ after the module is imported. By the way, it would make a cool project for someone new to the pypy code base (<= still trying to recruit help in making numpy, although it turned out to be very difficult in the past). A bientôt, Armin.

Hi,
Is there an official benchmark suite for NumPy or a more relevant workload to compare against CPython? What is NumPyPy's maturity / adoption rate from your knowledge?
I do not think there is. I have been looking for something similar for over a year. It seems, though, that people tend to make their own benchmarks for their own JIT compiler, stressing their own optimizations. (Well, I did too, for my thesis.) That said, it would be a beneficial task to sit down and extract such a benchmark set, not targeting a specific JIT/AOT compiler, but rather thinking about real world application workloads.
I agree with Yury: there are 2-3 benchmarks for NumPyPy where it performs worse than cpython. All others are not significant. Have you tried turning on the beta version of the vectorizer in NumPyPy? (The command is: $ pypy --jit vec=1 program.py args)

Cheers,
Richard

participants (6)
- Armin Rigo
- Maciej Fijalkowski
- Matti Picus
- Papa, Florin
- Richard Plangger
- Yury V. Zaytsev