PGO Optimized Binary

Hi All,

We applied a compiler-assisted optimization technique called PGO (Profile Guided Optimization) while building PyPy, and found that performance improved by up to 22.4% on the Grand Unified Python Benchmark (GUPB) from "hg clone https://hg.python.org/benchmarks". The result table below shows that the majority of the 51 microbenchmarks got a performance boost, while 8 regressed.

Benchmark              Baseline    PGO       Perf Delta %
hg_startup             0.0160      0.0124    22.4
2to3                   6.1157      5.1978    15.0
html5lib               4.9263      4.1961    14.8
formatted_logging      0.0463      0.0399    13.9
regex_v8               0.1394      0.1206    13.5
simple_logging         0.0328      0.0289    11.9
html5lib_warmup        2.5411      2.2939    9.7
bzr_startup            0.0686      0.0621    9.6
unpack_sequence        0.0001      0.0001    8.6
normal_startup         0.8694      0.7983    8.2
regex_compile          0.0707      0.0657    7.0
json_load              0.2924      0.2734    6.5
fastpickle             1.7315      1.6290    5.9
tornado_http           0.0707      0.0665    5.8
pickle_list            1.8614      1.7897    3.9
slowunpickle           0.0260      0.0250    3.8
slowpickle             0.0336      0.0323    3.7
telco                  0.0194      0.0187    3.7
pathlib                0.0171      0.0165    3.2
go                     0.1069      0.1036    3.1
slowspitfire           0.2624      0.2547    2.9
etree_generate         0.1037      0.1008    2.8
silent_logging         0.0000      0.0000    2.8
pickle_dict            3.2698      3.1796    2.8
spambayes              0.0581      0.0566    2.6
startup_nosite         0.5691      0.5549    2.5
chameleon_v2           2.7629      2.7009    2.2
etree_parse            0.5610      0.5505    1.9
etree_process          0.0725      0.0712    1.9
regex_effbot           0.0377      0.0371    1.7
fastunpickle           0.8521      0.8382    1.6
float                  0.0171      0.0169    0.9
pidigits               0.3833      0.3801    0.8
call_method_unknown    0.0123      0.0122    0.6
hexiom2                15.8354     15.7533   0.5
etree_iterparse        0.2102      0.2094    0.4
chaos                  0.0089      0.0088    0.2
spectral_norm          0.0099      0.0099    0.2
call_simple            0.0102      0.0102    0.1
mako_v2                0.0204      0.0204    0.1
fannkuch               0.2262      0.2260    0.1
unpickle_list          0.6448      0.6449    0.0
call_method_slots      0.0106      0.0106    0.0
call_method            0.0106      0.0106    -0.1
raytrace               0.0210      0.0210    -0.2
richards               0.0042      0.0043    -1.6
json_dump_v2           0.9288      0.9501    -2.3
django_v3              0.0551      0.0570    -3.4
meteor_contest         0.0984      0.1021    -3.8
nbody                  0.0446      0.0463    -3.8
nqueens                0.0498      0.0525    -5.4
Average                                      3.6

We'd like some input on how to contribute our optimization recipe to the PyPy dev tree, perhaps by creating an item in the PyPy issue tracker? In addition, we would also appreciate any other benchmark, or real-world workload, as an alternative way to evaluate this.

Thanks,
Yash
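For readers unfamiliar with the technique: a GCC-style PGO build is typically a three-step cycle, sketched below. This is a generic illustration under stated assumptions, not the exact recipe used above (the thread does not include it); the compiler invocation and training workload are placeholders.

    # Step 1: build with instrumentation so the binary records execution counts
    cc -O2 -fprofile-generate -o pypy-c-instrumented <sources>

    # Step 2: run a representative training workload; this writes .gcda profile files
    ./pypy-c-instrumented training_workload.py

    # Step 3: rebuild; the compiler uses the collected profiles to optimize hot paths
    cc -O2 -fprofile-use -o pypy-c <sources>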

Hi, On 31 October 2016 at 22:28, Singh, Yashwardhan <yashwardhan.singh@intel.com> wrote:
> We applied a compiler-assisted optimization technique called PGO (Profile Guided Optimization) while building PyPy, and found that performance improved by up to 22.4% on the Grand Unified Python Benchmark (GUPB) from "hg clone https://hg.python.org/benchmarks". The result table below shows that the majority of the 51 microbenchmarks got a performance boost, while 8 regressed.
The kind of performance improvement you are measuring involves only short- or very short-running programs. A few years ago we'd have shrugged it off as irrelevant---"please modify the benchmarks so that they run for at least 10 seconds, more if they are larger"---because the JIT compiler doesn't have a chance to warm up. But we'd also have shrugged off your whole attempt---"PGO optimization cannot change anything in the speed of JIT-produced machine code".

Nowadays we tend to look more seriously at cold or warming-up performance too, or at least we know that we should look there. There are (stalled) plans to set up a second benchmark suite for PyPy which focuses on this.

You can get an estimate of whether you're looking at cold or hot code: compare the timings with CPython. Also, you can set the environment variable ``PYPYLOG=jit-summary:-`` and look at the first 2 lines to see how much time was spent warming up the JIT (or attempting to).

Note that we did enable PGO long ago, with modest benefits. We gave up when our JIT compiler became good enough. Maybe now is the time to try again (and also, PGO itself might have improved in the meantime).
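Concretely, the two checks suggested above can be run like this (the script name is a placeholder):

    # 1. Compare timings with CPython: if PyPy is not clearly faster,
    #    you are likely measuring cold or warming-up code
    time python myscript.py
    time pypy myscript.py

    # 2. Make the JIT print a summary at exit; the first lines report
    #    how much time was spent tracing/compiling, i.e. warming up
    PYPYLOG=jit-summary:- pypy myscript.py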
> We'd like some input on how to contribute our optimization recipe to the PyPy dev tree, perhaps by creating an item in the PyPy issue tracker?
The best would be to create a pull request so that we can look at your changes more easily.
> In addition, we would also appreciate any other benchmark, or real-world workload, as an alternative way to evaluate this.
You can take any Python program that runs either very briefly or no faster than CPython. For a larger example (with Python 2.7):

    cd rpython/jit/tl
    python ../../bin/rpython -O2 --source targettlr   # 24 secs
    pypy ../../bin/rpython -O2 --source targettlr     # 39 secs

A bientôt,

Armin.

Hi Armin,

Thanks for your feedback. We ran the example program you suggested for evaluation:

    cd rpython/jit/tl
    non-pgo-pypy ../../bin/rpython -O2 --source targettlr
    pgo-pypy ../../bin/rpython -O2 --source targettlr

We got the following results.

Non-PGO pypy:

    [Timer] Timings:
    [Timer] annotate                    --- 7.5 s
    [Timer] rtype_lltype                --- 5.8 s
    [Timer] backendopt_lltype           --- 3.6 s
    [Timer] stackcheckinsertion_lltype  --- 0.1 s
    [Timer] database_c                  --- 19.6 s
    [Timer] source_c                    --- 2.6 s
    [Timer] =========================================
    [Timer] Total:                      --- 39.2 s

PGO pypy:

    [Timer] Timings:
    [Timer] annotate                    --- 7.6 s
    [Timer] rtype_lltype                --- 5.1 s
    [Timer] backendopt_lltype           --- 3.1 s
    [Timer] stackcheckinsertion_lltype  --- 0.0 s
    [Timer] database_c                  --- 18.5 s
    [Timer] source_c                    --- 2.3 s
    [Timer] =========================================
    [Timer] Total:                      --- 36.6 s

The delta in performance between the two is about 7%. We are working on getting data to identify the percentage of interpreted vs. jitted code for both binaries. We are also working on creating a pull request to get better feedback on the change.

Regards,
Yash
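As a sanity check, the speedup implied by the two totals above:

    python -c 'print(39.2 / 36.6)'   # ~1.07, i.e. the PGO build is roughly 7% faster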

Hi,

7% on that is very good, if you can reproduce it across multiple runs (I would expect pretty high variance). You can also try running with --jit off. This gives you an indication of the speed of the interpreter, which is part of warmup.
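Both suggestions in command form (the script name is again a placeholder):

    # repeat the measurement to gauge run-to-run variance
    for i in 1 2 3; do time pypy myscript.py; done

    # run with the JIT disabled to isolate raw interpreter speed
    pypy --jit off myscript.py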
participants (3)
- Armin Rigo
- Maciej Fijalkowski
- Singh, Yashwardhan