Profile Guided Optimization active by-default

Hi All,

This is Alecsandru from the Server Scripting Languages Optimization team at Intel Corporation.

I would like to submit a request to turn on Profile Guided Optimization (PGO) as the default build option for Python (both 2.7 and 3.6), given its performance benefits on a wide variety of workloads and hardware. For instance, as shown in the attached sample performance results from the Grand Unified Python Benchmark, a speedup of more than 20% was observed on some benchmarks. In addition, we are seeing a 2-9% performance boost from OpenStack/Swift, where more than 60% of the code is in Python 2.7. Our analysis indicates that the performance gain is mainly due to a reduction in icache misses and CPU front-end stalls.

Attached are the Makefile patches that modify the "all" build target and add a new one called "disable-profile-opt". We built and tested this patch for Python 2.7 and 3.6 on our Linux machines (CentOS 7/Ubuntu Server 14.04, Intel Xeon Haswell/Broadwell with 18/8 cores). We use the "regrtest" suite for training, as it provides the best performance improvement. Some of the test programs in the suite may fail, which causes the build to fail. One solution is to disable the specific failing test using the "-x " flag (as shown in the patch).

Steps to apply the patch:
1. hg clone https://hg.python.org/cpython cpython
2. cd cpython
3. hg update 2.7 (needed for 2.7 only)
4. Copy *.patch to the current directory
5. patch < python2.7-pgo.patch (or patch < python3.6-pgo.patch)
6. ./configure
7. make

To disable PGO:
7b. make disable-profile-opt

In the following, please find our sample performance results from our latest Xeon machine, a Xeon Broadwell EP.

Hardware (HW): Intel XEON (Broadwell) 8 Cores

BIOS settings:
    Intel Turbo Boost Technology: false
    Hyper-Threading: false

Operating System: Ubuntu 14.04.3 LTS trusty

OS configuration:
    CPU freq set at fixed: 2.6GHz by
        echo 2600000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
        echo 2600000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
    Address Space Layout Randomization (ASLR) disabled (to reduce run to run variation) by
        echo 0 > /proc/sys/kernel/randomize_va_space

GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)

Benchmark: Grand Unified Python Benchmark (GUPB)
GUPB Source: https://hg.python.org/benchmarks/

Python2.7 results:
    Python source: hg clone https://hg.python.org/cpython cpython
    Python Source: hg update 2.7
    hg id: 0511b1165bb6 (2.7)
    hg id -r 'ancestors(.) and tag()': 15c95b7d81dc (2.7) v2.7.10
    hg --debug id -i: 0511b1165bb6cf40ada0768a7efc7ba89316f6a5

    Benchmarks            Speedup(%)
    simple_logging        20
    raytrace              20
    silent_logging        19
    richards              19
    chaos                 16
    formatted_logging     16
    json_dump             15
    hexiom2               13
    pidigits              12
    slowunpickle          12
    django_v2             12
    unpack_sequence       11
    float                 11
    mako                  11
    slowpickle            11
    fastpickle            11
    django                11
    go                    10
    json_dump_v2          10
    pathlib               10
    regex_compile         10
    pybench               9.9
    etree_process         9
    regex_v8              8
    bzr_startup           8
    2to3                  8
    slowspitfire          8
    telco                 8
    pickle_list           8
    fannkuch              8
    etree_iterparse       8
    nqueens               8
    mako_v2               8
    etree_generate        8
    call_method_slots     7
    html5lib_warmup       7
    html5lib              7
    nbody                 7
    spectral_norm         7
    spambayes             7
    fastunpickle          6
    meteor_contest        6
    chameleon             6
    rietveld              6
    tornado_http          5
    unpickle_list         5
    pickle_dict           4
    regex_effbot          3
    normal_startup        3
    startup_nosite        3
    etree_parse           2
    call_method_unknown   2
    call_simple           1
    json_load             1
    call_method           1

Python3.6 results:
    Python source: hg clone https://hg.python.org/cpython cpython
    hg id: 96d016f78726 tip
    hg id -r 'ancestors(.) and tag()': 1a58b1227501 (3.5) v3.5.0rc1
    hg --debug id -i: 96d016f78726afbf66d396f084b291ea43792af1

    Benchmark             Speedup(%)
    fastunpickle          22.94
    fastpickle            21.67
    json_load             17.64
    simple_logging        17.49
    meteor_contest        16.67
    formatted_logging     15.33
    etree_process         14.61
    raytrace              13.57
    etree_generate        13.56
    chaos                 12.09
    hexiom2               12
    nbody                 11.88
    json_dump_v2          11.24
    richards              11.02
    nqueens               10.96
    fannkuch              10.79
    go                    10.77
    float                 10.26
    regex_compile         9.8
    silent_logging        9.63
    pidigits              9.58
    etree_iterparse       9.48
    2to3                  8.44
    regex_v8              8.09
    regex_effbot          7.88
    call_simple           7.63
    tornado_http          7.38
    etree_parse           4.92
    spectral_norm         4.72
    normal_startup        4.39
    telco                 3.88
    startup_nosite        3.7
    call_method           3.63
    unpack_sequence       3.6
    call_method_slots     2.91
    call_method_unknown   2.59
    iterative_count       0.45
    threaded_count        -2.79

Thank you,
Alecsandru
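To put the ">20%" figure in context, the Python 3.6 table above can be summarized with a few lines of Python. This is only a sketch: the numbers are copied from the table as posted, while the names `SPEEDUP_36` and `summarize` are made up for illustration and are not part of the patch or of the benchmark suite.

```python
from statistics import mean, median

# Speedup (%) per benchmark, copied from the Python 3.6 table in the message above.
SPEEDUP_36 = {
    "fastunpickle": 22.94, "fastpickle": 21.67, "json_load": 17.64,
    "simple_logging": 17.49, "meteor_contest": 16.67, "formatted_logging": 15.33,
    "etree_process": 14.61, "raytrace": 13.57, "etree_generate": 13.56,
    "chaos": 12.09, "hexiom2": 12, "nbody": 11.88, "json_dump_v2": 11.24,
    "richards": 11.02, "nqueens": 10.96, "fannkuch": 10.79, "go": 10.77,
    "float": 10.26, "regex_compile": 9.8, "silent_logging": 9.63,
    "pidigits": 9.58, "etree_iterparse": 9.48, "2to3": 8.44, "regex_v8": 8.09,
    "regex_effbot": 7.88, "call_simple": 7.63, "tornado_http": 7.38,
    "etree_parse": 4.92, "spectral_norm": 4.72, "normal_startup": 4.39,
    "telco": 3.88, "startup_nosite": 3.7, "call_method": 3.63,
    "unpack_sequence": 3.6, "call_method_slots": 2.91,
    "call_method_unknown": 2.59, "iterative_count": 0.45,
    "threaded_count": -2.79,
}


def summarize(speedups):
    """Return simple summary statistics for a {benchmark: speedup %} mapping."""
    values = sorted(speedups.values())
    return {
        "count": len(values),
        "min": values[0],
        "max": values[-1],
        "mean": mean(values),
        "median": median(values),
        # Benchmarks that actually exceed the 20% mark.
        "above_20": sorted(name for name, v in speedups.items() if v > 20),
        # Benchmarks that got slower with PGO.
        "regressions": sorted(name for name, v in speedups.items() if v < 0),
    }


if __name__ == "__main__":
    for key, value in summarize(SPEEDUP_36).items():
        print(key, value)
```

On these numbers only fastunpickle and fastpickle exceed 20%, the median is roughly 9.7%, and threaded_count is the single regression.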

How about we first add a new Makefile target that enables PGO, without turning it on by default? Then later we can enable it by default.

Also, I have my doubts about regrtest. How sure are we that it represents a typical Python load? Tests are often using a different mix of operations than production code.

On Sat, Aug 22, 2015 at 7:46 AM, Patrascu, Alecsandru <alecsandru.patrascu@intel.com> wrote:

--
--Guido van Rossum (python.org/~guido)

On Sat, Aug 22, 2015, 09:17 Guido van Rossum <guido@python.org> wrote:
> How about we first add a new Makefile target that enables PGO, without turning it on by default? Then later we can enable it by default.

I agree. Updating the Makefile so it's easier to use PGO is great, but we should do a release with it as opt-in and go from there.

> Also, I have my doubts about regrtest. How sure are we that it represents a typical Python load? Tests are often using a different mix of operations than production code.

That was also my question. You said that "it provides the best performance improvement", but compared to what; what else was tried? And what difference does it make to e.g. a Django app that is trained on their own simulated workload compared to using regrtest? IOW is regrtest displaying the best across-the-board performance because it stresses the largest swath of Python and thus catches generic patterns in the code but individuals could get better performance with a simulated workload?

-Brett

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/brett%40python.org

This target replaces the existing one in the CPython Makefile, which currently uses a quick run of pybench; the resulting binary does not perform well on general Python loads.

I don't think it is a good idea to add a by-default target that does PGO on dedicated workloads, like Django, because then it will perform better on that particular load and poorly on others.

Of course, if any user has a dedicated workload for which he or she wants to get the most benefit from PGO, he or she will have to run that training separately from the proposed one. Our proposal targets the broader audience that uses Python in various scenarios, and they will see an overall improvement after compiling Python from sources.

Alecsandru

On Sat, Aug 22, 2015, 09:58 Patrascu, Alecsandru <alecsandru.patrascu@intel.com> wrote:
> This target replaces the existing one in the CPython Makefile, which currently uses a quick run of pybench; the resulting binary does not perform well on general Python loads.
>
> I don't think it is a good idea to add a by-default target that does PGO on dedicated workloads, like Django, because then it will perform better on that particular load and poorly on others.

Sorry for not being clearer, but I was not suggesting that the default be for Django, just asking whether it is worth making the Makefile easier to work with when generating a PGO build for a custom workload.

If we already have a rule that uses pybench then it should definitely be changed to use regrtest (and honestly pybench should not be used for benchmarking anything since it doesn't reflect real world usage in any way; it's just for quick checks while doing development on the core of Python and otherwise shouldn't be used to measure anything substantial).

> Of course, if any user has a dedicated workload for which he or she wants to get the most benefit from PGO, he or she will have to run that training separately from the proposed one. Our proposal targets the broader audience that uses Python in various scenarios, and they will see an overall improvement after compiling Python from sources.

Right, but my question was whether there was any benefit to making the Makefile rules generic to make building PGO binaries easier for people who do want to do a custom profile and it sounds like it isn't worth the effort.

So I'm with Guido where I'm happy to see the build rules added/updated to use regrtest for a PGO build but have it be an opt-in flag and not on by default (at least for now).

-Brett

On Sat, Aug 22, 2015 at 9:27 AM Brett Cannon <brett@python.org> wrote:

There already is one and has been for many years: make profile-opt. I even set up a buildbot for it last year.

The problem with the existing profile-opt build in our default Makefile.in is that it uses a horrible profiling workload (pybench, ugh) so it leaves a lot of improvements behind.

What all Linux distros (Debian/Ubuntu and Redhat at least; nothing else matters) do for their Python builds is to use profile-opt but they replace the profiling workload with a stable set of the Python unittest suite itself. Results are much better all around. Generally a 20% speedup.

Anyone deploying Python who is *not* using a profile-opt build is wasting CPU resources.

Whether it should be *the default* or not *is a different question*. The Makefile is optimized for CPython developers who certainly do not want to run two separate builds and a profile-opt workload every time they type make to test out their changes. But all binary release builds should use it.

> I agree. Updating the Makefile so it's easier to use PGO is great, but we

This isn't something to argue about. Just use regrtest and compare the before and after with the benchmark suite. It really does exercise things well. People like to fear that it'll produce code optimized for the test suite itself or something. No. Python as an interpreter is very realistically exercised by running the test suite, as it is simply running a lot of code, and a good variety of it, including the extension modules that benefit most, such as regexes, pickle, json, xml, etc.

Thomas tried the test suite and a variety of other workloads when looking at what to use at work. The testsuite works out generally the best. Going beyond that seems to be a wash.

What we tested and decided to use on our own builds after benchmarking at work was to build with:

    make profile-opt PROFILE_TASK="-m test.regrtest -w -uall,-audio -x test_gdb test_multiprocessing"

In general if a test is unreliable or takes an extremely long time, exclude it for your sanity. (I'd also kick out test_subprocess on 2.7; we replaced subprocess with subprocess32 in our build so that wasn't an issue)

-gps
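For anyone scripting such a build, the command above can be assembled from a list of tests to exclude. The following is a minimal sketch, assuming an already-configured CPython checkout; the `profile-opt` target and the `PROFILE_TASK` variable come from the message above, while `profile_opt_command` and `run_profile_opt` are hypothetical helper names invented for this example.

```python
import shlex
import subprocess


def profile_opt_command(excluded_tests=("test_gdb", "test_multiprocessing"),
                        resources="all,-audio"):
    """Build the argv for `make profile-opt` with a regrtest training run.

    The defaults reproduce the PROFILE_TASK quoted in the message above.
    """
    parts = ["-m", "test.regrtest", "-w", "-u" + resources]
    if excluded_tests:
        # -x tells regrtest to skip the named tests (unreliable or very slow ones).
        parts += ["-x", *excluded_tests]
    # make accepts VAR=value on its command line; the spaces stay inside one argv item.
    return ["make", "profile-opt", "PROFILE_TASK=" + " ".join(parts)]


def run_profile_opt(source_dir, **kwargs):
    """Run the PGO build inside source_dir (hypothetical helper, not part of CPython)."""
    return subprocess.run(profile_opt_command(**kwargs), cwd=source_dir, check=True)


if __name__ == "__main__":
    # Print the command in copy-pasteable shell form instead of running it.
    print(" ".join(shlex.quote(arg) for arg in profile_opt_command()))
```

Excluding a further test, for example test_subprocess on 2.7, is then just a matter of passing a longer `excluded_tests` tuple.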

On 25 August 2015 at 05:52, Gregory P. Smith <greg@krypto.org> wrote:
Having the "production ready" make target be "make profile-opt" doesn't strike me as the most intuitive thing in the world. I agree we want the "./configure && make" sequence to be oriented towards local development builds rather than highly optimised production ones, so perhaps we could provide a "make production" target that enables PGO with an appropriate training set from regrtest, and also complains if "--with-pydebug" is configured? Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
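The "complain if --with-pydebug is configured" part of this idea can be illustrated from the Python side. This is only a sketch of the check, not how a Makefile target would implement it: it assumes the build exposes the `Py_DEBUG` configuration variable through `sysconfig` (as POSIX builds do), and `is_pydebug_build` and `check_production_build` are hypothetical names.

```python
import sysconfig


def is_pydebug_build(config_vars=None):
    """Return True if the build configuration reports a --with-pydebug build."""
    if config_vars is None:
        # Use the configuration of the interpreter running this script.
        config_vars = sysconfig.get_config_vars()
    # Py_DEBUG is 1 for debug builds; it may be 0 or missing otherwise.
    return bool(config_vars.get("Py_DEBUG"))


def check_production_build(config_vars=None):
    """Raise if the configuration looks like a debug build (hypothetical guard)."""
    if is_pydebug_build(config_vars):
        raise RuntimeError("refusing production build: configured --with-pydebug")
    return True


if __name__ == "__main__":
    print("pydebug build:", is_pydebug_build())
```

Passing a plain dict as `config_vars` makes the check easy to exercise without rebuilding the interpreter.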

On Tue, Aug 25, 2015 at 11:17 AM, Brett Cannon <brett@python.org> wrote:
You need to be careful there. In my environment, I interface with a lot of Boost.Python-wrapped code which would be quite impractical to compile with --with-pydebug. I'd like to be able to throw in all the other development bells and whistles though, without changing the size of the object header. Maybe "develop-lite"? <whatever happened to wink?>-ly, y'rs, Skip

Pardon me if I'm not in the right place to ask the following naive question (tell me if that's the case). Are the performance improvements from Profile Guided Optimization specific to the chip on which the build is done, or do they carry over to a larger set of chips?

PGO is unrelated to the particular CPU the profiling is done on. (It is conceivable that it'd make a small difference but I've never observed that in practice.)

Indeed, as Gregory rightly mentioned, PGO is unrelated to the particular CPU on which the profiling is done.

On Mon, Aug 24, 2015, 11:19 PM Nick Coghlan <ncoghlan@gmail.com> wrote:
> On 25 August 2015 at 05:52, Gregory P. Smith <greg@krypto.org> wrote:
>
> Having the "production ready" make target be "make profile-opt" doesn't strike me as the most intuitive thing in the world.
>
> I agree we want the "./configure && make" sequence to be oriented towards local development builds rather than highly optimised production ones, so perhaps we could provide a "make production" target that enables PGO with an appropriate training set from regrtest, and also complains if "--with-pydebug" is configured?
>
> Regards, Nick.
>
> -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Agreed. Also, printing a message out at the end of a default make all build suggesting people use make production for additional performance instead might help advertise it.

make install could possibly depend on make production as well?

The current pgo target just uses a very specific task to train for the feedback. For my Debian/Ubuntu builds I'm using the testsuite minus some problematic tests to train. OTOH I don't know if this is the best way to do it, but it gave better results at some point in the past.

What I would like is a benchmark / a mixture of benchmarks on which to enable pgo/pdo. Based on that you could enable pgo based on some static decisions based on autofdo. For that you don't need any profile runs during your build; it just needs shipping the autofdo outcome together with a Python release. This doesn't give you the same performance as a GCC pgo build, but it would be a first step.

And defining the probe for any pgo build would be welcome too.

Matthias

On 08/22/2015 06:25 PM, Brett Cannon wrote:

Hello and thank you for your feedback. We have also measured the PGO gain using other workloads. Our initial choice for this optimization was pybench, but the speedup obtained was lower than with regrtest and it didn't cover a lot of Python scenarios. Instead, regrtest has a uniform distribution of tests, and the resulting binary is overall much faster than the default one or one trained using other workloads, thus covering a larger pool of Python loads. This optimization was also tested in production environments running OpenStack Swift and got up to 9% improvement. The reason we proposed this target to be always on is that the resulting optimized binary is better out of the box for the general case. Alecsandru From: gvanrossum@gmail.com [mailto:gvanrossum@gmail.com] On Behalf Of Guido van Rossum Sent: Saturday, August 22, 2015 7:15 PM To: Patrascu, Alecsandru Cc: python-dev@python.org Subject: Re: [Python-Dev] Profile Guided Optimization active by-default How about we first add a new Makefile target that enables PGO, without turning it on by default? Then later we can enable it by default. Also, I have my doubts about regrtest. How sure are we that it represents a typical Python load? Tests are often using a different mix of operations than production code.

I'm sorry, but we're just not going to turn this on by default without doing a trial period ourselves. Your (and Intel's) contribution is very welcome, but in order to establish trust in a feature like this, an optional trial period is absolutely required. Regarding the training set, I agree that regrtest sounds to be better than pybench. If we make this an opt-in change, we can experiment with different training sets easily. (Also, I haven't seen the patch yet, but I presume it's easy to use a different training set? Experimentation should be encouraged.) On Sat, Aug 22, 2015 at 9:40 AM, Patrascu, Alecsandru < alecsandru.patrascu@intel.com> wrote:
-- --Guido van Rossum (python.org/~guido)

A trial period in which the provided patches are tested on numerous other Python loads is welcome, to be sure that it works as presented. Yes, it is easy to change it to use a different training set, or subsets of regrtest, by adding parameters to the line inside the Makefile that runs it. Currently, the attached patches run the full regrtest suite. Alecsandru
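The point about swapping the training set can be sketched in a few lines of Python. This is a minimal illustration and not the contents of the patch: the helper `regrtest_args` and the test names passed to it are hypothetical, and the only regrtest option it relies on is the `-x` flag mentioned earlier in the thread.

```python
def regrtest_args(excluded=(), selected=()):
    """Build the interpreter arguments for a regrtest training run.

    Hypothetical helper for illustration; not taken from the patch.
    `-m test` runs the regression suite on Python 3 (Python 2.7 uses
    `-m test.regrtest`). With `-x`, the names that follow are the tests
    to skip; without it, they are the tests to run.
    """
    if excluded and selected:
        raise ValueError("choose either a subset to run or tests to exclude")
    args = ["-m", "test"]
    if excluded:
        args += ["-x"] + list(excluded)
    else:
        args += list(selected)
    return args


if __name__ == "__main__":
    # Full suite minus two tests assumed to fail on the build machine.
    print(" ".join(regrtest_args(excluded=["test_gdb", "test_hashlib"])))
    # Train on a hand-picked subset instead.
    print(" ".join(regrtest_args(selected=["test_json", "test_pickle"])))
```

Keeping the exclusion list as data makes it easy to try several training sets during an opt-in trial period and compare the resulting binaries.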

Guido van Rossum wrote on 22.08.2015 at 18:55:
It's just one command in one line, yes.
Experimentation should be encouraged.)
A well chosen training set can have a notable impact on PGO compiled code in general, and switching from pybench to regrtests should make such a difference. However, since CPython's overall performance is mostly determined by the interpreter loop, general object operations (getattr!) and the basic builtin types, of which the regression test suite makes plenty of use, it is rather unlikely that other training sets would provide substantially better performance for Python code execution. Note also that Ubuntu has shipped PGO builds based on the regrtests for years, and they seemed to be quite happy with it. Stefan

Stefan Behnel wrote on 22.08.2015 at 19:25:
Note that this doesn't mean that it's a good workload for the C code in the standard library (and I guess that's why Alecsandru initially excluded the hashlib tests). Improvements on that front might still be possible. But it's certainly a good workload for all the rest, i.e. for executing general Python code. Stefan

Thank you, Stefan, for also pointing out the importance of regrtest as a good training set for building Python. Indeed, Ubuntu already delivers Python 2 and 3 binaries in its repositories that are optimized using PGO based on regrtest. Alecsandru

On Aug 22, 2015 9:02 AM, "Patrascu, Alecsandru" < alecsandru.patrascu@intel.com> wrote: [snip]
For instance, as shown from attached sample performance results from the Grand Unified Python Benchmark, >20% speed up was observed.
Are you referring to the tests in the benchmarks repo? [1] How does the real-world performance improvement compare with other languages you are targeting for optimization? And thanks for working on this! I have several more questions: What sorts of future changes in CPython's code might interfere with your optimizations? What future additions might stand to benefit? What changes in existing code might improve optimization opportunities? What is the added maintenance burden of the optimizations on CPython, if any? What is the performance impact on non-Intel architectures? What about older Intel architectures? ...and future ones? What is Intel's commitment to supporting these (or other) optimizations in the future? How is the practical EOL of the optimizations managed? Finally, +1 on adding an opt-in Makefile target rather than enabling the optimizations by default. Thanks again! -eric [1] https://hg.python.org/benchmarks/

Yes, the results were measured by running the benchmarks from the repo [1]. Furthermore, this optimization is generic and can handle any kind of change in hardware or in the CPython 2/3 source code. We are not adding to or modifying regrtest, and our rule will be applied to the latest tests in the CPython repo. Since those tests are kept up to date and are easy to run, this proposal makes sure that users will always benefit from them. [1] https://hg.python.org/benchmarks/ Alecsandru From: Eric Snow [mailto:ericsnowcurrently@gmail.com] Sent: Saturday, August 22, 2015 8:26 PM To: Patrascu, Alecsandru Cc: Python-Dev Subject: Re: [Python-Dev] Profile Guided Optimization active by-default On Aug 22, 2015 9:02 AM, "Patrascu, Alecsandru" <alecsandru.patrascu@intel.com> wrote: [snip]
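One rough way to read the Python 3.6 table from the first message is to summarise it rather than quote only the best case. The sketch below is illustrative: `summarize` is a hypothetical helper, the numbers are copied from that table, and nothing here assumes how the speedup percentages were originally computed.

```python
import statistics

# Python 3.6 speedups (%) copied from the table in the original post.
SPEEDUPS_36 = {
    "fastunpickle": 22.94, "fastpickle": 21.67, "json_load": 17.64,
    "simple_logging": 17.49, "meteor_contest": 16.67,
    "formatted_logging": 15.33, "etree_process": 14.61, "raytrace": 13.57,
    "etree_generate": 13.56, "chaos": 12.09, "hexiom2": 12, "nbody": 11.88,
    "json_dump_v2": 11.24, "richards": 11.02, "nqueens": 10.96,
    "fannkuch": 10.79, "go": 10.77, "float": 10.26, "regex_compile": 9.8,
    "silent_logging": 9.63, "pidigits": 9.58, "etree_iterparse": 9.48,
    "2to3": 8.44, "regex_v8": 8.09, "regex_effbot": 7.88,
    "call_simple": 7.63, "tornado_http": 7.38, "etree_parse": 4.92,
    "spectral_norm": 4.72, "normal_startup": 4.39, "telco": 3.88,
    "startup_nosite": 3.7, "call_method": 3.63, "unpack_sequence": 3.6,
    "call_method_slots": 2.91, "call_method_unknown": 2.59,
    "iterative_count": 0.45, "threaded_count": -2.79,
}


def summarize(speedups):
    """Return simple summary statistics for a {benchmark: speedup %} map."""
    values = sorted(speedups.values())
    return {
        "count": len(values),
        "median": statistics.median(values),
        "min": values[0],
        "max": values[-1],
        # Benchmarks that became slower with PGO.
        "regressions": sorted(n for n, v in speedups.items() if v < 0),
    }


if __name__ == "__main__":
    for key, value in summarize(SPEEDUPS_36).items():
        print(key, value)
```

The median and the list of regressions give a more conservative picture than the headline figure, which is the kind of evidence a trial period would be expected to collect on other workloads too.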

I'm sorry, I forgot to mention this: I already opened an issue and uploaded the patches [1]. [1] http://bugs.python.org/issue24915 From: Brett Cannon [mailto:brett@python.org] Sent: Saturday, August 22, 2015 9:00 PM To: Patrascu, Alecsandru; python-dev@python.org Subject: Re: [Python-Dev] Profile Guided Optimization active by-default I just realized I didn't see anyone say it, but please upload the patches to bugs.Python.org for easier tracking and reviewing.

On Sat, 22 Aug 2015 at 11:10 Patrascu, Alecsandru < alecsandru.patrascu@intel.com> wrote:
Great, thanks Alecsandru. Do please follow Stefan's comment, though, and upload the patch files directly and not as a zip file. That way we can use our code review tool to do a proper review of the patches. -Brett

I removed the zip file and uploaded the patches individually. Alecsandru

How about we first add a new Makefile target that enables PGO, without turning it on by default? Then later we can enable it by default. Also, I have my doubts about regrtest. How sure are we that it represents a typical Python load? Tests are often using a different mix of operations than production code. On Sat, Aug 22, 2015 at 7:46 AM, Patrascu, Alecsandru < alecsandru.patrascu@intel.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sat, Aug 22, 2015, 09:17 Guido van Rossum <guido@python.org> wrote: How about we first add a new Makefile target that enables PGO, without turning it on by default? Then later we can enable it by default. I agree. Updating the Makefile so it's easier to use PGO is great, but we should do a release with it as opt-in and go from there. Also, I have my doubts about regrtest. How sure are we that it represents a typical Python load? Tests are often using a different mix of operations than production code. That was also my question. You said that "it provides the best performance improvement", but compared to what; what else was tried? And what difference does it make to e.g. a Django app that is trained on their own simulated workload compared to using regrtest? IOW is regrtest displaying the best across-the-board performance because it stresses the largest swath of Python and thus catches generic patterns in the code but individuals could get better performance with a simulated workload? -Brett

This target replaces the existing one in the CPython Makefile, which currently uses a quick run of pybench; the resulting binary does not perform well on general Python loads. I don't think it is a good idea to add a by-default target that does PGO on dedicated workloads, like Django, because then it will perform better on that particular load and poorly on others.

Of course, if any user has a dedicated workload for which he or she wants to get the best benefit from PGO, they will have to run that training separately from the proposed one. Our proposal targets the broader audience that uses Python in various scenarios, and they will see an overall improvement after compiling Python from sources.

Alecsandru

From: Brett Cannon [mailto:brett@python.org]
Sent: Saturday, August 22, 2015 7:25 PM
To: guido@python.org; Patrascu, Alecsandru
Cc: python-dev@python.org
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

> On Sat, Aug 22, 2015, 09:17 Guido van Rossum <guido@python.org> wrote:
>> How about we first add a new Makefile target that enables PGO, without turning it on by default? Then later we can enable it by default.
>
> I agree. Updating the Makefile so it's easier to use PGO is great, but we should do a release with it as opt-in and go from there.
>
>> Also, I have my doubts about regrtest. How sure are we that it represents a typical Python load? Tests are often using a different mix of operations than production code.
>
> That was also my question. You said that "it provides the best performance improvement", but compared to what; what else was tried? And what difference does it make to e.g. a Django app that is trained on its own simulated workload compared to using regrtest? IOW, is regrtest displaying the best across-the-board performance because it stresses the largest swath of Python and thus catches generic patterns in the code, while individuals could still get better performance with a simulated workload?
> -Brett

On Sat, Aug 22, 2015, 09:58 Patrascu, Alecsandru <alecsandru.patrascu@intel.com> wrote:

> This target replaces the existing one in the CPython Makefile, which currently uses a quick run of pybench; the resulting binary does not perform well on general Python loads. I don't think it is a good idea to add a by-default target that does PGO on dedicated workloads, like Django, because then it will perform better on that particular load and poorly on others.

Sorry for not being clearer, but I was not suggesting that the default be for Django, just asking whether it is worth making the Makefile easier to work with when generating a PGO build for a custom workload. If we already have a rule that uses pybench then it should definitely be changed to use regrtest (and honestly pybench should not be used for benchmarking anything since it doesn't reflect real world usage in any way; it's just for quick checks while doing development on the core of Python and otherwise shouldn't be used to measure anything substantial).

> Of course, if any user has a dedicated workload for which he or she wants to get the best benefit from PGO, they will have to run that training separately from the proposed one. Our proposal targets the broader audience that uses Python in various scenarios, and they will see an overall improvement after compiling Python from sources.

Right, but my question was whether there was any benefit to making the Makefile rules generic, to make building PGO binaries easier for people who do want to do a custom profile, and it sounds like it isn't worth the effort. So I'm with Guido: I'm happy to see the build rules added/updated to use regrtest for a PGO build, but have it be an opt-in flag and not on by default (at least for now).
-Brett

On Sat, Aug 22, 2015 at 9:27 AM Brett Cannon <brett@python.org> wrote:
There already is one and has been for many years: make profile-opt. I even set up a buildbot for it last year.

The problem with the existing profile-opt build in our default Makefile.in is that it uses a horrible profiling workload (pybench, ugh), so it leaves a lot of improvements behind.

What all Linux distros (Debian/Ubuntu and Red Hat at least; nothing else matters) do for their Python builds is use profile-opt, but they replace the profiling workload with a stable set of the Python unittest suite itself. Results are much better all around. Generally a 20% speedup.

Anyone deploying Python who is *not* using a profile-opt build is wasting CPU resources.

Whether it should be *the default* or not *is a different question*. The Makefile is optimized for CPython developers, who certainly do not want to run two separate builds and a profile-opt workload every time they type make to test out their changes. But all binary release builds should use it.

> I agree. Updating the Makefile so it's easier to use PGO is great, but we should do a release with it as opt-in and go from there.
This isn't something to argue about. Just use regrtest and compare the before and after with the benchmark suite. It really does exercise things well.

People like to fear that it'll produce code optimized for the test suite itself or something. No. Python as an interpreter is very realistically exercised by running it, as it is simply running a lot of code, and a good variety of code, including the extension modules that benefit most, such as regexes, pickle, json, xml, etc.

Thomas tried the test suite and a variety of other workloads when looking at what to use at work. The test suite generally works out the best. Going beyond that seems to be a wash.

What we tested and decided to use on our own builds after benchmarking at work was to build with:

    make profile-opt PROFILE_TASK="-m test.regrtest -w -uall,-audio -x test_gdb test_multiprocessing"

In general, if a test is unreliable or takes an extremely long time, exclude it for your sanity. (I'd also kick out test_subprocess on 2.7; we replaced subprocess with subprocess32 in our build so that wasn't an issue.)

-gps
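For readers who want to experiment with the training workload, the invocation quoted in this message can be assembled programmatically. The following is only a minimal sketch: the helper names `profile_task` and `make_command` are hypothetical, and the flag meanings are taken from the message itself (`-u` selects test resources, `-x` excludes the listed tests, and `-w`, as far as I understand, re-runs failed tests verbosely). It builds the command line but does not run it.

```python
import shlex


def profile_task(resources=("all", "-audio"),
                 exclude=("test_gdb", "test_multiprocessing"),
                 rerun_verbose=True):
    """Return the regrtest arguments to pass as PROFILE_TASK.

    The defaults mirror the command quoted in the thread; the function
    name and parameters are illustrative, not part of CPython.
    """
    args = ["-m", "test.regrtest"]
    if rerun_verbose:
        args.append("-w")
    if resources:
        # Resources are joined into a single -u argument, e.g. -uall,-audio
        args.append("-u" + ",".join(resources))
    if exclude:
        # -x treats the test names that follow as tests to skip
        args.append("-x")
        args.extend(exclude)
    return " ".join(args)


def make_command(task):
    """Return the argv list for a profile-opt build with the given task."""
    return ["make", "profile-opt", "PROFILE_TASK=" + task]


if __name__ == "__main__":
    # Print a shell-quoted command; run it from a configured CPython tree.
    print(shlex.join(make_command(profile_task())))
```

Dropping an unreliable test from the training run then amounts to adding its name to `exclude`, for example `profile_task(exclude=("test_gdb", "test_multiprocessing", "test_subprocess"))`.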

On 25 August 2015 at 05:52, Gregory P. Smith <greg@krypto.org> wrote:
Having the "production ready" make target be "make profile-opt" doesn't strike me as the most intuitive thing in the world. I agree we want the "./configure && make" sequence to be oriented towards local development builds rather than highly optimised production ones, so perhaps we could provide a "make production" target that enables PGO with an appropriate training set from regrtest, and also complains if "--with-pydebug" is configured? Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
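The "complain if --with-pydebug is configured" part of this suggestion could be approximated from inside the interpreter. This is a hedged sketch, not anything that exists in the Makefile: `is_debug_build` and `production_problems` are hypothetical names, and the check assumes that a debug build either records `Py_DEBUG` in its build configuration or exposes `sys.gettotalrefcount`.

```python
import sys
import sysconfig


def is_debug_build():
    """Best-effort check for an interpreter configured --with-pydebug."""
    # Py_DEBUG is recorded in the build configuration on POSIX builds;
    # sys.gettotalrefcount is only present in debug builds of CPython.
    return (bool(sysconfig.get_config_var("Py_DEBUG"))
            or hasattr(sys, "gettotalrefcount"))


def production_problems(debug_build):
    """Return a list of complaints a 'production' target might print."""
    problems = []
    if debug_build:
        problems.append("interpreter was configured --with-pydebug; "
                        "a production build should not enable it")
    return problems


if __name__ == "__main__":
    for problem in production_problems(is_debug_build()):
        print("warning:", problem)
```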

On Tue, Aug 25, 2015 at 11:17 AM, Brett Cannon <brett@python.org> wrote:
You need to be careful there. In my environment, I interface with a lot of Boost.Python-wrapped code which would be quite impractical to compile with --with-pydebug. I'd like to be able to throw in all the other development bells and whistles though, without changing the size of the object header. Maybe "develop-lite"? <whatever happened to wink?>-ly, y'rs, Skip

Pardon me if I'm not in the right place to ask the following naive question (tell me if that is the case). Are the performance improvements from Profile Guided Optimization specific to the chip on which the build is done, or do they carry over to a larger set of chips?

PGO is unrelated to the particular CPU the profiling is done on. (It is conceivable that it'd make a small difference, but I've never observed that in practice.)

On Tue, Aug 25, 2015, 9:28 AM Xavier Combelle <xavier.combelle@gmail.com> wrote:

> Pardon me if I'm not in the right place to ask the following naive question (tell me if that is the case). Are the performance improvements from Profile Guided Optimization specific to the chip on which the build is done, or do they carry over to a larger set of chips?

Indeed, as Gregory rightly mentioned, PGO is unrelated to the particular CPU on which the profiling is done.

From: Python-Dev [mailto:python-dev-bounces+alecsandru.patrascu=intel.com@python.org] On Behalf Of Gregory P. Smith
Sent: Tuesday, August 25, 2015 7:44 PM
To: Xavier Combelle; python-dev@python.org
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

> PGO is unrelated to the particular CPU the profiling is done on. (It is conceivable that it'd make a small difference, but I've never observed that in practice.)
>
> On Tue, Aug 25, 2015, 9:28 AM Xavier Combelle <xavier.combelle@gmail.com> wrote:
>> Pardon me if I'm not in the right place to ask the following naive question (tell me if that is the case). Are the performance improvements from Profile Guided Optimization specific to the chip on which the build is done, or do they carry over to a larger set of chips?

On Mon, Aug 24, 2015, 11:19 PM Nick Coghlan <ncoghlan@gmail.com> wrote:

> On 25 August 2015 at 05:52, Gregory P. Smith <greg@krypto.org> wrote:
>
> Having the "production ready" make target be "make profile-opt" doesn't strike me as the most intuitive thing in the world. I agree we want the "./configure && make" sequence to be oriented towards local development builds rather than highly optimised production ones, so perhaps we could provide a "make production" target that enables PGO with an appropriate training set from regrtest, and also complains if "--with-pydebug" is configured?
>
> Regards, Nick.
> -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Agreed. Also, printing a message out at the end of a default make all build suggesting people use make production for additional performance instead might help advertise it. make install could possibly depend on make production as well?

The current pgo target just uses a very specific task to train for the feedback. For my Debian/Ubuntu builds I'm using the testsuite minus some problematic tests to train. Otoh, I don't know if this is the best way to do it; however, it gave better results at some time in the past.

What I would like is a benchmark, or a mixture of benchmarks, on which to enable pgo/pdo. Building on that, you could enable pgo through some static decisions based on autofdo. For that you don't need any profile runs during your build; it just needs shipping the autofdo outcome together with a Python release. This doesn't give you the same performance as a GCC pgo build, but it would be a first step. And defining the probe for any pgo build would be welcome too.

Matthias

On 08/22/2015 06:25 PM, Brett Cannon wrote:

Hello and thank you for your feedback.

We have measured the PGO gain using other workloads as well. Our initial choice for this optimization was pybench, but the speedup obtained was lower than with regrtest and it didn't cover a lot of Python scenarios. In contrast, regrtest has a uniform distribution of tests, and the resulting binary is overall much faster than the default build or one trained using other workloads, thus covering a larger pool of Python loads. This optimization was also tested in a production environment running OpenStack Swift and gave improvements of up to 9%.

The reason we proposed this target to be always on is that the resulting optimized binary is better out of the box for the general cases.

Alecsandru

From: gvanrossum@gmail.com [mailto:gvanrossum@gmail.com] On Behalf Of Guido van Rossum
Sent: Saturday, August 22, 2015 7:15 PM
To: Patrascu, Alecsandru
Cc: python-dev@python.org
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

> How about we first add a new Makefile target that enables PGO, without turning it on by default? Then later we can enable it by default.
>
> Also, I have my doubts about regrtest. How sure are we that it represents a typical Python load? Tests are often using a different mix of operations than production code.
>
> --Guido van Rossum (python.org/~guido)

I'm sorry, but we're just not going to turn this on by default without doing a trial period ourselves. Your (and Intel's) contribution is very welcome, but in order to establish trust in a feature like this, an optional trial period is absolutely required.

Regarding the training set, I agree that regrtest sounds better than pybench. If we make this an opt-in change, we can experiment with different training sets easily. (Also, I haven't seen the patch yet, but I presume it's easy to use a different training set? Experimentation should be encouraged.)

On Sat, Aug 22, 2015 at 9:40 AM, Patrascu, Alecsandru <alecsandru.patrascu@intel.com> wrote:
-- --Guido van Rossum (python.org/~guido)
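One way to make the suggested experimentation with training sets concrete is to compare builds by their per-benchmark speedup over a non-PGO baseline. The sketch below rests on assumptions: the thread does not define how its "Speedup(%)" figures were computed, so the formula here (baseline time divided by optimized time, minus one) is my own choice, and every timing in the example is a placeholder, not a measurement.

```python
import math


def speedup_percent(baseline_s, optimized_s):
    """Speedup of the optimized build over the baseline, in percent.

    Assumed definition: (baseline / optimized - 1) * 100, so a run that
    drops from 2.0 s to 1.6 s counts as a 25% speedup.
    """
    return (baseline_s / optimized_s - 1.0) * 100.0


def geometric_mean_speedup(baseline, optimized):
    """Geometric-mean speedup (percent) over benchmarks present in both."""
    ratios = [baseline[name] / optimized[name]
              for name in baseline if name in optimized]
    if not ratios:
        raise ValueError("no benchmarks in common")
    mean_ratio = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return (mean_ratio - 1.0) * 100.0


if __name__ == "__main__":
    # Placeholder timings in seconds, invented purely for illustration.
    baseline = {"bench_a": 2.0, "bench_b": 1.0}
    builds = {
        "trained on workload X": {"bench_a": 1.6, "bench_b": 0.8},
        "trained on workload Y": {"bench_a": 1.9, "bench_b": 0.95},
    }
    for label, timings in builds.items():
        print(label, round(geometric_mean_speedup(baseline, timings), 1))
```

A geometric mean is used so that no single benchmark with a large ratio dominates the comparison; an arithmetic mean of the percentages would be a simpler alternative.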

A trial period in which the provided patches are tested on numerous other Python loads is welcome, to be sure that they work as presented. Yes, it is easy to switch to a different training set, or to subsets of regrtest, by adding parameters to the line inside the Makefile that runs it. Currently, the attached patches run the full regrtest suite.

Alecsandru

From: gvanrossum@gmail.com [mailto:gvanrossum@gmail.com] On Behalf Of Guido van Rossum
Sent: Saturday, August 22, 2015 7:56 PM
To: Patrascu, Alecsandru
Cc: python-dev@python.org
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

> I'm sorry, but we're just not going to turn this on by default without doing a trial period ourselves. Your (and Intel's) contribution is very welcome, but in order to establish trust in a feature like this, an optional trial period is absolutely required.
>
> Regarding the training set, I agree that regrtest sounds better than pybench. If we make this an opt-in change, we can experiment with different training sets easily. (Also, I haven't seen the patch yet, but I presume it's easy to use a different training set? Experimentation should be encouraged.)

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/guido%40python.org -- --Guido van Rossum (python.org/~guido) -- --Guido van Rossum (python.org/~guido)
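The point above about swapping the training set by editing one line can be sketched as a small helper that builds the regrtest command line. This is only an illustration, not the rule from the attached patches: the function name `training_command` and the test names are hypothetical, and the only regrtest behaviour relied on is that `-x` turns the listed test names into an exclusion list.

```python
import shlex


def training_command(python="./python", tests=(), exclude=()):
    """Build a regrtest command line for a PGO training run (hypothetical helper).

    With neither `tests` nor `exclude`, the full suite runs. regrtest's -x flag
    makes the positional test names an exclusion list, so selecting and
    excluding tests cannot be combined in a single invocation.
    """
    if tests and exclude:
        raise ValueError("pass either tests to run or tests to exclude, not both")
    argv = [python, "-m", "test.regrtest"]
    if exclude:
        argv += ["-x", *exclude]      # run everything except these tests
    else:
        argv += list(tests)           # empty list means the full suite
    return argv


if __name__ == "__main__":
    # Full suite, one excluded test, and a hand-picked subset (names are examples).
    print(shlex.join(training_command()))
    print(shlex.join(training_command(exclude=["test_hashlib"])))
    print(shlex.join(training_command(tests=["test_json", "test_pickle"])))
```

Keeping the training run as a single argv list mirrors the "one command in one line" nature of the Makefile rule, so experimenting with another workload would mean replacing only that list.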

Guido van Rossum wrote on 22.08.2015 at 18:55:
> (Also, I haven't seen the patch yet, but I presume it's easy to use a different training set?

It's just one command in one line, yes.

> Experimentation should be encouraged.)

A well-chosen training set can have a notable impact on PGO-compiled code in general, and switching from pybench to regrtests should make such a difference. However, since CPython's overall performance is mostly determined by the interpreter loop, general object operations (getattr!) and the basic builtin types, of which the regression test suite makes plenty of use, it is rather unlikely that other training sets would provide substantially better performance for Python code execution.

Note also that Ubuntu has shipped PGO builds based on the regrtests for years, and they seemed to be quite happy with it.

Stefan
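To make the argument above concrete, here is a tiny, purely illustrative workload that touches the areas Stefan lists: the interpreter loop, attribute lookup, method calls and the basic builtin types. It is a hypothetical sketch; neither the patches nor regrtest contain it, and the names `Point` and `training_workload` are made up for this example.

```python
class Point:
    """A trivial object used to exercise attribute access and method calls."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def norm1(self):
        return abs(self.x) + abs(self.y)


def training_workload(n=1000):
    """Run a small mix of object, dict, list and str operations."""
    points = [Point(i, -i) for i in range(n)]        # object creation, list building
    total = sum(p.norm1() for p in points)           # method calls via a generator
    by_parity = {}
    for p in points:                                 # explicit getattr plus dict/list ops
        by_parity.setdefault(getattr(p, "x") % 2, []).append(p.y)
    text = ",".join(str(v) for v in by_parity[0][:5])  # int-to-str conversion and join
    return total, len(by_parity), text


if __name__ == "__main__":
    print(training_workload())
```

Almost any ordinary Python program spends its time in these same code paths, which is the reason given above for expecting a broad test suite to train the interpreter core about as well as a more specialised workload.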

Stefan Behnel wrote on 22.08.2015 at 19:25:
Note that this doesn't mean that it's a good workload for the C code in the standard library (and I guess that's why Alecsandru initially excluded the hashlib tests). Improvements on that front might still be possible. But it's certainly a good workload for all the rest, i.e. for executing general Python code.

Stefan

Thank you, Stefan, for also pointing out the importance of regrtest as a good training set for building Python. Indeed, Ubuntu already ships Python 2 and 3 binaries in its repositories that are optimized using PGO based on regrtest.

Alecsandru

-----Original Message-----
From: Python-Dev [mailto:python-dev-bounces+alecsandru.patrascu=intel.com@python.org] On Behalf Of Stefan Behnel
Sent: Saturday, August 22, 2015 8:25 PM
To: python-dev@python.org
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

[snip]

On Aug 22, 2015 9:02 AM, "Patrascu, Alecsandru" <alecsandru.patrascu@intel.com> wrote:
[snip]
> For instance, as shown from attached sample performance results from the Grand Unified Python Benchmark, >20% speed up was observed.

Are you referring to the tests in the benchmarks repo? [1] How does the real-world performance improvement compare with other languages you are targeting for optimization?

And thanks for working on this! I have several more questions:

What sorts of future changes in CPython's code might interfere with your optimizations?
What future additions might stand to benefit?
What changes in existing code might improve optimization opportunities?
What is the added maintenance burden of the optimizations on CPython, if any?
What is the performance impact on non-Intel architectures? What about older Intel architectures? ...and future ones?
What is Intel's commitment to supporting these (or other) optimizations in the future?
How is the practical EOL of the optimizations managed?

Finally, +1 on adding an opt-in Makefile target rather than enabling the optimizations by default.

Thanks again!

-eric

[1] https://hg.python.org/benchmarks/

Yes, the results were measured by running the benchmarks from the repo [1].

Furthermore, this optimization is generic and can handle any kind of change in the hardware or in the CPython 2/3 source code. We are not adding to or modifying regrtest, and our rule will be applied to the latest tests in the CPython repo. Since these tests are kept up to date and are easy to run, this proposal ensures that users will always benefit from them.

[1] https://hg.python.org/benchmarks/

Alecsandru

From: Eric Snow [mailto:ericsnowcurrently@gmail.com]
Sent: Saturday, August 22, 2015 8:26 PM
To: Patrascu, Alecsandru
Cc: Python-Dev
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

On Aug 22, 2015 9:02 AM, "Patrascu, Alecsandru" <alecsandru.patrascu@intel.com> wrote:
[snip]
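The thread does not say how the speedup percentages were derived from the benchmark timings. As a sketch under one common assumption, namely that "speedup" means the ratio of baseline time to PGO time expressed as a percentage gain, the calculation would look like this. The function name and the timings in the usage lines are illustrative and are not measurements from the thread.

```python
def speedup_percent(baseline_seconds, pgo_seconds):
    """Return the percentage gain of the PGO build over the baseline build.

    Assumed definition: (baseline / pgo - 1) * 100, so a positive value means
    the PGO build is faster and a negative value means it is slower.
    """
    if baseline_seconds <= 0 or pgo_seconds <= 0:
        raise ValueError("timings must be positive")
    return (baseline_seconds / pgo_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # Made-up timings: a run that drops from 1.2 s to 1.0 s is about 20% faster.
    print(round(speedup_percent(1.2, 1.0), 2))
    # A run that gets slower yields a negative value.
    print(round(speedup_percent(1.0, 1.1), 2))
```

Under this definition a negative entry in a results table would simply mean that the PGO binary ran that benchmark more slowly than the default build.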

I'm sorry, I forgot to mention this: I have already opened an issue and uploaded the patches [1].

[1] http://bugs.python.org/issue24915

From: Brett Cannon [mailto:brett@python.org]
Sent: Saturday, August 22, 2015 9:00 PM
To: Patrascu, Alecsandru; python-dev@python.org
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

I just realized I didn't see anyone say it, but please upload the patches to bugs.Python.org for easier tracking and reviewing.

On Sat, Aug 22, 2015, 08:01 Patrascu, Alecsandru <alecsandru.patrascu@intel.com> wrote:
[snip]

On Sat, 22 Aug 2015 at 11:10 Patrascu, Alecsandru <alecsandru.patrascu@intel.com> wrote:
> I'm sorry, I forgot to mention this, I already opened an issue and the patches are uploaded [1].
> [1] http://bugs.python.org/issue24915

Great, thanks Alecsandru. Do please follow Stefan's comment, though, and upload the patch files directly rather than as a zip file. That way we can use our code review tool to do a proper review of the patches.

-Brett

I removed the zip file and uploaded the patches individually.

Alecsandru

From: Brett Cannon [mailto:brett@python.org]
Sent: Sunday, August 23, 2015 4:47 AM
To: Patrascu, Alecsandru; python-dev@python.org
Subject: Re: [Python-Dev] Profile Guided Optimization active by-default

[snip]
participants (11)
- Brett Cannon
- Eric Snow
- Gregory P. Smith
- Guido van Rossum
- Matthias Klose
- Nick Coghlan
- Patrascu, Alecsandru
- R. David Murray
- Skip Montanaro
- Stefan Behnel
- Xavier Combelle