Help requested with Python 2.7 performance regression

Hello all,

Over in Ubuntu, we've gotten reports about some performance regressions in Python 2.7 when moving from Trusty (14.04 LTS) to Xenial (16.04 LTS). Trusty's version is based on 2.7.6 while Xenial's is based on 2.7.12 with bits of .13 cherry-picked. We've not been able to identify any change in Python itself (or in the Debian/Ubuntu deltas) which could account for this, so the investigation has turned to differences in gcc versions and compiler options. In particular, disabling LTO (link-time optimization) seems to have a positive impact, but doesn't completely recover the loss.

Louis (Cc'd here) has done a ton of work to measure and analyze the problem, but we've more or less hit a roadblock, so we're taking the issue public to see if anybody on this mailing list has further ideas. A detailed analysis is available in this Google doc:

https://docs.google.com/document/d/1zrV3OIRSo99fd2Ty4YdGk_scmTRDmVauBprKL8ei...

The document should be public for comments and editing. If you have any thoughts, or other lines of investigation you think are worth pursuing, please add your comments to the document.

Cheers,
-Barry

On Wed, 1 Mar 2017 12:28:24 -0500 Barry Warsaw <barry@python.org> wrote:
I may be misunderstanding the document, but it lacks at least a comparison of the *same* interpreter version built with different compiler versions. At a higher level: what if the training set used for PGO in Xenial has become skewed or inadequate? Just a thought, as it would imply that PGO+LTO uses the wrong information for code placement and other optimizations. Regards, Antoine.

On Wed, 1 Mar 2017 19:58:14 +0100 Matthias Klose <doko@ubuntu.com> wrote:
I did some tests a year or two ago, and running the whole test suite is not a good idea: coverage varies wildly from one piece of functionality to another, so PGO will not infer the right information from it, and you don't get very good benchmark results. (For example, decimal has an extensive test suite, which might lead PGO to believe that the code paths exercised by the decimal module are the hottest ones.) Regards, Antoine.

On Thu, Mar 2, 2017 at 4:07 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
FYI, there is a "profile-opt" make target. It uses a subset of regrtest. https://github.com/python/cpython/blob/2.7/Makefile.pre.in#L211-L214 Does Ubuntu (and Debian) use it?
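For context, a minimal sketch of driving that target from a CPython 2.7 source tree (assuming a standard checkout; the target builds an instrumented interpreter, runs the regrtest training subset, then rebuilds using the collected profiles):

    ./configure
    make profile-opt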

We updated profile-opt to use the test suite subset based on what distros had already been using for their training runs. As for the comment about the test suite not being good for training... mostly a myth. The test suite exercises the ceval loop well, as well as things like re and json, sufficiently to be a lot better than stupid workloads such as pybench (the previous default training run). Room for improvement in training? Likely, in some corners. But I have yet to see anyone propose an evidence-based patch with a training workload that reliably improves on what we train with today. -gpshead
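For anyone who wants to experiment with alternative training workloads, a sketch of overriding the training run at build time (the test selection here is purely illustrative; this assumes the Makefile passes PROFILE_TASK as arguments to the freshly built python, as the 2.7 Makefile linked above does):

    # Train PGO on a custom regrtest subset instead of the default one.
    make profile-opt PROFILE_TASK='-m test.regrtest test_re test_json'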

Hello, On 01/03/2017 at 18:51, Antoine Pitrou wrote:
Indeed, this is covered in the history of the LP bug, so here is the URL where those comparisons can be found: https://docs.google.com/spreadsheets/d/1MyNBPVZlBeic1OLqVKe_bcPk2deO_pQs9trI... Hope it can help. Kind regards, ...Louis -- Louis Bouchard Software engineer, Cloud & Sustaining eng. Canonical Ltd Ubuntu developer Debian Maintainer GPG : 429D 7A3B DD05 B6F8 AF63 B9C4 8B3D 867C 823E 7A61

On Wed, 1 Mar 2017 20:24:03 +0100 Louis Bouchard <louis.bouchard@canonical.com> wrote:
Some more questions:
* what does "faster" or "slower" mean (that is, which one is faster)?
* is it possible to have actual performance differences in percent? Being 2% slower is not the same as being 30% slower...
Regards, Antoine.

Hello, On 01/03/2017 at 20:40, Antoine Pitrou wrote:
"Slower" means that the second element of the comparison is slower than the first. For instance, if the test is Trusty stock vs. Xenial stock and it shows "slower", it means that Xenial stock is slower than Trusty stock. This is taken directly from the output of "pyperformance compare". The third column of each comparison (1.x) gives the ratio for the test: a test that shows "slower 1.14" is 14% slower. HTH, kind regards, ...Louis -- Louis Bouchard Software engineer, Cloud & Sustaining eng. Canonical Ltd Ubuntu developer Debian Maintainer GPG : 429D 7A3B DD05 B6F8 AF63 B9C4 8B3D 867C 823E 7A61

Hi, Your document doesn't explain how you configured the host to run benchmarks. Maybe you didn't tune Linux or anything else? Be careful with modern hardware, which can spring funny (or not) surprises. See my recent talk at FOSDEM (last month): "How to run a stable benchmark" https://fosdem.org/2017/schedule/event/python_stable_benchmark/

Factors impacting Python benchmarks:

* Linux Address Space Layout Randomization (ASLR): /proc/sys/kernel/randomize_va_space
* Python random hash function: PYTHONHASHSEED
* Command line arguments and environment variables: enabling ASLR helps here (?)
* CPU power saving and performance features: disable Intel Turbo Boost and/or use a fixed CPU frequency
* Temperature: temperature has a limited impact on benchmarks. If the CPU is below 95°C, Intel CPUs still run at full speed. With a correct cooling system, temperature is not an issue.
* Linux perf probes: /proc/sys/kernel/perf_event_max_sample_rate
* Code locality, CPU L1 instruction cache (L1i): Profile-Guided Optimization (PGO) helps here
* Other processes and the kernel: CPU isolation (CPU pinning) helps here; use isolcpus=cpu_list and rcu_nocbs=cpu_list on the Linux kernel command line
* ...

Reboot? Sadly, other unknown factors may still impact benchmarks. Sometimes it helps to reboot to restore standard performance. https://haypo-notes.readthedocs.io/microbenchmark.html#factors-impacting-ben...

Victor
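A minimal sketch of applying a few of these tunings from a shell before launching the benchmarks (the CPU numbers and values are illustrative, and cpupower availability is an assumption about the host):

    # Disable ASLR (write 2 to restore the kernel default).
    echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
    # Fix Python's hash randomization to a constant seed.
    export PYTHONHASHSEED=0
    # Run the CPUs at a fixed frequency instead of letting them scale.
    sudo cpupower frequency-set --governor performance
    # CPU isolation is a boot-time setting, on the kernel command line:
    #   isolcpus=2,3 rcu_nocbs=2,3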

On 2 March 2017 at 07:00, Victor Stinner <victor.stinner@gmail.com> wrote:
Victor, do you know if you or anyone else has compared the RHEL/CentOS 7.x binaries (Python 2.7.5 + patches, built with GCC 4.8.x) with the Fedora 25 binaries (Python 2.7.13 + patches, built with GCC 6.3.x)? I know you've been using perf to look for differences between *Python* major versions, but this would be more about using Python's benchmark suite to investigate the performance of *gcc*, since it appears that may be the culprit here. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hello, On 03/03/2017 at 08:27, Nick Coghlan wrote:
This was 'almost' intentional, as no specific OS tuning was done. The intent is to compare performance between two specific versions of the interpreter, not to target any gain in performance. Such tuning would supposedly have a linear impact on both versions; if not, then the compiler is definitely doing some funky things that I want to be aware of.
Now this is an interesting test that I can probably do myself, to a certain extent, using containers and/or VMs on the same hardware. While it will by no means be a strong validation of the performance figures, I may be able to confirm a similar trend in the results before going forward with tests on bare metal.
Thanks, ...Louis -- Louis Bouchard Software engineer, Cloud & Sustaining eng. Canonical Ltd Ubuntu developer Debian Maintainer GPG : 429D 7A3B DD05 B6F8 AF63 B9C4 8B3D 867C 823E 7A61

2017-03-03 8:27 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:
I didn't, and I'm not aware of anyone who did. It would be nice to run the performance benchmark suite, since it should now be much more reliable. By the way, I always forget to check whether Fedora and RHEL compile Python using PGO. Victor

PGO is not enabled in RHEL and Fedora. I did some initial testing for Fedora; however, it increased the compilation time of the RPM by approximately two hours, so for the time being I left it out. Regards, Charalampos Stratakis Associate Software Engineer Python Maintenance Team, Red Hat

2017-03-03 12:18 GMT+01:00 Charalampos Stratakis <cstratak@redhat.com>:
PGO is not enabled in RHEL and Fedora.
I did some initial testing for Fedora; however, it increased the compilation time of the RPM by approximately two hours, so for the time being I left it out.
Two hours on a *single* build server is very cheap compared to a 10-20% speedup on *all* computers using this PGO build, no? Victor

And that is the reason I wanted to test this a bit more. However, it adds a maintenance burden when fast fixes need to be applied (as now, while the Fedora 26 alpha is being prepared). Due to the ARM architecture and the huge test suite, the build already takes approximately 3 hours and 30 minutes; increasing that by two hours is not something I would do during the development phase. On another note, RHEL's Python does not have the PGO functionality backported to it. Regards, Charalampos Stratakis Associate Software Engineer Python Maintenance Team, Red Hat

Hello, On 03/03/2017 at 08:27, Nick Coghlan wrote:
Out of curiosity, I ran the set of benchmarks in two LXC containers running CentOS 7 (2.7.5 + gcc 4.8.5) and Fedora 25 (2.7.13 + gcc 6.3.x). Of the benchmarks, 18 run faster, 12 run slower, and the differences are insignificant for the rest (~33 from memory). Do take into account that this was run on a bare-metal system running an Ubuntu kernel (4.4.0-59), so this is by no means a reference value, just a quick test. Results were appended to the spreadsheet referred to in the analysis document. It is somewhat consistent with a previous test I ran where I disabled PGO on 2.7.6+gcc4.8 (Trusty): that made the Trusty interpreter slower than the Xenial reference. Unfortunately, I cannot redeploy my server on RHEL or Fedora at the moment, so this is as far as I can go. Kind regards, ...Louis -- Louis Bouchard Software engineer, Cloud & Sustaining eng. Canonical Ltd Ubuntu developer Debian Maintainer GPG : 429D 7A3B DD05 B6F8 AF63 B9C4 8B3D 867C 823E 7A61
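For anyone wanting to reproduce this kind of quick container comparison, a rough sketch (the image aliases, in-container installation of the performance suite, and file names are all assumptions, not what Louis actually ran):

    # Create the two containers from the public image server.
    lxc launch images:centos/7 c7
    lxc launch images:fedora/25 f25
    # Install python2 and the performance benchmark suite in each (not shown),
    # then run the benchmarks and write JSON results:
    lxc exec c7 -- python -m performance run -o centos7.json
    lxc exec f25 -- python -m performance run -o fedora25.json
    # Pull the results out and compare on the host:
    lxc file pull c7/root/centos7.json .
    lxc file pull f25/root/fedora25.json .
    python3 -m performance compare centos7.json fedora25.json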

"faster" or "slower" is relative: I would like to see the ?.??x faster/slower or percent value. Can you please share the result? I don't know what is the best output: python3 -m performance compare centos.json fedora.json or the new: python3 -m perf compare_to centos.json fedora.json --table --quiet Victor

Hello, On 03/03/2017 at 15:31, Victor Stinner wrote:
All the results, including the latest, are in the spreadsheet here (cited in the analysis document): https://docs.google.com/spreadsheets/d/1pKCOpyu4HUyw9YtJugn6jzVGa_zeDmBVNzqm... The third column is the ?.??x value that you are looking for, taken directly from the 'pyperformance analyze' results. I didn't know about the new options; I'll give them a spin and see if I can get a better format. Kind regards, ...Louis -- Louis Bouchard Software engineer, Cloud & Sustaining eng. Canonical Ltd Ubuntu developer Debian Maintainer GPG : 429D 7A3B DD05 B6F8 AF63 B9C4 8B3D 867C 823E 7A61

Hello, On 03/03/2017 at 15:37, Louis Bouchard wrote:
All the benchmark data using the new format have been uploaded to the spreadsheet. Each sheet is prefixed with pct_. HTH, Kind regards, ...Louis -- Louis Bouchard Software engineer, Cloud & Sustaining eng. Canonical Ltd Ubuntu developer Debian Maintainer GPG : 429D 7A3B DD05 B6F8 AF63 B9C4 8B3D 867C 823E 7A61

Hello, All, We have been tracking Python performance over the last 1.5 years, and the results (along with other languages) are published daily at this site: https://languagesperformance.intel.com/ The general regression trend discussed here is the same as what we observed. The Python source code is pulled, built, and the results published daily, following exactly the same process, on exactly the same hardware, running exactly the same operating system image.

Take Django_v2 on 2.7 as an example, comparing the 2/10/2017 commit 54c93e0fe79b0ec7c9acccc35dabae2ffa4d563a with the 8/27/2015 commit 514f5d6101752f10758c5b89e20941bc3d13008a:
* Default build: the regression is 2.5%
* PGO build: the regression is 0.47%

We turned off hyperthreading, turbo, and ASLR, and set the CPU frequency to a constant value to mitigate run-to-run variation. Currently we are only running a limited number of micro-benchmarks, but we plan to run a broader range of benchmarks/workloads. The one under consideration to start with is the Python benchmark suite (all): https://github.com/python/performance We'd love to hear feedback on how to best monitor Python code changes and performance, and on how to present (look and feel, charts, etc.) and communicate the results. Thanks, Peter

After much testing, I found what is causing the regression in 16.04 and later. There are several distinct causes, attributable to choices made in debian/rules and to changes in GCC.

Cause #1: the decision to compile `Modules/_math.c` with `-fPIC` *and* link it statically into the python executable [1]. This causes the majority of the slowdown. This may be a bug in GCC or simply a constraint; I didn't find anything specific on this topic, although there are a lot of old bug reports regarding the interaction of -fPIC with -flto.

Cause #2: the enablement of `fpectl` [2], specifically passing `--with-fpectl` to `configure`. fpectl is disabled in python.org builds by default and its use is discouraged, yet Debian builds enable it unconditionally, and it seems to cause a significant performance degradation. It's much less noticeable on 14.04 with GCC 4.8.0, but on more recent releases the performance difference seems to be larger.

Plausible Cause #3: stronger stack-smashing protection in 16.04, which uses -fstack-protector-strong, whereas 14.04 and earlier used -fstack-protector (with less performance overhead).

Also, debian/rules limits the scope of PGO's PROFILE_TASK to 377 test suites vs upstream's 397, which affects performance somewhat negatively, though this is not definitive. What are the reasons behind the trimming of the tests used for PGO?

Without fpectl, and without -fPIC on _math.c, 2.7.12 built on 16.04 is slower than stock 2.7.6 on 14.04 by about 0.9% in my pyperformance runs [3]. This is in contrast to a whopping 7.95% slowdown when comparing the stock versions. Finally, a vanilla Python 2.7.12 build using GCC 5.4.0, default CFLAGS, default PROFILE_TASK and default Modules/Setup.local consistently runs faster in benchmarks than 2.7.6 (by about 0.7%), but I was not able to pinpoint the exact reason for this difference.

Note: the percentages above are the relative change in the geometric mean of the pyperformance benchmark results.

[1] https://git.launchpad.net/~usd-import-team/ubuntu/+source/python2.7/tree/deb...
[2] https://git.launchpad.net/~usd-import-team/ubuntu/+source/python2.7/tree/deb...
[3] https://docs.google.com/spreadsheets/d/1L3_gxe-AOYJsXFwGZgFko8jaChB0dFPjK5oM...

Elvis
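For reference, a rough sketch of the kind of vanilla comparison build described above (the tarball name and parallelism are assumptions; the point is the absence of --with-fpectl, extra CFLAGS, and the _math.c -fPIC tweak):

    # Stock CPython 2.7.12, upstream defaults, PGO via profile-opt:
    tar xf Python-2.7.12.tgz
    cd Python-2.7.12
    ./configure            # no --with-fpectl, default Modules/Setup.local
    make -j4 profile-opt   # trains with the upstream default PROFILE_TASK

On the summary statistic: the geometric mean is the nth root of the product of the n per-benchmark ratios, so the percentages above cannot be dominated by a single outlier benchmark the way an arithmetic mean could be.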

participants (11)
- Antoine Pitrou
- Barry Warsaw
- Charalampos Stratakis
- Elvis Pranskevichus
- Gregory P. Smith
- INADA Naoki
- Louis Bouchard
- Matthias Klose
- Nick Coghlan
- Victor Stinner
- Wang, Peter Xihong