Possible performance regression
I've been running benchmarks that have been stable for a while. But between today and yesterday, there has been an almost across-the-board performance regression. It's possible that this is a measurement error or something unique to my system (my Mac installed the 10.14.3 release today), so I'm hoping other folks can run checks as well.

Raymond

-- Yesterday ------------------------------------------------------------------------

$ ./python.exe Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
  4.0 ns   read_local
  4.5 ns   read_nonlocal
 13.1 ns   read_global
 17.4 ns   read_builtin
 17.4 ns   read_classvar_from_class
 15.8 ns   read_classvar_from_instance
 24.6 ns   read_instancevar
 19.7 ns   read_instancevar_slots
 18.5 ns   read_namedtuple
 26.3 ns   read_boundmethod

Variable and attribute write access:
  4.6 ns   write_local
  4.8 ns   write_nonlocal
 17.5 ns   write_global
 39.1 ns   write_classvar
 34.4 ns   write_instancevar
 25.3 ns   write_instancevar_slots

Data structure read access:
 17.5 ns   read_list
 18.4 ns   read_deque
 19.2 ns   read_dict

Data structure write access:
 19.0 ns   write_list
 22.0 ns   write_deque
 24.4 ns   write_dict

Stack (or queue) operations:
 55.5 ns   list_append_pop
 46.3 ns   deque_append_pop
 46.7 ns   deque_append_popleft

Timing loop overhead:
  0.3 ns   loop_overhead

-- Today ---------------------------------------------------------------------------

$ ./python.exe Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
  5.0 ns   read_local
  5.3 ns   read_nonlocal
 14.7 ns   read_global
 18.6 ns   read_builtin
 19.9 ns   read_classvar_from_class
 17.7 ns   read_classvar_from_instance
 26.1 ns   read_instancevar
 21.0 ns   read_instancevar_slots
 21.7 ns   read_namedtuple
 27.8 ns   read_boundmethod

Variable and attribute write access:
  6.1 ns   write_local
  7.3 ns   write_nonlocal
 18.9 ns   write_global
 40.7 ns   write_classvar
 36.2 ns   write_instancevar
 26.1 ns   write_instancevar_slots

Data structure read access:
 19.1 ns   read_list
 19.6 ns   read_deque
 20.6 ns   read_dict

Data structure write access:
 22.8 ns   write_list
 23.5 ns   write_deque
 27.8 ns   write_dict

Stack (or queue) operations:
 54.8 ns   list_append_pop
 49.5 ns   deque_append_pop
 49.4 ns   deque_append_popleft

Timing loop overhead:
  0.3 ns   loop_overhead
I'll take a look tonight. -eric On Sun, Feb 24, 2019, 21:54 Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
I've been running benchmarks that have been stable for a while. But between today and yesterday, there has been an almost across-the-board performance regression.
It's possible that this is a measurement error or something unique to my system (my Mac installed the 10.14.3 release today), so I'm hoping other folks can run checks as well.
Raymond
On Sun, Feb 24, 2019 at 10:04 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'll take a look tonight.
I made 2 successive runs of the script (on my laptop) for a commit from early Saturday, and 2 runs from a commit this afternoon (close to master). The output is below, with the earlier commit first. That one is a little faster in places and a little slower in others. However, I also saw quite a bit of variability in the results for the same commit. So I'm not sure what to make of it.

I'll look into it in more depth tomorrow. FWIW, I have a few commits in the range you described, so I want to make sure I didn't slow things down for us. :)

-eric

* commit 175421b58cc97a2555e474f479f30a6c5d2250b0 (HEAD)
| Author: Pablo Galindo <Pablogsal@gmail.com>
| Date: Sat Feb 23 03:02:06 2019 +0000
|
|     bpo-36016: Add generation option to gc.getobjects() (GH-11909)

$ ./python Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
 18.1 ns   read_local
 19.4 ns   read_nonlocal
 48.3 ns   read_global
 52.4 ns   read_builtin
 55.7 ns   read_classvar_from_class
 56.1 ns   read_classvar_from_instance
 78.6 ns   read_instancevar
 67.6 ns   read_instancevar_slots
 65.9 ns   read_namedtuple
106.1 ns   read_boundmethod

Variable and attribute write access:
 25.1 ns   write_local
 26.9 ns   write_nonlocal
 78.0 ns   write_global
154.1 ns   write_classvar
132.0 ns   write_instancevar
 88.2 ns   write_instancevar_slots

Data structure read access:
 69.6 ns   read_list
 69.0 ns   read_deque
 68.4 ns   read_dict

Data structure write access:
 73.2 ns   write_list
 79.0 ns   write_deque
103.5 ns   write_dict

Stack (or queue) operations:
348.3 ns   list_append_pop
169.0 ns   deque_append_pop
170.8 ns   deque_append_popleft

Timing loop overhead:
  1.3 ns   loop_overhead

$ ./python Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
 17.7 ns   read_local
 19.2 ns   read_nonlocal
 39.9 ns   read_global
 50.3 ns   read_builtin
 54.4 ns   read_classvar_from_class
 55.8 ns   read_classvar_from_instance
 80.3 ns   read_instancevar
 70.7 ns   read_instancevar_slots
 66.1 ns   read_namedtuple
108.9 ns   read_boundmethod

Variable and attribute write access:
 25.1 ns   write_local
 25.6 ns   write_nonlocal
 70.0 ns   write_global
151.5 ns   write_classvar
133.9 ns   write_instancevar
 90.7 ns   write_instancevar_slots

Data structure read access:
140.7 ns   read_list
 89.6 ns   read_deque
 86.6 ns   read_dict

Data structure write access:
 97.9 ns   write_list
100.5 ns   write_deque
120.0 ns   write_dict

Stack (or queue) operations:
375.9 ns   list_append_pop
179.3 ns   deque_append_pop
179.4 ns   deque_append_popleft

Timing loop overhead:
  1.5 ns   loop_overhead

* commit 3b0abb019662e42070f1d6f7e74440afb1808f03 (HEAD)
| Author: Giampaolo Rodola <g.rodola@gmail.com>
| Date: Sun Feb 24 15:46:40 2019 -0800
|
|     bpo-33671: allow setting shutil.copyfile() bufsize globally (GH-12016)

$ ./python Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
 20.2 ns   read_local
 20.0 ns   read_nonlocal
 41.9 ns   read_global
 52.9 ns   read_builtin
 56.3 ns   read_classvar_from_class
 56.9 ns   read_classvar_from_instance
 80.2 ns   read_instancevar
 70.6 ns   read_instancevar_slots
 69.5 ns   read_namedtuple
114.5 ns   read_boundmethod

Variable and attribute write access:
 23.4 ns   write_local
 25.0 ns   write_nonlocal
 74.5 ns   write_global
152.0 ns   write_classvar
131.7 ns   write_instancevar
 90.1 ns   write_instancevar_slots

Data structure read access:
 69.9 ns   read_list
 73.4 ns   read_deque
 77.8 ns   read_dict

Data structure write access:
 83.3 ns   write_list
 94.9 ns   write_deque
120.6 ns   write_dict

Stack (or queue) operations:
383.4 ns   list_append_pop
187.1 ns   deque_append_pop
182.2 ns   deque_append_popleft

Timing loop overhead:
  1.4 ns   loop_overhead

$ ./python Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
 19.1 ns   read_local
 20.9 ns   read_nonlocal
 43.8 ns   read_global
 57.8 ns   read_builtin
 58.4 ns   read_classvar_from_class
 61.3 ns   read_classvar_from_instance
 84.7 ns   read_instancevar
 72.9 ns   read_instancevar_slots
 69.7 ns   read_namedtuple
109.9 ns   read_boundmethod

Variable and attribute write access:
 23.1 ns   write_local
 23.7 ns   write_nonlocal
 72.8 ns   write_global
149.9 ns   write_classvar
133.3 ns   write_instancevar
 89.4 ns   write_instancevar_slots

Data structure read access:
 69.0 ns   read_list
 69.6 ns   read_deque
 69.1 ns   read_dict

Data structure write access:
 74.5 ns   write_list
 80.9 ns   write_deque
105.4 ns   write_dict

Stack (or queue) operations:
338.2 ns   list_append_pop
165.6 ns   deque_append_pop
164.7 ns   deque_append_popleft

Timing loop overhead:
  1.3 ns   loop_overhead
On Feb 24, 2019, at 10:06 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'll look into it in more depth tomorrow. FWIW, I have a few commits in the range you described, so I want to make sure I didn't slow things down for us. :)
Thanks for looking into it. FWIW, I can consistently reproduce the results several times in a row. Here's the bash script I'm using:

#!/bin/bash
make clean
./configure
make        # Apple LLVM version 10.0.0 (clang-1000.11.45.5)

for i in `seq 1 3`; do
    git checkout d610116a2e48b55788b62e11f2e6956af06b3de0       # Go back to 2/23
    make                                                        # Rebuild
    sleep 30                                                    # Let the system get quiet and cool
    echo '---- baseline ---' >> results.txt                     # Label output
    ./python.exe Tools/scripts/var_access_benchmark.py >> results.txt   # Run benchmark

    git checkout 16323cb2c3d315e02637cebebdc5ff46be32ecdf       # Go to end-of-day 2/24
    make                                                        # Rebuild
    sleep 30                                                    # Let the system get quiet and cool
    echo '---- end of day ---' >> results.txt                   # Label output
    ./python.exe Tools/scripts/var_access_benchmark.py >> results.txt   # Run benchmark
done
-eric
* commit 175421b58cc97a2555e474f479f30a6c5d2250b0 (HEAD) | Author: Pablo Galindo <Pablogsal@gmail.com> | Date: Sat Feb 23 03:02:06 2019 +0000 | | bpo-36016: Add generation option to gc.getobjects() (GH-11909)
$ ./python Tools/scripts/var_access_benchmark.py Variable and attribute read access: 18.1 ns read_local 19.4 ns read_nonlocal
These timings are several times larger than they should be. Perhaps you're running a debug build? Or perhaps 32-bit? Or in a VM or some such? Something looks way off, because I'm getting 4 and 5 ns on my 2013 Haswell laptop.

Raymond
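Raymond's first two guesses are easy to rule out by asking the interpreter under test directly. A minimal sketch using only the standard library (the variable names are invented for the example):

```python
import platform
import struct
import sysconfig

# Py_DEBUG is 1 for a --with-pydebug build, 0 or None for a release build.
debug_build = bool(sysconfig.get_config_var('Py_DEBUG'))

# Pointer size distinguishes a 32-bit interpreter (4 bytes) from 64-bit (8 bytes).
pointer_bits = struct.calcsize('P') * 8

print(f"debug build: {debug_build}")
print(f"{pointer_bits}-bit interpreter on {platform.machine()}")
```

Running this with the same `./python` used for the benchmark confirms whether the slow numbers come from the build configuration rather than the code change.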
Hi, On Mon, Feb 25, 2019 at 05:57, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
I've been running benchmarks that have been stable for a while. But between today and yesterday, there has been an almost across-the-board performance regression.
How do you run your benchmarks? If you use Linux, are you using CPU isolation?
It's possible that this is a measurement error or something unique to my system (my Mac installed the 10.14.3 release today), so I'm hoping other folks can run checks as well.
Getting reproducible benchmark results on timing smaller than 1 ms is really hard. I wrote some advices to get more stable results: https://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproduc...
Variable and attribute read access: 4.0 ns read_local
In my experience, for timings of less than 100 ns, *everything* impacts the benchmark, and the result is useless without the standard deviation.

On such microbenchmarks, the hash function has a significant impact on performance. So you should run your benchmark in multiple different *processes* to get multiple different hash functions. Some people prefer to use PYTHONHASHSEED=0 (or another value), but I dislike using that since it's less representative of performance "in production" (with a randomized hash function). For example, using 20 processes to test 20 randomized hash functions is enough to compute the average cost of the hash function. That remark is general, though; I didn't look at the specific case of var_access_benchmark.py, and benchmarks of C code may not depend on the hash function.

For example, 4.0 ns +/- 10 ns and 4.0 ns +/- 0.1 ns are completely different bases for deciding whether "5.0 ns" is slower or faster. The "perf compare" command of my perf module "determines whether two samples differ significantly using a Student's two-sample, two-tailed t-test with alpha equals to 0.95": https://en.wikipedia.org/wiki/Student's_t-test
I don't understand how these things work; I just copied the code from the old Python benchmark suite :-)

See also my articles on my journey to stable benchmarks:

* https://vstinner.github.io/journey-to-stable-benchmark-system.html   # noisy applications / CPU isolation
* https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html # PGO
* https://vstinner.github.io/journey-to-stable-benchmark-average.html  # randomized hash function

There are likely other parameters which impact benchmarks; that's why the std dev, and how the benchmark is run, matter so much.

Victor
-- Night gathers, and now my watch begins. It shall not end until my death.
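Victor's multiple-process advice can be sketched with just the standard library (a minimal illustration, not his actual perf/pyperf tooling; the dict-lookup workload, the iteration counts, and the choice of 5 processes are arbitrary for the example):

```python
import statistics
import subprocess
import sys

# Each child process gets its own randomized hash seed, so averaging
# across processes averages over hash functions, as suggested above.
SNIPPET = (
    "import timeit;"
    "print(min(timeit.repeat('d[\"key\"]', 'd = {\"key\": 1}',"
    " number=100000, repeat=3)))"
)

def run_once():
    # Spawn a fresh interpreter so PYTHONHASHSEED is re-randomized.
    out = subprocess.run([sys.executable, '-c', SNIPPET],
                         capture_output=True, text=True, check=True)
    return float(out.stdout)

timings = [run_once() for _ in range(5)]  # Victor suggests ~20; 5 keeps it quick
print(f"{statistics.mean(timings):.4f} s +/- {statistics.stdev(timings):.4f} s")
```

Reporting the mean together with the standard deviation is the point: a result like 4.0 ns +/- 10 ns cannot support any conclusion, while 4.0 ns +/- 0.1 ns can.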
On Sun, 24 Feb 2019 20:54:02 -0800 Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
I've been running benchmarks that have been stable for a while. But between today and yesterday, there has been an almost across-the-board performance regression.
Have you tried bisecting to find the offending changeset, if there is one?

Regards

Antoine.
On Feb 25, 2019, at 2:54 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Have you tried bisecting to find the offending changeset, if there is one?
I got it down to two checkins before running out of time. Between:

git checkout 463572c8beb59fd9d6850440af48a5c5f4c0c0c9

and:

git checkout 3b0abb019662e42070f1d6f7e74440afb1808f03

So the subinterpreter patch was likely the trigger. I can reproduce it over and over again with Clang, but not for a GCC-8 build, so it is compiler-specific (and possibly macOS-specific).

Will look at it more after work this evening. I posted here to try to solicit independent confirmation.

Raymond
On Mon, Feb 25, 2019 at 10:32 AM Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
I got it down to two checkins before running out of time:
Between git checkout 463572c8beb59fd9d6850440af48a5c5f4c0c0c9
And: git checkout 3b0abb019662e42070f1d6f7e74440afb1808f03
So the subinterpreter patch was likely the trigger.
I can reproduce it over and over again on Clang, but not for a GCC-8 build, so it is compiler specific (and possibly macOS specific).
Will look at it more after work this evening. I posted here to try to solicit independent confirmation.
I'll look into it around then too. See https://bugs.python.org/issue33608. -eric
On Mon, Feb 25, 2019 at 10:42 AM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'll look into it around then too. See https://bugs.python.org/issue33608.
I ran the "performance" suite (https://github.com/python/performance), which has 57 different benchmarks. In the results, 9 were marked as "significantly" different between the two commits. 2 of the benchmarks showed a marginal slowdown and 7 showed a marginal speedup:

+-------------------------+--------------+-------------+--------------+-----------------------+
| Benchmark               | speed.before | speed.after | Change       | Significance          |
+=========================+==============+=============+==============+=======================+
| django_template         | 177 ms       | 172 ms      | 1.03x faster | Significant (t=3.66)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| html5lib                | 126 ms       | 122 ms      | 1.03x faster | Significant (t=3.46)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| json_dumps              | 17.6 ms      | 17.2 ms     | 1.02x faster | Significant (t=2.65)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| nbody                   | 157 ms       | 161 ms      | 1.03x slower | Significant (t=-3.85) |
+-------------------------+--------------+-------------+--------------+-----------------------+
| pickle_dict             | 29.5 us      | 30.5 us     | 1.03x slower | Significant (t=-6.37) |
+-------------------------+--------------+-------------+--------------+-----------------------+
| scimark_monte_carlo     | 144 ms       | 139 ms      | 1.04x faster | Significant (t=3.61)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| scimark_sparse_mat_mult | 5.41 ms      | 5.25 ms     | 1.03x faster | Significant (t=4.26)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| sqlite_synth            | 3.99 us      | 3.91 us     | 1.02x faster | Significant (t=2.49)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| unpickle_pure_python    | 497 us       | 481 us      | 1.03x faster | Significant (t=5.04)  |
+-------------------------+--------------+-------------+--------------+-----------------------+

(Issue #33608 has more detail.) So it looks like commit ef4ac967 is not responsible for a performance regression.

-eric
Hi, On Tue, Feb 26, 2019 at 05:27, Eric Snow <ericsnowcurrently@gmail.com> wrote:
I ran the "performance" suite (https://github.com/python/performance), which has 57 different benchmarks.
Ah yes, by the way: I also manually ran performance on speed.python.org yesterday; it added a new data point for Feb 25.
In the results, 9 were marked as "significantly" different between the two commits. 2 of the benchmarks showed a marginal slowdown and 7 showed a marginal speedup:
I'm not surprised :-) Noise on microbenchmarks is usually absorbed by the std dev (the delta is included in the std dev). At speed.python.org, you can see that performance has been basically stable since last summer. Have a look at https://speed.python.org/timeline/
| Benchmark               | speed.before | speed.after | Change       | Significance          |
+=========================+==============+=============+==============+=======================+
| django_template         | 177 ms       | 172 ms      | 1.03x faster | Significant (t=3.66)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| html5lib                | 126 ms       | 122 ms      | 1.03x faster | Significant (t=3.46)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| json_dumps              | 17.6 ms      | 17.2 ms     | 1.02x faster | Significant (t=2.65)  |
+-------------------------+--------------+-------------+--------------+-----------------------+
| nbody                   | 157 ms       | 161 ms      | 1.03x slower | Significant (t=-3.85) |
(...)
Usually, I just ignore changes which are smaller than 5% ;-) Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On 2019-02-25, Eric Snow wrote:
So it looks like commit ef4ac967 is not responsible for a performance regression.
I did a bit of exploration myself and that was my conclusion as well. Perhaps others would be interested in how to use "perf" so I did a little write up: https://discuss.python.org/t/profiling-cpython-with-perf/940 To me, it looks like using a register based VM could produce a pretty decent speedup. Research project for someone. ;-) Regards, Neil
I made an attempt once and it was faster: https://faster-cpython.readthedocs.io/registervm.html

But I had bugs, and I didn't know how to correctly implement a compiler.

Victor

On Tuesday, February 26, 2019, Neil Schemenauer <nas-python@arctrix.com> wrote:
On 2019-02-25, Eric Snow wrote:
So it looks like commit ef4ac967 is not responsible for a performance regression.
I did a bit of exploration myself and that was my conclusion as well. Perhaps others would be interested in how to use "perf" so I did a little write up:
https://discuss.python.org/t/profiling-cpython-with-perf/940
To me, it looks like using a register based VM could produce a pretty decent speedup. Research project for someone. ;-)
Regards,
Neil
-- Night gathers, and now my watch begins. It shall not end until my death.
On 2019-02-26, Victor Stinner wrote:
I made an attempt once and it was faster: https://faster-cpython.readthedocs.io/registervm.html
Interesting. I don't think I have seen that before. Were you aware of "Rattlesnake" before you started on that? It seems your approach is similar. Probably not, because I don't think it is easy to find. I uploaded a tarfile I had on my PC to my web site:

http://python.ca/nas/python/rattlesnake20010813/

It seems his name doesn't appear in the readme or source, but I think Rattlesnake was Skip Montanaro's project. I suppose my idea of unifying the local variables and the registers could have come from Rattlesnake. Very little new in the world. ;-P

Cheers, Neil
Yes, this should totally be attempted. All the stack-manipulation opcodes could be dropped if we just made (nearly) everything use 3-address codes, e.g. ADD would take the names of three registers: left, right and result. The compiler would keep track of which registers contain a live object (for reference counting), but that can't be much more complicated than checking for stack under- and over-flow.

Also, nothing new indeed: my first computer (a Control Data Cyber mainframe) had 3-address code.
https://en.wikipedia.org/wiki/CDC_6600#Central_Processor_(CP)

On Tue, Feb 26, 2019 at 1:01 PM Neil Schemenauer <nas-python@python.ca> wrote:
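The three-address scheme described above can be made concrete with a toy interpreter (purely a sketch; the opcode set, the register numbering, and the `run` function are invented for this example and are not any proposed CPython design):

```python
# Toy three-address register VM: every instruction names its destination
# and operand registers explicitly, so no stack-shuffling opcodes exist.
def run(code, registers):
    for op, dst, left, right in code:
        if op == 'ADD':
            registers[dst] = registers[left] + registers[right]
        elif op == 'MUL':
            registers[dst] = registers[left] * registers[right]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return registers

# r2 = r0 + r1; r3 = r2 * r2.  A stack VM would interleave the arithmetic
# with LOAD/STORE-style instructions to move values on and off the stack.
program = [('ADD', 2, 0, 1), ('MUL', 3, 2, 2)]
regs = run(program, {0: 3, 1: 4, 2: None, 3: None})
print(regs[3])  # -> 49
```

Reference counting is what the toy omits: a real design would need the liveness tracking Guido mentions, so the VM knows when a register's previous occupant can be decref'ed.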
On 2019-02-26, Victor Stinner wrote:
I made an attempt once and it was faster: https://faster-cpython.readthedocs.io/registervm.html
Interesting. I don't think I have seen that before. Were you aware of "Rattlesnake" before you started on that? It seems your approach is similar. Probably not because I don't think it is easy to find. I uploaded a tarfile I had on my PC to my web site:
http://python.ca/nas/python/rattlesnake20010813/
It seems his name doesn't appear in the readme or source, but I think Rattlesnake was Skip Montanaro's project. I suppose my idea of unifying the local variables and the registers could have come from Rattlesnake. Very little new in the world. ;-P
Cheers,
Neil
-- --Guido van Rossum (python.org/~guido)
No, I wasn't aware of this project. My starting point was:

http://static.usenix.org/events/vee05/full_papers/p153-yunhe.pdf
Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl, 2005

See also my email to python-dev that I sent in 2012:
https://mail.python.org/pipermail/python-dev/2012-November/122777.html

Ah, the main issue with my implementation was that I started without taking care of clearing registers when the stack-based bytecode implicitly cleared a reference (decref), like the "POP_TOP" operation. I added "CLEAR_REG" late in the development and it caused me trouble, and the "correct" register-based bytecode was less efficient than the bytecode without CLEAR_REG. But my optimizer was very limited, too limited.

Another implementation issue I had was understanding some "implicit usage" of the stack, like try/except, which does black magic, whereas I wanted to make everything explicit for registers. I'm talking about things like "POP_BLOCK" and "SETUP_EXCEPT". In my implementation, I kept support for stack-based bytecode, and so I had some inefficient code and some corner cases.

My approach was to convert stack-based bytecode to register-based bytecode on the fly. Having both in the same code allowed me to run some benchmarks. Maybe it wasn't the best approach, but I didn't feel able to write a real compiler (AST => bytecode).

Victor

On Tue, Feb 26, 2019 at 21:58, Neil Schemenauer <nas-python@python.ca> wrote:
On 2019-02-26, Victor Stinner wrote:
I made an attempt once and it was faster: https://faster-cpython.readthedocs.io/registervm.html
Interesting. I don't think I have seen that before. Were you aware of "Rattlesnake" before you started on that? It seems your approach is similar. Probably not because I don't think it is easy to find. I uploaded a tarfile I had on my PC to my web site:
http://python.ca/nas/python/rattlesnake20010813/
It seems his name doesn't appear in the readme or source, but I think Rattlesnake was Skip Montanaro's project. I suppose my idea of unifying the local variables and the registers could have come from Rattlesnake. Very little new in the world. ;-P
Cheers,
Neil
-- Night gathers, and now my watch begins. It shall not end until my death.
Let me just say that the code for METH_FASTCALL function/method calls is optimized for a stack layout: a piece of the stack is used directly for calling METH_FASTCALL functions (without copying any PyObject* pointers). So this would probably be slower with a register-based VM (which doesn't imply that it's a bad idea; it's just a single point to take into account).
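The stack layout in question is visible in the bytecode: the callable and its positional arguments are pushed into consecutive stack slots before the call instruction, and it is that contiguous region that METH_FASTCALL consumes. A quick look with `dis` (opcode names vary between CPython versions; `call_site` is just an example function):

```python
import dis

def call_site(f, a, b):
    # f, a and b end up in consecutive stack slots before the call opcode.
    return f(a, b)

# List the opcode names; expect LOAD_FAST-style pushes followed by a
# CALL-family instruction that consumes the contiguous argument slots.
ops = [i.opname for i in dis.get_instructions(call_site)]
print(ops)
```

Under a register VM, arguments could live in arbitrary registers, so they would either need to be copied into a contiguous block before the call or the fast-call convention would need to change.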
Jeroen Demeyer wrote:
Let me just say that the code for METH_FASTCALL function/method calls is optimized for a stack layout: a piece of the stack is used directly for calling METH_FASTCALL functions
We might be able to get some ideas for dealing with this kind of thing from register-window architectures such as the SPARC, where the registers containing the locals of a calling function become the input parameters to a called function. More generally, it's common to have a calling convention where the first N parameters are assumed to reside in a specific range of registers. If the compiler is smart enough, it can often arrange the evaluation of the parameter expressions so that the results end up in the right registers for making the call. -- Greg
Hum, I read again my old REGISTERVM.txt that I wrote a few years ago. A little bit more context: in my "registervm" fork I also tried to implement further optimizations, like moving invariants out of the loop. Some optimizations could change the Python semantics, like removing "duplicated" LOAD_GLOBAL even though the global might be modified in the middle. I wanted to experiment with such optimizations. Maybe it was a bad idea to convert stack-based bytecode to register-based bytecode and experiment with these optimizations at the same time.

Victor

On Tue, Feb 26, 2019 at 22:42, Victor Stinner <vstinner@redhat.com> wrote:
No, I wasn't aware of this project. My starting point was:
http://static.usenix.org/events/vee05/full_papers/p153-yunhe.pdf Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl, 2005
See also my email to python-dev that I sent in 2012: https://mail.python.org/pipermail/python-dev/2012-November/122777.html
Ah, the main issue with my implementation was that I started without taking care of clearing registers when the stack-based bytecode implicitly cleared a reference (decref), like the "POP_TOP" operation.
I added "CLEAR_REG" late in the development and it caused me troubles, and the "correct" register-based bytecode was less efficient than bytecode without CLEAR_REG. But my optimizer was very limited, too limited.
Another implementation issue that I had was to understand some "implicit usage" of the stack like try/except which do black magic, whereas I wanted to make everything explicit for registers. I'm talking about things like "POP_BLOCK" and "SETUP_EXCEPT". In my implementation, I kept support for stack-based bytecode, and so I had some inefficient code and some corner cases.
My approach was to convert stack-based bytecode to register-based bytecode on the fly. Having both in the same code allowed me to run some benchmarks. Maybe it wasn't the best approach, but I didn't feel able to write a real compiler (AST => bytecode).
Victor
-- Night gathers, and now my watch begins. It shall not end until my death.
METH_FASTCALL passing arguments on the stack doesn't necessarily mean it will be slow. On x86 there are calling conventions that read all the arguments from the stack, but the rest of the machine is register based. Python could also look at ABI calling conventions for inspiration, like x86-64, where arguments up to a fixed number are passed in registers and the rest are passed on the stack.

One thing that I am wondering is: would Python want to use a global set of registers and a global data stack, or continue to have a new data stack (and now registers) per call stack? If Python switched to a global stack and global registers, we may be able to eliminate a lot of instructions that just shuffle data from the caller's stack to the callee's stack.

On Tue, Feb 26, 2019 at 4:55 PM Victor Stinner <vstinner@redhat.com> wrote:
Hum, I read again my old REGISTERVM.txt that I wrote a few years ago.
A little bit more context. In my "registervm" fork I also tried to implement further optimizations like moving invariants out of the loop. Some optimizations could change the Python semantics, like remove "duplicated" LOAD_GLOBAL whereas the global might be modified in the middle. I wanted to experiment such optimizations. Maybe it was a bad idea to convert stack-based bytecode to register-based bytecode and experiment these optimizations at the same time.
Victor
Le mar. 26 févr. 2019 à 22:42, Victor Stinner <vstinner@redhat.com> a écrit :
No, I wasn't aware of this project. My starting point was:
http://static.usenix.org/events/vee05/full_papers/p153-yunhe.pdf Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl, 2005
See also my email to python-dev that I sent in 2012: https://mail.python.org/pipermail/python-dev/2012-November/122777.html
Ah, my main issue was my implementation is that I started without taking care of clearing registers when the stack-based bytecode implicitly cleared a reference (decref), like "POP_TOP" operation.
I added "CLEAR_REG" late in the development and it caused me troubles, and the "correct" register-based bytecode was less efficient than bytecode without CLEAR_REG. But my optimizer was very limited, too limited.
Another implementation issue that I had was to understand some "implicit usage" of the stack like try/except which do black magic, whereas I wanted to make everything explicit for registers. I'm talking about things like "POP_BLOCK" and "SETUP_EXCEPT". In my implementation, I kept support for stack-based bytecode, and so I had some inefficient code and some corner cases.
My approach was to convert stack-based bytecode to register-based bytecode on the fly. Having both in the same code allowed to me run some benchmarks. Maybe it wasn't the best approach, but I didn't feel able to write a real compiler (AST => bytecode).
Victor
Le mar. 26 févr. 2019 à 21:58, Neil Schemenauer <nas-python@python.ca>
a écrit :
On 2019-02-26, Victor Stinner wrote:
I made an attempt once and it was faster: https://faster-cpython.readthedocs.io/registervm.html
Interesting. I don't think I have seen that before. Were you aware of "Rattlesnake" before you started on that? It seems your approach is similar. Probably not, because I don't think it is easy to find. I uploaded a tarfile I had on my PC to my web site:
http://python.ca/nas/python/rattlesnake20010813/
It seems his name doesn't appear in the readme or source, but I think Rattlesnake was Skip Montanaro's project. I suppose my idea of unifying the local variables and the registers could have come from Rattlesnake. Very little new in the world. ;-P
Cheers,
Neil
-- Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev
Joe Jevnik via Python-Dev wrote:
If Python switched to a global stack and global registers we may be able to eliminate a lot of instructions that just shuffle data from the caller's stack to the callee's stack.
That would make implementing generators more complicated. -- Greg
On 2019-02-27, Greg Ewing wrote:
Joe Jevnik via Python-Dev wrote:
If Python switched to a global stack and global registers we may be able to eliminate a lot of instructions that just shuffle data from the caller's stack to the callee's stack.
That would make implementing generators more complicated.
Right. I wonder though, could we avoid allocating the Python frame object until we actually need it? Two situations when you need a heap-allocated frame come to mind immediately: generators that are suspended, and frames as part of a traceback. I guess sys._getframe() is another. Any more?

I'm thinking that perhaps for regular Python functions and regular calls, you could defer creating the full PyFrame object and put the locals, stack, etc. on the C stack. That would make calling Python functions a lot more similar to the machine calling convention and presumably could be much faster. If you do need the frame object, copy over the data from the C stack into the frame structure.

I'm sure there are all kinds of reasons why this idea is not easy to implement or not possible. It seems somewhat possible though. I wonder how IronPython works in this respect? Apparently it doesn't support sys._getframe().

Regards, Neil
On Tue, Feb 26, 2019 at 3:56 PM Neil Schemenauer <nas-python@arctrix.com> wrote:
Right. I wonder though, could we avoid allocating the Python frame object until we actually need it? Two situations when you need a heap allocated frame come to mind immediately: generators that are suspended and frames as part of a traceback. I guess sys._getframe() is another. Any more?
I've been thinking about that as well... I think in some ways the easy part of this is actually the reification of the frame itself. You can have a PyFrameObject which is just declared on the stack and add a new field to it which captures the address of the PyFrameObject* f (e.g. PyFrameObject **f_stackaddr). When you need to move to a heap-allocated one, you copy everything over as you say and update *f_stackaddr to point at the new heap address.

It seems a little bit annoying with the various levels of indirection from the frame getting created in PyEval_EvalCodeEx and flowing down into _PyEval_EvalFrameDefault - so there may need to be some breakage there for certain low-level tools. I'm also a little bit worried about things which go looking at PyThreadState and might make nasty assumptions about the frames already being heap allocated.

FYI IronPython does support sys._getframe(), you just need to run it with a special flag (and there are various levels - e.g. -X:Frames and -X:FullFrames, the latter of which guarantees your locals are in the frame too). IronPython is more challenged here in that it always generates "safe" code from a CLR perspective, and tracking the address of stack-allocated frame objects is therefore challenging (although maybe more possible now than before with various C# ref improvements).

I'm not sure exactly how much this approach would gain though... It seems like the frame caches are pretty effective, and a lot of the cost of them is initializing them / decref'ing the things which are still alive in them. But it doesn't seem like a super complicated change to try out... It's actually something I'd at least like to try prototyping at some point.
On Tue, Feb 26, 2019 at 9:58 PM, Neil Schemenauer <nas-python@python.ca> wrote:
It seems his name doesn't appear in the readme or source, but I think Rattlesnake was Skip Montanaro's project. I suppose my idea of unifying the local variables and the registers could have come from Rattlesnake. Very little new in the world. ;-P
In my implementation, constants, local variables and registers all live in the same array: frame.f_localsplus. Technically, there isn't much difference between a constant, a local variable, or a register. It's just the disassembler which has to worry about displaying "R3" or "x" depending on the register index ;-)

There was a LOAD_CONST_REG instruction in my implementation, but it was more to keep a smooth transition from the existing LOAD_CONST instruction. LOAD_CONST_REG could be avoided by passing the constant directly (ex: as a function argument). For example, I compiled "range(2, n)" as:

    LOAD_CONST_REG R0, 2 (const#2)
    LOAD_GLOBAL_REG R1, 'range' (name#0)
    CALL_FUNCTION_REG 4, R1, R1, R0, 'n'

Whereas it could be just:

    LOAD_GLOBAL_REG R1, 'range' (name#0)
    CALL_FUNCTION_REG 4, R1, R1, <const #2>, 'n'

Compare it to the stack-based bytecode:

    LOAD_GLOBAL 0 (range)
    LOAD_CONST 2 (const#2)
    LOAD_FAST 'n'
    CALL_FUNCTION 2 (2 positional, 0 keyword pair)

Victor
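The stack-based listing above can be reproduced with the `dis` module (exact opcode names differ across CPython versions - e.g. CALL_FUNCTION became CALL in 3.11 - but the load-then-call stack shape is the same):

```python
import dis

def f(n):
    return range(2, n)

# Show the stack-based instruction stream for comparison with the
# register-based form: operands are pushed, then the call consumes them.
for ins in dis.get_instructions(f):
    print(ins.opname, ins.argrepr)
```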
On Tue, Feb 26, 2019 at 11:40 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Victor Stinner wrote:
    LOAD_CONST_REG R0, 2 (const#2)
    LOAD_GLOBAL_REG R1, 'range' (name#0)
    CALL_FUNCTION_REG 4, R1, R1, R0, 'n'
Out of curiosity, why is the function being passed twice here?
Ah, I should have explained that :-) The first argument of CALL_FUNCTION_REG is the name of the register used to store the result.

The compiler begins by using static single assignment form (SSA) but then uses a register allocator to reduce the number of used registers. Usually, at the end you have fewer than 5 registers for a whole function.

Since R1 was only used to store the function before the call and isn't used after, the R1 register can be re-used. Using a different register may require an explicit "CLEAR_REG R1" (decref the reference to the builtin range function), which is less efficient.

Note: the CALL_FUNCTION instruction using the stack implicitly puts the result onto the stack (and "pops" the function arguments from the stack).

Victor
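The register-reuse idea can be sketched with a toy allocator (hypothetical names, not code from the registervm fork): each SSA value gets a physical register, and a register is recycled at its value's last use, which is how the call result can land back in the register that held the function:

```python
def allocate(instrs):
    """instrs: list of (dest, sources) over SSA names, each name defined
    before use. Returns (name -> physical register, register count),
    recycling a register as soon as its value's last use has passed."""
    last_use = {}
    for i, (dest, srcs) in enumerate(instrs):
        for s in srcs:
            last_use[s] = i

    mapping, free, next_reg = {}, [], 0
    for i, (dest, srcs) in enumerate(instrs):
        for s in srcs:
            if last_use[s] == i:          # dead after this instruction
                free.append(mapping[s])   # recycle its register
        if dest is not None:
            if free:
                mapping[dest] = free.pop()
            else:
                mapping[dest] = next_reg  # allocate a fresh register
                next_reg += 1
    return mapping, next_reg

# range(2, n): the function, the constant and n are loaded, then the
# call result reuses a register freed by its own operands.
instrs = [('n', []), ('v2', []), ('vrange', []),
          ('vres', ['vrange', 'v2', 'n'])]
mapping, nregs = allocate(instrs)
```

Four SSA values fit in three physical registers here, mirroring the "usually fewer than 5 registers for a whole function" observation.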
Victor Stinner wrote:
Using a different register may require an explicit "CLEAR_REG R1" (decref the reference to the builtin range function) which is less efficient.
Maybe the source operand fields of the bytecodes could have a flag indicating whether to clear the register after use. -- Greg
On 2019-02-27, Victor Stinner wrote:
The compiler begins with using static single assignment form (SSA) but then uses a register allocator to reduce the number of used registers. Usually, at the end you have less than 5 registers for a whole function.
In case anyone is interested in working on this, I dug up some discussion from years ago. Advice from Tim Peters:

[Python-Dev] Rattlesnake progress
https://mail.python.org/pipermail/python-dev/2002-February/020172.html
https://mail.python.org/pipermail/python-dev/2002-February/020182.html
https://mail.python.org/pipermail/python-dev/2002-February/020146.html

Doing a prototype register-based compiler in Python seems like a good idea. Using the 'compiler' package would give you a good start. I think this is the most recent version of that package:

https://github.com/pfalcon/python-compiler

Based on a little poking around, I think it has not been updated for the 16-bit word code. Shouldn't be too hard to make it work though.

I was thinking about the code format on the weekend. Using three-register opcodes seems a good idea. We could retain the 16-bit word code format. For opcodes that use three registers, use a second word for the last two registers. I.e.:

    <8 bit opcode><8 bit register #>
    <8 bit register #><8 bit register #>

Limit the number of registers to 256. If you run out, just push and pop from the stack. You want to keep the instruction decode path in the evaluation loop simple and not confuse the CPU branch predictor.

Regards, Neil
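The proposed two-word encoding can be sketched as follows (a sketch of the format described above, not existing CPython code; the function names are hypothetical):

```python
import struct

def encode3(opcode, r_dst, r_a, r_b):
    # <8 bit opcode><8 bit dst reg> then <8 bit reg a><8 bit reg b>:
    # two 16-bit words, each register field fitting in one byte.
    for r in (r_dst, r_a, r_b):
        if not 0 <= r < 256:
            # Per the proposal, indexes past 255 would spill to the stack.
            raise ValueError("register index > 255: spill to stack")
    return struct.pack('<BBBB', opcode, r_dst, r_a, r_b)

def decode3(code):
    # Decoding is a fixed-width unpack, keeping the dispatch path simple.
    return struct.unpack('<BBBB', code)
```

The fixed four-byte layout is what keeps the decode path branch-free, which is the point about not confusing the CPU branch predictor.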
Parrot got rather further along than Rattlesnake as a register-based VM. I don't think it ever really beat CPython in speed though. http://parrot.org/ On Mon, Mar 11, 2019, 5:57 PM Neil Schemenauer <nas-python@arctrix.com> wrote:
On 2019-02-27, Victor Stinner wrote:
The compiler begins with using static single assignment form (SSA) but then uses a register allocator to reduce the number of used registers. Usually, at the end you have less than 5 registers for a whole function.
In case anyone is interested in working on this, I dug up some discussion from years ago. Advice from Tim Peters:
[Python-Dev] Rattlesnake progress https://mail.python.org/pipermail/python-dev/2002-February/020172.html https://mail.python.org/pipermail/python-dev/2002-February/020182.html https://mail.python.org/pipermail/python-dev/2002-February/020146.html
Doing a prototype register-based compiler in Python seems like a good idea. Using the 'compiler' package would give you a good start. I think this is the most recent version of that package:
https://github.com/pfalcon/python-compiler
Based on a little poking around, I think it has not been updated for the 16-bit word code. Shouldn't be too hard to make it work though.
I was thinking about the code format on the weekend. Using three-register opcodes seems a good idea. We could retain the 16-bit word code format. For opcodes that use three registers, use a second word for the last two registers. I.e.
<8 bit opcode><8 bit register #> <8 bit register #><8 bit register #>
Limit the number of registers to 256. If you run out, just push and pop from stack. You want to keep the instruction decode path in the evaluation loop simple and not confuse the CPU branch predictor.
Regards,
Neil
On Mon, 11 Mar 2019 18:03:26 -0400 David Mertz <mertz@gnosis.cx> wrote:
Parrot got rather further along than Rattlesnake as a register-based VM. I don't think it ever really beat CPython in speed though.
But Parrot also had a "generic" design that was supposed to cater for all dynamic programming languages. AFAIU, it was heavily over-engineered and under-staffed, and the design documents were difficult to understand. Regards Antoine.
I uploaded a tarfile I had on my PC to my web site:
http://python.ca/nas/python/rattlesnake20010813/
It seems his name doesn't appear in the readme or source, but I think Rattlesnake was Skip Montanaro's project. I suppose my idea of unifying the local variables and the registers could have come from Rattlesnake. Very little new in the world. ;-P
A lot of water under the bridge since then. I would have to poke around a bit, but I think "from module import *" stumped me long enough that I got distracted by some other shiny thing. S
On Feb 25, 2019, at 8:23 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
So it looks like commit ef4ac967 is not responsible for a performance regression.
I did narrow it down to that commit and I can consistently reproduce the timing differences. That said, I'm only observing the effect when building with the Mac default Clang (Apple LLVM version 10.0.0, clang-1000.11.45.5). When building with GCC 8.3.0, there is no change in performance. I conclude this is only an issue for Mac builds.
I ran the "performance" suite (https://github.com/python/performance), which has 57 different benchmarks.
Many of those benchmarks don't measure eval-loop performance. Instead, they exercise json, pickle, sqlite, etc. So I would expect no change in many of those because they weren't touched.

Victor said he generally doesn't care about 5% regressions. That makes sense for odd corners of Python. The reason I was concerned about this one is that it hits the eval-loop and seems to affect every single opcode. The regression applies somewhat broadly, increasing the cost of reading and writing local variables by about 20%.

That said, it seems to be compiler specific and only affects the Mac builds, so maybe we can decide that we don't care.

Raymond
On Tue, Feb 26, 2019 at 10:45 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
Victor said he generally doesn't care about 5% regressions. That makes sense for odd corners of Python. The reason I was concerned about this one is that it hits the eval-loop and seems to affect every single opcode. The regression applies somewhat broadly (increasing the cost of reading and writing local variables by about 20%).
I ignore changes smaller than 5% because they are usually what I call the "noise" of the benchmark. It means that testing 3 commits gives 3 different timings, even if the commits don't touch anything used in the benchmark. There are multiple explanations: PGO compilation is not deterministic, some benchmarks are too close to the performance of the CPU L1-instruction cache and so are heavily impacted by "code locality" (the exact address in memory), and many other things. Hum, sometimes running the same benchmark on the same code on the same hardware with the same strict procedure gives different timings at each attempt. At some point, I decided to give up on these 5% so as not to lose my mind :-) Victor
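That run-to-run noise floor is easy to observe directly (numbers are machine-dependent; this is only an illustration of the effect, not a calibrated benchmark):

```python
import timeit

# Five runs of the same trivial statement on the same interpreter
# typically spread by a few percent -- the "noise" described above.
times = timeit.repeat(stmt='x = 1', repeat=5, number=100_000)
spread = (max(times) - min(times)) / min(times)
print(f"min={min(times):.4f}s spread={spread:.1%}")
```

Any regression smaller than the observed spread is indistinguishable from this noise without PGO/LTO builds and a controlled environment.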
On 2019-02-26, Raymond Hettinger wrote:
That said, I'm only observing the effect when building with the Mac default Clang (Apple LLVM version 10.0.0, clang-1000.11.45.5). When building with GCC 8.3.0, there is no change in performance.
My guess is that the code in _PyEval_EvalFrameDefault() got changed enough that Clang started emitting a bit different machine code. If the conditional jumps are a bit different, I understand that could have a significant difference on performance.

Are you compiling with --enable-optimizations (i.e. PGO)? In my experience, that is needed to get meaningful results. Victor also mentions that on his "how-to-get-stable-benchmarks" page. Building with PGO is really (really) slow so I suspect you are not doing it when bisecting. You can speed it up greatly by using a simpler command for PROFILE_TASK in Makefile.pre.in. E.g.

    PROFILE_TASK=$(srcdir)/my_benchmark.py

Now that you have narrowed it down to a single commit, it would be worth doing the comparison with PGO builds (assuming Clang supports that).
That said, it seems to be compiler specific and only affects the Mac builds, so maybe we can decide that we don't care.
I think the key question is if the ceval loop got a bit slower due to logic changes or if Clang just happened to generate a bit worse code due to source code details. A PGO build could help answer that. I suppose trying to compare machine code is going to produce too large of a diff. Could you try hoisting the eval_breaker expression, as suggested by Antoine: https://discuss.python.org/t/profiling-cpython-with-perf/940/2 If you think a slowdown affects most opcodes, I think the DISPATCH change looks like the only cause. Maybe I missed something though. Also, maybe there would be some value in marking key branches as likely/unlikely if it helps Clang generate better machine code. Then, even if you compile without PGO (as many people do), you still get the better machine code. Regards, Neil
Hi,

PGO compilation is very slow. I tried very hard to avoid it. I started to annotate the C code with various GCC attributes like "inline", "always_inline", "hot", etc. I also experimented with the likely/unlikely Linux macros which use __builtin_expect(). At the end... my efforts were worthless. I still had a *major* issue (a benchmark *suddenly* 68% slower! WTF?) with code locality and I decided to give up. You can still find some macros like _Py_HOT_FUNCTION and _Py_NO_INLINE in Python ;-) (_Py_NO_INLINE is used to reduce stack memory usage, that's a different story.)

My sad story with code placement: https://vstinner.github.io/analysis-python-performance-issue.html

tl;dr Use PGO.

--

Since that time, I removed call_method from pyperformance to fix the root issue: don't waste your time on micro-benchmarks ;-) ... But I kept these micro-benchmarks in a different project: https://github.com/vstinner/pymicrobench

For some specific needs (taking a decision on a specific optimization), sometimes micro-benchmarks are still useful ;-)

Victor

On Tue, Feb 26, 2019 at 11:31 PM, Neil Schemenauer <nas-python@python.ca> wrote:
On 2019-02-26, Raymond Hettinger wrote:
That said, I'm only observing the effect when building with the Mac default Clang (Apple LLVM version 10.0.0, clang-1000.11.45.5). When building with GCC 8.3.0, there is no change in performance.
My guess is that the code in _PyEval_EvalFrameDefault() got changed enough that Clang started emitting a bit different machine code. If the conditional jumps are a bit different, I understand that could have a significant difference on performance.
Are you compiling with --enable-optimizations (i.e. PGO)? In my experience, that is needed to get meaningful results. Victor also mentions that on his "how-to-get-stable-benchmarks" page. Building with PGO is really (really) slow so I suspect you are not doing it when bisecting. You can speed it up greatly by using a simpler command for PROFILE_TASK in Makefile.pre.in. E.g.
PROFILE_TASK=$(srcdir)/my_benchmark.py
Now that you have narrowed it down to a single commit, it would be worth doing the comparison with PGO builds (assuming Clang supports that).
That said, it seems to be compiler specific and only affects the Mac builds, so maybe we can decide that we don't care.
I think the key question is if the ceval loop got a bit slower due to logic changes or if Clang just happened to generate a bit worse code due to source code details. A PGO build could help answer that. I suppose trying to compare machine code is going to produce too large of a diff.
Could you try hoisting the eval_breaker expression, as suggested by Antoine:
https://discuss.python.org/t/profiling-cpython-with-perf/940/2
If you think a slowdown affects most opcodes, I think the DISPATCH change looks like the only cause. Maybe I missed something though.
Also, maybe there would be some value in marking key branches as likely/unlikely if it helps Clang generate better machine code. Then, even if you compile without PGO (as many people do), you still get the better machine code.
Regards,
Neil
On Wed, Feb 27, 2019 at 12:17 AM, Victor Stinner <vstinner@redhat.com> wrote:
My sad story with code placement: https://vstinner.github.io/analysis-python-performance-issue.html
tl; dr Use PGO.
Hum wait, this article isn't complete. You have to see the follow-up: https://bugs.python.org/issue28618#msg286662

""" Victor: "FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html Sadly, it seems like I was just lucky when adding __attribute__((hot)) fixed the issue, because call_method is slow again!"

I upgraded the speed-python server (running benchmarks) to Ubuntu 16.04 LTS to support PGO compilation. I removed all old benchmark results and ran benchmarks again with LTO+PGO. It seems like benchmark results are much better now.

I'm not sure anymore that _Py_HOT_FUNCTION is really useful to get stable benchmarks, but it may help code placement a little bit. I don't think that it hurts, so I suggest we keep it. Since benchmarks were still unstable with _Py_HOT_FUNCTION, I'm not interested in continuing to tag more functions with _Py_HOT_FUNCTION. I will now focus on LTO+PGO for stable benchmarks, and ignore small performance differences when PGO is not used. I close this issue now. """

Now I recall that I tried hard to avoid PGO: the server used by speed.python.org to run benchmarks didn't support PGO. I fixed the issue by upgrading Ubuntu :-) Now speed.python.org uses PGO. I stopped trying to manually help the compiler with code placement.

Victor
On Feb 26, 2019, at 2:28 PM, Neil Schemenauer <nas-python@python.ca> wrote:
Are you compiling with --enable-optimizations (i.e. PGO)? In my experience, that is needed to get meaningful results.
I'm not, and I would worry that PGO would give less stable comparisons because it is highly sensitive to changes in its training set as well as in the actual CPython implementation (two moving targets instead of one). That said, it doesn't really matter to the world how I build *my* Python. We're trying to keep performant the ones that people actually use. For the Mac, I think there are only four that matter:

1) The one we distribute on the python.org website at https://www.python.org/ftp/python/3.8.0/python-3.8.0a2-macosx10.9.pkg
2) The one installed by homebrew
3) The way folks typically roll their own: $ ./configure && make (or some variant of make install)
4) The one shipped by Apple and put in /usr/bin

Of the four, the ones I've been timing are #1 and #3.

I'm happy to drop this. I was looking for independent confirmation and didn't get it. We can't move forward unless someone else also observes a consistently measurable regression for a benchmark they care about on a build that they care about. If I'm the only one who notices, then it really doesn't matter. Also, it was reassuring to not see the same effect on a GCC-8 build. Since the effect seems to be compiler specific, it may be that we knocked it out of a local minimum and that performance will return the next time someone touches the eval-loop.

Raymond
Raymond Hettinger writes:
We're trying to keep performant the ones that people actually use. For the Mac, I think there are only four that matter:
1) The one we distribute on the python.org website at https://www.python.org/ftp/python/3.8.0/python-3.8.0a2-macosx10.9.pkg
2) The one installed by homebrew
3) The way folks typically roll their own: $ ./configure && make (or some variant of make install)
4) The one shipped by Apple and put in /usr/bin
I don't see the relevance of (4) since we're talking about the bleeding edge AFAICT. Not clear about Homebrew -- since I've been experimenting with it recently I use the bottled versions, which aren't bleeding edge. If prebuilt packages matter, I would add MacPorts (or substitute it for (4) since nothing seems to get Apple's attention) and Anaconda (which is what I recommend to my students). But I haven't looked at MacPorts' recent download stats, and maybe I'm just the odd one out. Steve -- Associate Professor Division of Policy and Planning Science http://turnbull.sk.tsukuba.ac.jp/ Faculty of Systems and Information Email: turnbull@sk.tsukuba.ac.jp University of Tsukuba Tel: 029-853-5175 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
Hi, just curious on this. On 2/25/19 5:54 AM, Raymond Hettinger wrote:

I've been running benchmarks that have been stable for a while. But between today and yesterday, there has been an almost across-the-board performance regression. It's possible that this is a measurement error or something unique to my system (my Mac installed the 10.14.3 release today), so I'm hoping other folks can run checks as well.

Aren't the build bots catching/measuring those regressions? Or what are the current impediments here? Thanks in advance! --francis
participants (15)
- Antoine Pitrou
- David Mertz
- Dino Viehland
- Eric Snow
- francismb
- Greg Ewing
- Guido van Rossum
- Jeroen Demeyer
- Joe Jevnik
- Neil Schemenauer
- Neil Schemenauer
- Raymond Hettinger
- Skip Montanaro
- Stephen J. Turnbull
- Victor Stinner