
Hi,

PGO compilation is very slow, and I tried very hard to avoid it.

I started to annotate the C code with various GCC attributes like "inline", "always_inline", "hot", etc. I also experimented with the likely/unlikely Linux macros, which use __builtin_expect(). In the end, my efforts were worthless. I still had a *major* issue with code locality (a benchmark *suddenly* 68% slower! WTF?) and I decided to give up. You can still find some macros like _Py_HOT_FUNCTION and _Py_NO_INLINE in Python ;-) (_Py_NO_INLINE is used to reduce stack memory usage, that's a different story.)

My sad story with code placement:
https://vstinner.github.io/analysis-python-performance-issue.html

tl;dr Use PGO.

--

Since that time, I removed call_method from pyperformance to fix the root issue: don't waste your time on micro-benchmarks ;-) ... But I kept these micro-benchmarks in a different project:
https://github.com/vstinner/pymicrobench

For some specific needs (making a decision on a specific optimization), micro-benchmarks are sometimes still useful ;-)

Victor

On Tue, 26 Feb 2019 at 23:31, Neil Schemenauer <nas-python@python.ca> wrote:
On 2019-02-26, Raymond Hettinger wrote:
That said, I'm only observing the effect when building with the Mac default Clang (Apple LLVM version 10.0.0 (clang-1000.11.45.5)). When building with GCC 8.3.0, there is no change in performance.
My guess is that the code in _PyEval_EvalFrameDefault() got changed enough that Clang started emitting a bit different machine code. If the conditional jumps are a bit different, I understand that could have a significant difference on performance.
Are you compiling with --enable-optimizations (i.e. PGO)? In my experience, that is needed to get meaningful results. Victor also mentions that on his "how-to-get-stable-benchmarks" page. Building with PGO is really (really) slow so I suspect you are not doing it when bisecting. You can speed it up greatly by using a simpler command for PROFILE_TASK in Makefile.pre.in. E.g.
PROFILE_TASK=$(srcdir)/my_benchmark.py
Now that you have narrowed it down to a single commit, it would be worth doing the comparison with PGO builds (assuming Clang supports that).
That said, it seems to be compiler specific and only affects the Mac builds, so maybe we can decide that we don't care.
I think the key question is if the ceval loop got a bit slower due to logic changes or if Clang just happened to generate a bit worse code due to source code details. A PGO build could help answer that. I suppose trying to compare machine code is going to produce too large of a diff.
Could you try hoisting the eval_breaker expression, as suggested by Antoine:
https://discuss.python.org/t/profiling-cpython-with-perf/940/2
If you think a slowdown affects most opcodes, I think the DISPATCH change looks like the only cause. Maybe I missed something though.
Also, maybe there would be some value in marking key branches as likely/unlikely if it helps Clang generate better machine code. Then, even if you compile without PGO (as many people do), you still get the better machine code.
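For readers unfamiliar with the idea, here is a minimal sketch of what marking a branch as unlikely can look like with GCC or Clang. The LIKELY/UNLIKELY macro names and the sum_non_negative() function are hypothetical, made up for illustration; they are modeled on the Linux kernel's likely()/unlikely() macros and are not taken from CPython. Only __builtin_expect() is a real compiler builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macros modeled on the Linux kernel's likely()/unlikely().
 * __builtin_expect() is a GCC/Clang builtin; other compilers get a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Illustrative function (an assumption, not CPython code): sum the
 * non-negative values and count negative ones as rare "errors". */
static long sum_non_negative(const int *values, size_t n, int *n_errors)
{
    assert(values != NULL && n_errors != NULL);

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that the error path is rarely taken, so the compiler
         * can keep the common path as straight-line code. */
        if (UNLIKELY(values[i] < 0)) {
            (*n_errors)++;
            continue;
        }
        total += values[i];
    }
    return total;
}
```

The hint only influences code layout and never changes the result, and whether it helps at all depends on the compiler and the surrounding code, which is the kind of fragility described at the top of this thread.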
Regards,
Neil
-- Night gathers, and now my watch begins. It shall not end until my death.