Re: [Python-Dev] Possible performance regression

Feb. 26, 2019

Hi,

PGO compilation is very slow. I tried very hard to avoid it.

I started to annotate the C code with various GCC attributes like
"inline", "always_inline", "hot", etc. I also experimented with the
likely/unlikely Linux macros, which use __builtin_expect(). In the
end... my efforts were worthless. I still had *major* issues (benchmark
*suddenly* 68% slower! WTF?) with code locality and I decided to give
up. You can still find some macros like _Py_HOT_FUNCTION and
_Py_NO_INLINE in Python ;-) (_Py_NO_INLINE is used to reduce stack
memory usage, that's a different story.)

My sad story with code placement:
https://vstinner.github.io/analysis-python-performance-issue.html

tl;dr: Use PGO.

--

Since that time, I removed call_method from pyperformance to fix the
root issue: don't waste your time on micro-benchmarks ;-) ... But I
kept these micro-benchmarks in a different project:
https://github.com/vstinner/pymicrobench

For some specific needs (making a decision on a specific optimization),
sometimes micro-benchmarks are still useful ;-)
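
To make that concrete, here is a minimal sketch of such a micro-benchmark. It uses only the standard-library timeit module. The Foo class and the bench_method_call() helper are hypothetical names for illustration and are not taken from pymicrobench. The minimum over several repeats is reported because it is the value least disturbed by system noise.

```python
import timeit


class Foo:
    """Hypothetical class; only the cost of calling method() matters."""

    def method(self):
        return None


def bench_method_call(number=100_000, repeat=5):
    """Return the best observed time per bound-method call, in seconds."""
    obj = Foo()
    # timeit.repeat() accepts a callable and returns one total per repeat.
    timings = timeit.repeat(obj.method, number=number, repeat=repeat)
    return min(timings) / number


if __name__ == "__main__":
    print(f"method call: {bench_method_call() * 1e9:.1f} ns")
```

A number like this is only comparable between two builds made with the same compiler and the same flags, which is the whole point of the story above.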

Victor

On Tue, Feb 26, 2019 at 23:31, Neil Schemenauer <nas-python@python.ca> wrote:
...
On 2019-02-26, Raymond Hettinger wrote:
...
That said, I'm only observing the effect when building with the
Mac default Clang (Apple LLVM version 10.0.0 (clang-1000.11.45.5)).
When building with GCC 8.3.0, there is no change in performance.
My guess is that the code in _PyEval_EvalFrameDefault() got changed
enough that Clang started emitting slightly different machine code.  If
the conditional jumps are a bit different, I understand that could
make a significant difference in performance.
Are you compiling with --enable-optimizations (i.e. PGO)?  In my
experience, that is needed to get meaningful results.  Victor also
mentions that on his "how-to-get-stable-benchmarks" page.  Building
with PGO is really (really) slow so I suspect you are not doing it
when bisecting.  You can speed it up greatly by using a simpler
command for PROFILE_TASK in Makefile.pre.in.  E.g.
PROFILE_TASK=$(srcdir)/my_benchmark.py
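
As a rough illustration, a my_benchmark.py used this way only needs to exercise the code paths of the benchmark being bisected. The sketch below assumes the regression shows up in method-call-heavy code. The Foo class, the run() function and the loop count are made up for the example and are not part of CPython's default profile task.

```python
# my_benchmark.py: hypothetical PGO training workload (a sketch only).
# It is far smaller than the default profile task, so the profiling run is
# much shorter, but the profile then covers only what this script executes.


class Foo:
    def method(self, a, b, c):
        return a + b + c


def run(loops=200_000):
    """Call a method in a tight loop so the profile covers method dispatch."""
    obj = Foo()
    total = 0
    for i in range(loops):
        total += obj.method(i, 1, 2)
    return total


if __name__ == "__main__":
    run()
```

The trade-off is that a profile this narrow is tuned to one workload, so it suits bisecting a single benchmark better than producing a general-purpose build.
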
Now that you have narrowed it down to a single commit, it would be
worth doing the comparison with PGO builds (assuming Clang supports
that).
...
That said, it seems to be compiler specific and only affects the
Mac builds, so maybe we can decide that we don't care.
I think the key question is if the ceval loop got a bit slower due
to logic changes or if Clang just happened to generate a bit worse
code due to source code details.  A PGO build could help answer
that.  I suppose trying to compare machine code is going to produce
too large of a diff.
Could you try hoisting the eval_breaker expression, as suggested by
Antoine:
https://discuss.python.org/t/profiling-cpython-with-perf/940/2
If you think a slowdown affects most opcodes, I think the DISPATCH
change looks like the only cause.  Maybe I missed something though.
Also, maybe there would be some value in marking key branches as
likely/unlikely if it helps Clang generate better machine code.
Then, even if you compile without PGO (as many people do), you still
get the better machine code.
Regards,
Neil
-- 
Night gathers, and now my watch begins. It shall not end until my death.

Victor Stinner