[Python-Dev] Possible performance regression

Victor Stinner vstinner at redhat.com
Tue Feb 26 18:17:33 EST 2019


Hi,

PGO compilation is very slow. I tried very hard to avoid it.

I started to annotate the C code with various GCC attributes like
"inline", "always_inline", "hot", etc. I also experimented with the
Linux-style likely/unlikely macros, which use __builtin_expect(). In
the end, my efforts were fruitless: I still hit a *major* code
locality issue (a benchmark *suddenly* 68% slower! WTF?) and I decided
to give up. You can still find some macros like _Py_HOT_FUNCTION and
_Py_NO_INLINE in Python ;-) (_Py_NO_INLINE is used to reduce stack
memory usage, but that's a different story.)

My sad story with code placement:
https://vstinner.github.io/analysis-python-performance-issue.html

tl;dr: Use PGO.

--

Since that time, I removed call_method from pyperformance to fix the
root issue: don't waste your time on micro-benchmarks ;-) ... But I
kept these micro-benchmarks in a different project:
https://github.com/vstinner/pymicrobench

For some specific needs (making a decision about a specific
optimization), micro-benchmarks are sometimes still useful ;-)

Victor

On Tue, Feb 26, 2019 at 11:31 PM Neil Schemenauer <nas-python at python.ca> wrote:
>
> On 2019-02-26, Raymond Hettinger wrote:
> > That said, I'm only observing the effect when building with the
> > Mac default Clang (Apple LLVM version 10.0.0 (clang-1000.11.45.5)).
> > When building with GCC 8.3.0, there is no change in performance.
>
> My guess is that the code in _PyEval_EvalFrameDefault() got changed
> enough that Clang started emitting somewhat different machine code.
> If the conditional jumps are laid out differently, I understand that
> could have a significant effect on performance.
>
> Are you compiling with --enable-optimizations (i.e. PGO)?  In my
> experience, that is needed to get meaningful results.  Victor also
> mentions that on his "how-to-get-stable-benchmarks" page.  Building
> with PGO is really (really) slow so I suspect you are not doing it
> when bisecting.  You can speed it up greatly by using a simpler
> command for PROFILE_TASK in Makefile.pre.in.  E.g.
>
>     PROFILE_TASK=$(srcdir)/my_benchmark.py
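
A sketch of what that might look like in practice (my_benchmark.py is
a placeholder for whatever workload you care about, not a file that
ships with CPython):

```shell
# Sketch: PGO build of CPython with a lighter profiling task.
# The training run's quality determines the quality of the optimized
# build, so the placeholder script should exercise the code you measure.
./configure --enable-optimizations
make PROFILE_TASK='my_benchmark.py'
```

Overriding PROFILE_TASK on the make command line avoids editing
Makefile.pre.in at all.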
>
> Now that you have narrowed it down to a single commit, it would be
> worth doing the comparison with PGO builds (assuming Clang supports
> that).
>
> > That said, it seems to be compiler specific and only affects the
> > Mac builds, so maybe we can decide that we don't care.
>
> I think the key question is if the ceval loop got a bit slower due
> to logic changes or if Clang just happened to generate a bit worse
> code due to source code details.  A PGO build could help answer
> that.  I suppose trying to compare machine code is going to produce
> too large of a diff.
>
> Could you try hoisting the eval_breaker expression, as suggested by
> Antoine:
>
>     https://discuss.python.org/t/profiling-cpython-with-perf/940/2
>
> If the slowdown affects most opcodes, the DISPATCH change looks like
> the only plausible cause.  Maybe I missed something though.
>
> Also, maybe there would be some value in marking key branches as
> likely/unlikely if it helps Clang generate better machine code.
> Then, even if you compile without PGO (as many people do), you still
> get the better machine code.
>
> Regards,
>
>   Neil
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/vstinner%40redhat.com



-- 
Night gathers, and now my watch begins. It shall not end until my death.

