[Speed] When CPython performance depends on dead code...

Brett Cannon brett at python.org
Wed Apr 27 14:30:23 EDT 2016


My first intuition is that some cache somewhere is unhappy with the
varying sizes. Have you tried any of this on another machine to see
whether the results are consistent?

On Wed, 27 Apr 2016 at 08:06 Victor Stinner <victor.stinner at gmail.com>
wrote:

>  Hi,
>
> I'm working on an experimental change to CPython introducing a new
> "fast call" calling convention for Python and C functions. It passes
> an array of PyObject* and the number of arguments as a C int
> (PyObject **stack, int nargs) instead of using a temporary tuple
> (PyObject *args). The expectation is that avoiding the tuple
> creation makes Python faster.
> http://bugs.python.org/issue26814
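>
> To give an idea, here is a minimal sketch of the two conventions (not
> the exact patch; error handling and details trimmed):
> ---------------------------
> #include <Python.h>
>
> /* Current convention: pack the arguments into a temporary tuple. */
> PyObject *call_with_tuple(PyObject *func, PyObject *a, PyObject *b)
> {
>     PyObject *args = PyTuple_Pack(2, a, b);   /* allocates a tuple */
>     PyObject *res;
>     if (args == NULL)
>         return NULL;
>     res = PyObject_Call(func, args, NULL);
>     Py_DECREF(args);                          /* destroys the tuple */
>     return res;
> }
>
> /* "Fast call" convention: pass a C array and a count, no tuple.
>    _PyObject_FastCall() is the helper added by the patch, so this
>    only builds against the patched tree. */
> PyObject *call_fast(PyObject *func, PyObject *a, PyObject *b)
> {
>     PyObject *stack[2] = {a, b};
>     return _PyObject_FastCall(func, stack, 2);
> }
> ---------------------------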
>
> First microbenchmarks on optimized code are promising: between 18% and
> 44% faster.
> http://bugs.python.org/issue26814#msg263999
> http://bugs.python.org/issue26814#msg264003
>
> But I was quickly blocked on "macrobenchmarks" (?): running the
> Python benchmark suite says that many benchmarks are between 2% and
> 15% slower. I spent hours (days) investigating the issue using
> Cachegrind, Callgrind, Linux perf, strace, ltrace, etc., but I was
> unable to understand how my change can make CPython slower.
>
> My change is quite big: "34 files changed, 3301 insertions(+), 730
> deletions(-)". In fact, the performance regression can be reproduced
> easily with a few lines of C code: see the attached patches. You only
> have to add some *unused* (dead) code to see a "glitch" in
> performance. It's even worse: the performance change depends on the
> size of the unused code.
>
> I did my best to isolate the microbenchmark to make it as reliable as
> possible. Results of bm_call_simple on my desktop PC:
>
> (a) Reference:
> Average: 1201.0 ms +/- 0.2 ms (min: 1200.7 ms, max: 1201.2 ms)
>
> (b) Add 2 unused functions, based on (a):
> Average: 1273.0 ms +/- 1.8 ms (min: 1270.1 ms, max: 1274.4 ms)
>
> (c) Add 1 unused short function ("return NULL;"), based on (a):
> Average: 1169.6 ms +/- 0.2 ms (min: 1169.3 ms, max: 1169.8 ms)
>
> (b) and (c) are two versions that only add unused code to (a); the
> difference between them is the size of the unused code. The problem
> is that (b) makes the code slower and (c) makes it faster (!),
> whereas I would not expect any performance change at all.
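>
> For illustration, the short function added by (c) is essentially the
> following (see the attached patches for the exact code; the name here
> is made up):
> ---------------------------
> /* Exported (non-static) so the compiler cannot drop it: the function
>    is never called, it only changes the size and layout of the
>    binary. */
> PyObject *unused_function(void)
> {
>     return NULL;
> }
> ---------------------------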
>
> A sane person should ignore such a minor performance delta (+72 ms =
> +6% //  -31.4 ms = -3%). Right. But for optimization patches on
> CPython, we use the CPython benchmark suite as proof that yeah, the
> change really makes CPython faster, as announced.
>
> I compiled the C code using GCC (5.3) and Clang (3.7) with various
> options: -O0, -O3, -fno-align-functions, -falign-functions=N (with
> N=1, 2, 6, 12), -fomit-frame-pointer, -flto, etc. In short, the
> performance looks "random". I'm unable to correlate the performance
> with any Linux perf event. IMHO the performance depends on something
> low-level like the L1 cache, the CPU pipeline, branch prediction,
> etc. As I wrote, I'm unable to verify that.
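>
> (A per-function variant of -falign-functions that I did not try:
> pinning each hot function to a fixed alignment with a function
> attribute, so that dead code elsewhere cannot move it:)
> ---------------------------
> /* Force the first instruction of the function onto a 64-byte (cache
>    line) boundary, whatever code is added around it. GCC and Clang
>    both accept this attribute on functions. */
> __attribute__((aligned(64)))
> int hot_function(int x)
> {
>     return x + 1;
> }
> ---------------------------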
>
> To reproduce my issue, you can use the following commands:
> ---------------------------
> hg clone https://hg.python.org/cpython fastcall
> # or: "hg clone (...)/cpython fastcall"
> # if you already have a local copy of cpython ;-)
> cd fastcall
> ./configure -C
>
> # build reference binary
> hg up -C -r 496e094f4734
> patch -p1 < prepare.patch
> make && mv python python-ref
>
> # build binary with deadcode 1
> hg up -C -r 496e094f4734
> patch -p1 < prepare.patch
> patch -p1 < deadcode1.patch
> make && mv python python-deadcode1
>
> # build binary with deadcode 2
> hg up -C -r 496e094f4734
> patch -p1 < prepare.patch
> patch -p1 < deadcode2.patch
> make && mv python python-deadcode2
>
> # run benchmark
> PYTHONHASHSEED=0 ./python-ref bm_call_simple.py
> PYTHONHASHSEED=0 ./python-deadcode1 bm_call_simple.py
> PYTHONHASHSEED=0 ./python-deadcode2 bm_call_simple.py
> ---------------------------
>
> I suggest you isolate at least one CPU and run the benchmark on the
> isolated CPUs to get reliable timings:
> ---------------------------
> # run the benchmark on CPU #2
> PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
> PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
> PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
> ---------------------------
> My notes on CPU isolation:
> http://haypo-notes.readthedocs.org/microbenchmark.html
>
> If you don't want to try CPU isolation, try to get an idle system
> and/or run the benchmark many times until the standard deviation (the
> "+/- ..." part) looks small enough...
>
> Don't try to run the microbenchmark without PYTHONHASHSEED=0 or you
> will get random results depending on the secret hash key used by the
> randomized hash function. (Or modify the code to spawn enough child
> processes to get a uniform distribution ;-))
>
> I don't expect you to get the same numbers as me. For example, on my
> laptop, the delta is very small (1-2%):
>
> $ PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
> Average: 1096.1 ms +/- 12.9 ms (min: 1079.5 ms, max: 1110.3 ms)
>
> $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
> Average: 1109.2 ms +/- 11.1 ms (min: 1095.8 ms, max: 1122.9 ms)
> => +1% (+13 ms)
>
> $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
> Average: 1072.0 ms +/- 1.5 ms (min: 1070.0 ms, max: 1073.9 ms)
> => -2% (-24 ms)
>
> CPU of my desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4
> physical cores with hyper-threading
> CPU of my laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz - 2
> physical cores with hyper-threading
>
> I modified bm_call_simple.py to call foo() 100 times rather than 20 in
> the loop to see the issue more easily. I also removed dependencies and
> changed the output format to display average, standard deviation,
> minimum and maximum.
>
> For more benchmarks, see the attached deadcode1.log and
> deadcode2.log: results of the CPython benchmark suite comparing
> deadcode1 vs reference and deadcode2 vs reference, run on my desktop
> PC (perf.py --fast & CPU isolation). Again, deadcode1 looks slower in
> most cases and deadcode2 looks faster in most cases, even though the
> only difference is still dead code...
>
> Victor, disappointed
> _______________________________________________
> Speed mailing list
> Speed at python.org
> https://mail.python.org/mailman/listinfo/speed
>