I'm working on an experimental change of CPython introducing a new "fast call" calling convention for Python and C functions. It pass an array of PyObject* and a number of arguments as an C int (PyObject **stack, int nargs) instead of using a temporary tuple (PyObject *args). The expectation is that avoiding the creation makes Python faster. http://bugs.python.org/issue26814
But I was quickly blocked on "macrobenchmarks" (?): running the Python benchmark suite says that many benchmarks are between 2% and 15% slower. I spent hours (days) to investigate the issue using Cachegrind, Callgrind, Linux perf, strace, ltrace, etc., but I was unable to understand how my change can makes CPython slower.
My change is quite big: "34 files changed, 3301 insertions(+), 730 deletions(-)". In fact, the performance regression can be reproduced easily with a few lines of C code: see attached patches. You only have to add some *unused* (dead) code to see a "glitch" in performance. It's even worse: the performance change depends on the size of unused code.
I done my best to isolate the microbenchmark to make it as reliable as possible. Results of bm_call_simple on my desktop PC:
(a) Reference: Average: 1201.0 ms +/- 0.2 ms (min: 1200.7 ms, max: 1201.2 ms)
(b) Add 2 unused functions, based on (a): Average: 1273.0 ms +/- 1.8 ms (min: 1270.1 ms, max: 1274.4 ms)
(c) Add 1 unused short function ("return NULL;"), based on (a): Average: 1169.6 ms +/- 0.2 ms (min: 1169.3 ms, max: 1169.8 ms)
(b) and (c) are 2 versions only adding unused code to (a). The difference between (b) and (c) is the size of unused code. The problem is that (b) makes the code slower and (c) makes the code faster (!), whereas I would not expect any performance change.
A sane person should ignore such minor performance delta (+72 ms = +6% // -31.4 ms = -3%). Right. But for optimization patches on CPython, we use the CPython benchmark suite as a proof that yeah, the change really makes CPython faster, as announced.
I compiled the C code using GCC (5.3) and Clang (3.7) using various options: -O0, -O3, -fno-align-functions, -falign-functions=N (with N=1, 2, 6, 12), -fomit-frame-pointer, -flto, etc. In short, the performance looks "random". I'm unable to correlate the performance with any Linux perf event. IMHO the performance depends on something low level like L1 cache, CPU pipeline, branch prediction, etc. As I wrote, I'm unable to verify that.
hg clone https://hg.python.org/cpython fastcall # or: "hg clone (...)/cpython fastcall" # if you already have a local copy of cpython ;-) cd fastcall ./configure -C
# build reference binary hg up -C -r 496e094f4734 patch -p1 < prepare.patch make && mv python python-ref
# build binary with deadcode 1 hg up -C -r 496e094f4734 patch -p1 < prepare.patch patch -p1 < deadcode1.patch make && mv python python-deadcode1
# build binary with deadcode 2 hg up -C -r 496e094f4734 patch -p1 < prepare.patch patch -p1 < deadcode2.patch make && mv python python-deadcode2
# run benchmark PYTHONHASHSEED=0 ./python-ref bm_call_simple.py PYTHONHASHSEED=0 ./python-deadcode1 bm_call_simple.py PYTHONHASHSEED=0 ./python-deadcode2 bm_call_simple.py
It suggest you to isolate at least one CPU and run the benchmark on isolated CPUs to get reliable timings:
# run benchmark on the CPU #2 PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
My notes on CPU isolation: http://haypo-notes.readthedocs.org/microbenchmark.html
If you don't want to try CPU isolation, try to get an idle system and/or run the benchmark many times until the standard deviation (the "+/- ..." part) looks small enough...
Don't try to run the microbenchmark without PYTHONHASHSEED=0 or you will get random results depending on the secret hash key used by the randomized hash function. (Or modify the code to spawn enough child process to get an uniform distribution ;-))
I don't expect that you get the same numbers than me. For example, on my laptop, the delta is very small (+/- 1%):
$ PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py Average: 1096.1 ms +/- 12.9 ms (min: 1079.5 ms, max: 1110.3 ms)
$ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py Average: 1109.2 ms +/- 11.1 ms (min: 1095.8 ms, max: 1122.9 ms) => +1% (+13 ms)
$ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py Average: 1072.0 ms +/- 1.5 ms (min: 1070.0 ms, max: 1073.9 ms) => -2% (-24 ms)
CPU of my desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4 physical cores with hyper-threading CPU of my laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz - 2 physical cores with hyper-threading
I modified bm_call_simple.py to call foo() 100 times rather than 20 in the loop to see the issue more easily. I also removed dependencies and changed the output format to display average, standard deviation, minimum and maximum.
For more benchmarks, see attached deadcode1.log and deadcode2.log: results of the CPython benchmark to compare deadcode1 VS reference, and deadcode2 VS reference run on my desktop PC (perf.py --fast & CPU isolation). Again, deadcode1 looks slower in most cases, whereas deadcode2 looks faster in most cases, whereas the difference is still dead code...