[Speed] External sources of noise changing call_simple "performance"

Victor Stinner victor.stinner at gmail.com
Tue May 17 17:11:50 EDT 2016


Hi,

I'm still (!) investigating why the call_simple benchmark (ok, let's be
honest: the *micro*benchmark) gets different results from one run to the
next. So far, the sources of noise I have identified are:

(*) Collisions in hash tables: perf.py already runs the benchmark with
PYTHONHASHSEED=1 so that every run tests the same hash function. A more
generic solution is to use multiple processes to test multiple hash
seeds and get a more uniform distribution (see the sketch just after
this list).

(*) System load => CPU isolation, disabling ASLR, setting the CPU
affinity of IRQs, etc. work around this issue (concrete commands are
sketched below) --
http://haypo-notes.readthedocs.io/microbenchmark.html

(*) CPU heat => disabling CPU Turbo Mode works around this issue

(*) Locale, size of the command line and/or the current working
directory => WTF?!
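
A sketch of the multi-seed idea from the first point (the paths are from
my local setup, and perf.py would still have to aggregate the results
itself):
---
# Run the microbenchmark under several hash seeds and compare the spread.
for seed in 1 2 3 4 5; do
    env -i PYTHONHASHSEED=$seed taskset -c 3 \
        ../fastcall/pgo/python performance/bm_call_simple.py \
        -n 2 --timer perf_counter
done
---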

The examples below were run on a system tuned to get reliable benchmarks.
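The tuning boils down to something like this (exact paths and settings
depend on the kernel and CPU, see the link above for details):
---
# Disable ASLR (as root)
sysctl -w kernel.randomize_va_space=0
# Disable Turbo Mode on Intel CPUs using the intel_pstate driver (as root)
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# Isolate CPU 3 from the scheduler: boot with the kernel parameter isolcpus=3
# and run the benchmark on that CPU with taskset -c 3 (as in the examples).
# IRQs can be moved away from the isolated CPU via /proc/irq/*/smp_affinity.
---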

Example 1 using different locales:
---
$ env -i PYTHONHASHSEED=1 LANG=$LANG taskset -c 3
../fastcall/pgo/python performance/bm_call_simple.py -n 2 --timer
perf_counter
0.1914542349995827
0.1914668690005783

$ env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python
performance/bm_call_simple.py -n 2 --timer perf_counter
0.2037885540003117
0.20376207399931445
---

Example 2 using a different command line (the "xxx" is ignored but
changes the benchmark result):
---
$ env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python
performance/bm_call_simple.py -n 2 --timer perf_counter
0.20377227199969639
0.20376165899961052

$ env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python
performance/bm_call_simple.py -n 2 --timer perf_counter xxx
0.20814169400000537
0.20804374700037442
---

=> My bet is that the locale, the current working directory, the
command line, etc. impact how heap memory is allocated, and this
specific benchmark depends on the locality of the memory allocated on
the heap...
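
One quick way to check that the environment at least shifts where
objects land on the heap (a rough sanity check, not a proof of the
locality hypothesis): with ASLR disabled, print the addresses of a few
objects allocated right after startup and compare runs with different
environments, for example:
---
$ env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python \
    -c "print([hex(id(o)) for o in (object(), [], {})])"

$ env -i PYTHONHASHSEED=1 LANG=$LANG taskset -c 3 ../fastcall/pgo/python \
    -c "print([hex(id(o)) for o in (object(), [], {})])"
---
If the addresses differ between the two commands, the environment really
does change the heap layout seen by the benchmark.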

For a microbenchmark, 191 ms, 203 ms or 208 ms are not the same
numbers... Such a subtle difference changes the final "NNNx slower"
or "NNNx faster" line reported by perf.py.

I tried different values of the $LANG environment variable and different
command line lengths. When the performance decreases, the
stalled-cycles-frontend Linux perf event increases, and the LLC-loads
event increases as well.
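
These counters can be watched with a command along these lines (exact
event names depend on the CPU and on the kernel version):
---
$ perf stat -e cycles,instructions,stalled-cycles-frontend,LLC-loads,LLC-load-misses \
    -- env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python \
    performance/bm_call_simple.py -n 2 --timer perf_counter
---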

=> The performance of the benchmark depends on how it uses the low-level
memory caches (L1, L2, L3).

I understand that in some cases more of the memory fits into the fastest
caches, and so the benchmark is faster. But sometimes not all of the
memory fits, and so the benchmark is slower.

Maybe the problem is that the memory is laid out close to memory page
boundaries, or doesn't fit nicely into L1 cache lines, or something like
that.
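
One way to test that guess would be to also watch the L1 and TLB
counters and see whether they move together with the timings, for
example (again, event availability depends on the CPU):
---
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses,dTLB-loads,dTLB-load-misses \
    -- env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python \
    performance/bm_call_simple.py -n 2 --timer perf_counter
---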

Victor

