[Speed] Disable hash randomization to get reliable benchmarks
Victor Stinner
victor.stinner at gmail.com
Sun Apr 24 18:49:20 EDT 2016
Hi,
Over the last few months, I spent a lot of time on microbenchmarks. Probably too
much time :-) I found a great Linux config that gives a much more stable
system and reliable microbenchmarks:
https://haypo-notes.readthedocs.org/microbenchmark.html
* isolate some CPU cores
* force CPU to performance
* disable ASLR
* block IRQ on isolated CPU cores
With this Linux config, the system load doesn't impact benchmark results at all.
Over the last few days, I almost lost my mind trying to figure out why a very tiny
change in C code made a benchmark up to 8% slower.
My main issue was getting a reliable benchmark, since running the same
microbenchmark using perf.py gave me "random" results.
I ended up running the underlying script, bm_call_simple.py, directly:
taskset -c 7 ./python ../benchmarks/performance/bm_call_simple.py -n 5 --timer perf_counter
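For readers who prefer to pin from inside Python rather than with taskset, here is a minimal sketch. It assumes a Linux system where os.sched_setaffinity is available; the helper name pin_to_cpu is hypothetical and is not part of perf.py or the benchmark suite.

```python
import os


def pin_to_cpu(cpu):
    """Try to pin the current process to a single CPU.

    Returns True on success, False if the platform lacks the call
    or the kernel refuses the requested CPU.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False  # the call is not available on this platform
    try:
        os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    except OSError:
        return False
    return True
```

Calling pin_to_cpu(7) at the top of a benchmark script would roughly mirror "taskset -c 7", provided CPU 7 exists and is allowed for the process.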
In a single run, the timings of each loop iteration are very stable. Example:
0.22682707803323865
0.22741253697313368
0.227521265973337
0.22750743699725717
0.22752994997426867
0.22753606992773712
0.22742654103785753
0.22750875598285347
0.22752253606449813
0.22718404198531061
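The spread of the ten timings above can be quantified with a short sketch using only the standard statistics module. The variable names are mine; the values are copied from the list above.

```python
import statistics

# Per-iteration timings (seconds) copied from the single run above.
timings = [
    0.22682707803323865,
    0.22741253697313368,
    0.227521265973337,
    0.22750743699725717,
    0.22752994997426867,
    0.22753606992773712,
    0.22742654103785753,
    0.22750875598285347,
    0.22752253606449813,
    0.22718404198531061,
]

mean = statistics.mean(timings)
stdev = statistics.stdev(timings)  # sample standard deviation
# Distance between the slowest and fastest iteration, relative to the mean.
relative_spread = (max(timings) - min(timings)) / mean

print(f"mean={mean:.6f} stdev={stdev:.6f} spread={relative_spread:.2%}")
```

The relative spread within this run is well under 1%, which is what makes the differences between runs shown next so surprising.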
Problem: each new run gives a different result. Example:
* run 1: 0.226...
* run 2: 0.255...
* run 3: 0.248...
* run 4: 0.258...
* etc.
I saw 3 groups of values: ~0.226, ~0.248, ~0.255.
I didn't understand how running the same program could give such different
results. The answer is the randomization of the Python hash function.
Aaaaaaah! The last source of entropy in my microbenchmark!
The performance difference can be seen by forcing a specific hash seed:
PYTHONHASHSEED=2 => 0.254...
PYTHONHASHSEED=1 => 0.246...
PYTHONHASHSEED=5 => 0.228...
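To see that the seed really changes what the interpreter computes, here is a small sketch that asks a child interpreter for the hash of a string under a given PYTHONHASHSEED. The helper name str_hash_with_seed and the sample string are made up for illustration. My assumption, not something measured in this mail, is that different string hashes change the layout of dicts and therefore the timings.

```python
import os
import subprocess
import sys


def str_hash_with_seed(seed, text="call_simple"):
    """Return hash(text) as computed by a child Python started with
    PYTHONHASHSEED set to the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
    )
    return int(out)


for seed in (1, 2, 5):
    print(seed, str_hash_with_seed(seed))
```

The same seed always gives the same hash, while different seeds give different hashes for the same string.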
Sadly, perf.py and timeit don't disable hash randomization for me. I
hacked perf.py to set PYTHONHASHSEED=0 and magically the result became
super stable!
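This is not the actual change I made to perf.py, only a minimal sketch of the idea: start the benchmark process with PYTHONHASHSEED=0 in its environment, which disables hash randomization. The function name run_with_fixed_hash_seed is hypothetical.

```python
import os
import subprocess
import sys


def run_with_fixed_hash_seed(cmd, seed=0):
    """Run cmd in a child process with PYTHONHASHSEED forced to seed.

    A seed of 0 disables hash randomization in the child interpreter.
    Returns the subprocess.CompletedProcess with captured text output.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    return subprocess.run(
        cmd, env=env, check=True, capture_output=True, text=True
    )


# Two separate interpreters now agree on the hash of a string.
cmd = [sys.executable, "-c", "print(hash('abc'))"]
first = run_with_fixed_hash_seed(cmd).stdout
second = run_with_fixed_hash_seed(cmd).stdout
print(first == second)
```

The environment is copied rather than modified in place, so the parent process keeps its own hash seed.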
Multiple runs of the command:
$ taskset_isolated.py python3 perf.py ../default/python-ref ../default/python -b call_simple --fast
Outputs:
### call_simple ###
Min: 0.232621 -> 0.247904: 1.07x slower
Avg: 0.232628 -> 0.247941: 1.07x slower
Significant (t=-591.78)
Stddev: 0.00001 -> 0.00010: 13.7450x larger
### call_simple ###
Min: 0.232619 -> 0.247904: 1.07x slower
Avg: 0.232703 -> 0.247955: 1.07x slower
Significant (t=-190.58)
Stddev: 0.00029 -> 0.00011: 2.6336x smaller
### call_simple ###
Min: 0.232621 -> 0.247903: 1.07x slower
Avg: 0.232629 -> 0.247918: 1.07x slower
Significant (t=-5896.14)
Stddev: 0.00001 -> 0.00001: 1.3350x larger
Even with --fast, the result is *very* stable. Note the very low
standard deviation. In 3 runs, I got exactly the same "1.07x". The average
timings agree to 4 digits, within +/-1 on the last digit!
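As a quick sanity check of the "1.07x" figure, the ratio can be recomputed from the Min values of the three runs. This is only a sketch of the arithmetic; I am not claiming it is how perf.py formats its output.

```python
def slowdown(ref, changed):
    """Ratio of the changed timing to the reference timing."""
    return changed / ref


# (reference Min, changed Min) pairs copied from the three outputs above.
runs = [
    (0.232621, 0.247904),
    (0.232619, 0.247904),
    (0.232621, 0.247903),
]

labels = [f"{slowdown(ref, changed):.2f}x slower" for ref, changed in runs]
print(labels)
```

All three pairs round to the same 1.07 ratio.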
No need to use the ultra slow --rigorous option. This option is
probably designed to hide the noise of a very unstable system. But
using my Linux config, it doesn't seem to be needed anymore, at least
on this very specific microbenchmark.
Ok, now I can investigate why my change to the C code introduced a
performance regression :-D
Victor