Re: [Speed] Disable hash randomization to get reliable benchmarks
Hi,
The problem with not cheating on the statistical distribution is that I would like to get quick feedback on my changes, to know whether a change is faster or not. Having to wait 1 hour to check a single change is not really convenient. I prefer to get feedback in less than 5 minutes. Tuning Linux and disabling hash randomization lets me run fewer iterations and so takes less time. If you don't tune anything, you need *a lot* of iterations to reduce the "noise" of the benchmark.
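(To be concrete about the hash part: PYTHONHASHSEED=0 disables hash randomization, so str hashes become reproducible across processes. A minimal sketch, just to illustrate what changes between runs:)

# Minimal sketch: str hashes differ from process to process unless
# hash randomization is disabled with PYTHONHASHSEED=0.
import os
import subprocess
import sys

cmd = [sys.executable, "-c", "print(hash('call_simple'))"]

# Two runs with randomization enabled: usually two different values.
print(subprocess.check_output(cmd).strip())
print(subprocess.check_output(cmd).strip())

# Two runs with PYTHONHASHSEED=0: always the same value.
env = dict(os.environ, PYTHONHASHSEED="0")
print(subprocess.check_output(cmd, env=env).strip())
print(subprocess.check_output(cmd, env=env).strip())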
2016-04-25 8:25 GMT+02:00 Maciej Fijalkowski <fijall@gmail.com>:
The problem with disabled ASLR is that you change the measurement from a statistical distribution to one draw from a statistical distribution, repeated.
The problem is that perf.py only runs one process per benchmark and per Python binary. Let's say that binary A runs with no hash collisions, so all dict accesses succeed at the first probe, whereas binary B runs with many hash collisions and so gets worse performance. Is it fair to compare only these two specific runs? What can we conclude from the result?
Note: the dict type of CPython uses open addressing, so even if two keys have different hash values, a dict lookup may need more than one probe to retrieve an entry.
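(To illustrate the point: the hash is reduced modulo the table size, so keys with different hashes can still land on the same slot and trigger extra probes. A simplified sketch of open addressing — not CPython's real algorithm, which also mixes higher hash bits in via a perturb value:)

# Simplified open-addressing lookup; CPython's real probing also uses
# a "perturb" value, this only keeps the modulo-and-reprobe idea.
def find_slot(table, key):
    size = len(table)                    # power of two, as in CPython dicts
    i = hash(key) % size
    probes = 1
    while table[i] is not None and table[i][0] != key:
        i = (i * 5 + 1) % size           # reprobe until a free/matching slot
        probes += 1
    return i, probes

# hash(1) == 1 and hash(9) == 9: different hashes, but the same slot in
# an 8-slot table, so looking up the second key needs more than one probe.
table = [None] * 8
slot, _ = find_slot(table, 1)
table[slot] = (1, "value")
print(find_slot(table, 9))               # (6, 2): two probes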
Right now, using ASLR and a randomized hash function with perf.py is not fair. I'm talking about running perf.py only once: I'm unable to combine/compare multiple runs of perf.py manually, and getting multiple different results is very confusing for me.
If you want to use ASLR and hash randomization, perf.py must be modified to run multiple processes (sequentially) to get a better statistical distribution. No?
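(Something like the following sketch is what I have in mind — the script and helper are made up, it is not perf.py code: spawn several worker processes, each with its own ASLR layout and hash seed, time the benchmark in each, then aggregate:)

# Hypothetical sketch of running one benchmark in several processes to
# sample the ASLR / hash-seed distribution; not actual perf.py code.
import statistics
import subprocess
import sys

BENCH = """
import timeit
def f(): pass
print(timeit.timeit(f, number=1000000))
"""

def run_workers(python, nproc=5):
    timings = []
    for _ in range(nproc):
        out = subprocess.check_output([python, "-c", BENCH])
        timings.append(float(out.decode()))
    return timings

timings = run_workers(sys.executable)
print("avg=%.4f stdev=%.4f" % (statistics.mean(timings),
                               statistics.stdev(timings)))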
I don't know how many processes we would have to run. According to my quick analysis there are 3 different cases, so the strict minimum would be to run 3 processes for my specific case. For bm_call_simple, the number of loops is 15 with --fast, 150 by default, and 300 with --rigorous. Maybe we should only run one loop iteration per process?
FYI right now, I'm not using perf.py to prove that my patch makes CPython faster, but rather to analyze why my change makes CPython slower :-) It *looks* like Python function calls are between 2 and 7% slower, but it also looks like bm_call_simple is an unstable microbenchmark :-(
There is no way around doing multiple runs and taking an average over them, essentially for the same reason why using the min is much worse than using the average. With ASLR, say the true distribution is 2.0+-0.3; running 5 times you get 1.8, 1.9, 2.2, 2.1, 2.1. Now if you disable ASLR, you get one draw repeated 5 times, which might be 2.0, but might also be 1.8, five times. That just hides the problem but does not actually fix it (because if you touch something, stuff might be allocated in a different order and then you get a different draw).
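(To make that example concrete: averaging the five ASLR draws recovers something close to the underlying 2.0+-0.3 and also gives an estimate of the noise, which a single repeated draw cannot do:)

# The five draws from the example above (ASLR enabled).
import statistics

draws = [1.8, 1.9, 2.2, 2.1, 2.1]
print(statistics.mean(draws))     # 2.02, close to the "true" 2.0
print(statistics.stdev(draws))    # ~0.16, an estimate of the noise

# With ASLR disabled you get one draw repeated five times, e.g.
# [1.8] * 5: the mean can be off by 0.2 and the stdev collapses to 0.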
My practical problem is to get reliable benchmarks. If you don't tune Linux and don't disable hash randomization, the results look "random".
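(For example, the ASLR setting discussed above can at least be checked on Linux without root:)

# Read the Linux ASLR setting: 0 = disabled, 1 = partial,
# 2 = full randomization (the usual default).
with open("/proc/sys/kernel/randomize_va_space") as f:
    print("kernel.randomize_va_space =", f.read().strip())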
Example of output without tuning. I ran "python3 perf.py ../default/python-revert ../default/python-commit -b call_simple -v --fast" 3 times.
[ fast, 15 iterations ]
### call_simple ###
Min: 0.235318 -> 0.247203: 1.05x slower
Avg: 0.237601 -> 0.251384: 1.06x slower
Significant (t=-6.32)
Stddev: 0.00214 -> 0.00817: 3.8069x larger

### call_simple ###
Min: 0.234191 -> 0.247109: 1.06x slower
Avg: 0.234660 -> 0.247480: 1.05x slower
Significant (t=-36.14)
Stddev: 0.00102 -> 0.00093: 1.0967x smaller

### call_simple ###
Min: 0.235790 -> 0.247089: 1.05x slower
Avg: 0.238978 -> 0.247562: 1.04x slower
Significant (t=-9.38)
Stddev: 0.00342 -> 0.00094: 3.6504x smaller
You ask to ignore the Min line. Ok. But the Avg lines say 1.04x, 1.05x and 1.06x slower. Which one is the "good" result? :-)
Usually, the spread is even larger, for example between 1.02x slower and 1.07x slower.
[ rigorous, 300 iterations ]
The --fast option of perf.py is just a toy, right? Serious devs must use the super-slow --rigorous mode! Ok, let's try it.
### call_simple ###
Min: 0.234102 -> 0.248098: 1.06x slower
Avg: 0.236218 -> 0.254318: 1.08x slower
Significant (t=-30.32)
Stddev: 0.00561 -> 0.00869: 1.5475x larger

### call_simple ###
Min: 0.234109 -> 0.248024: 1.06x slower
Avg: 0.240069 -> 0.255194: 1.06x slower
Significant (t=-15.21)
Stddev: 0.01126 -> 0.01304: 1.1584x larger

### call_simple ###
Min: 0.235272 -> 0.248225: 1.06x slower
Avg: 0.244106 -> 0.258349: 1.06x slower
Significant (t=-13.27)
Stddev: 0.00830 -> 0.01663: 2.0053x larger
Again, I ignore the Min line. Average: hmm... is it 1.06x or 1.08x slower? For me, it's not really the same :-/
In the reference Python, the average timing varies between 0.236 and 0.244 across the 3 runs. Since the difference between the reference and the patched Python is tiny, the stability of the microbenchmark matters.
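(Concretely, pooling the three --rigorous reference averages shows run-to-run drift of the same order as the effect I'm trying to measure:)

# Reference (unpatched) averages from the three --rigorous runs above.
import statistics

ref = [0.236218, 0.240069, 0.244106]
drift = (max(ref) - min(ref)) / min(ref)
print("mean = %.6f" % statistics.mean(ref))
print("run-to-run drift = %.1f%%" % (drift * 100))    # about 3.3%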
FYI I'm running benchmarks on my desktop PC. It's more convenient to run benchmarks locally than to transfer changes to a hypothetical dedicated benchmark server (tuned to run more reliable (micro)benchmarks). Since benchmarks are slow, I'm doing something else while the benchmark is running.
Victor