On Tue, Apr 26, 2016 at 11:46 AM, Victor Stinner firstname.lastname@example.org wrote:
2016-04-26 10:56 GMT+02:00 Armin Rigo email@example.com:
On 25 April 2016 at 08:25, Maciej Fijalkowski firstname.lastname@example.org wrote:
The problem with disabled ASLR is that you change the measurement from sampling a statistical distribution to repeatedly taking one fixed draw from that distribution. There is no way around doing multiple runs and averaging over them.
You should mention that it is usually enough to do the following: instead of running once with PYTHONHASHSEED=0, run five or ten times with PYTHONHASHSEED in range(5 or 10). In this way you get all the benefits: not-too-long benchmarking, no randomness, but still some statistically relevant sampling.
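Something like this, roughly (a sketch only; "bench.py" stands in for whatever script prints a single time-per-run in seconds on stdout):

    import os
    import statistics
    import subprocess

    timings = []
    for seed in range(5):
        # Fixed, but different, hash seed for each child interpreter.
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        out = subprocess.check_output(["python", "bench.py"], env=env)
        timings.append(float(out.decode()))

    print("mean %.6f s, stdev %.6f s"
          % (statistics.mean(timings), statistics.stdev(timings)))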
I guess that the number of runs required to get a nice distribution depends on the size of the largest dictionary in the benchmark. I mean, the dictionaries that matter for performance.
The best would be to handle this transparently in perf.py. Either disable all sources of randomness, or run multiple processes to sample the distribution, rather than having only one sample for one specific config. Maybe it could be an option: by default, run multiple processes, but have an option to only run one process using PYTHONHASHSEED=0.
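Something along these lines, say (a rough sketch of the idea, not perf.py's actual interface; run_bench and the bm_call_simple.py command line are invented for illustration):

    import os
    import statistics
    import subprocess

    def run_bench(cmd, processes=10, deterministic=False):
        # Default: one child per hash seed, to sample the distribution.
        # Deterministic option: a single child with PYTHONHASHSEED=0.
        seeds = [0] if deterministic else range(processes)
        samples = []
        for seed in seeds:
            env = dict(os.environ, PYTHONHASHSEED=str(seed))
            out = subprocess.check_output(cmd, env=env)
            samples.append(float(out.decode()))
        return samples

    samples = run_bench(["python", "bm_call_simple.py"])
    print("mean %.6f s, stdev %.6f s"
          % (statistics.mean(samples), statistics.stdev(samples)))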
By the way, timeit has a very similar issue. I'm quite sure that most Python developers run "python -m timeit ..." at least 3 times and take the minimum. Maybe "python -m timeit" could be modified to spawn child processes as well, to get a better distribution, and to display the minimum, the average and the standard deviation (not only the minimum)?
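For instance, roughly (again only a sketch; timeit has no such mode today, and the statement being timed and the run counts are placeholders):

    import statistics
    import subprocess
    import sys

    STMT = "sum(range(1000))"  # placeholder statement to time
    samples = []
    for _ in range(5):
        # Each child interpreter gets a fresh hash seed and address
        # space layout, so the samples come from the real distribution.
        out = subprocess.check_output(
            [sys.executable, "-c",
             "import timeit; print(timeit.timeit(%r, number=100000))" % STMT])
        samples.append(float(out.decode()))

    print("min %.3g  mean %.3g  stdev %.3g"
          % (min(samples), statistics.mean(samples),
             statistics.stdev(samples)))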
Taking the minimum is a terrible idea anyway; none of the statistical discussion makes sense if you do that.
Well, the question is also whether it's a good thing to have such a tiny microbenchmark as bm_call_simple in the Python benchmark suite. I spent 2 or 3 days analyzing CPython running bm_call_simple with the Linux perf tool, callgrind and cachegrind, and I'm still unable to understand the link between my changes to the C code and the results. IMHO this specific benchmark depends on very low-level things like the CPU L1 cache. Maybe bm_call_simple helps in some very specific use cases, like trying to make Python function calls faster. But in other cases, it can be a source of noise, confusion and frustration...
Maybe it's just a terrible benchmark (it surely is for PyPy, for example).