[Speed] Disable hash randomization to get reliable benchmarks

Maciej Fijalkowski fijall at gmail.com
Tue Apr 26 12:28:32 EDT 2016


On Tue, Apr 26, 2016 at 11:46 AM, Victor Stinner
<victor.stinner at gmail.com> wrote:
> Hi,
>
> 2016-04-26 10:56 GMT+02:00 Armin Rigo <arigo at tunes.org>:
>> Hi,
>>
>> On 25 April 2016 at 08:25, Maciej Fijalkowski <fijall at gmail.com> wrote:
>>> The problem with disabling ASLR is that you change the measurement
>>> from sampling a statistical distribution to repeatedly taking a
>>> single draw from that distribution. There is no way around doing
>>> multiple runs and averaging them.
>>
>> You should mention that it is usually enough to do the following:
>> instead of running once with PYTHONHASHSEED=0, run five or ten times
>> with PYTHONHASHSEED in range(5 or 10).  That way you get all the
>> benefits: not-too-long benchmarking, no randomness, but still some
>> statistically relevant sampling.
>
> I guess that the number of runs required to get a nice distribution
> depends on the size of the largest dictionary in the benchmark, that
> is, the dictionaries that matter for performance.
>
> The best would be to handle this transparently in perf.py: either
> disable all sources of randomness, or run multiple processes to get a
> uniform distribution, rather than only having one sample for one
> specific config. Maybe it could be an option: by default, run multiple
> processes, but have an option to only run one process using
> PYTHONHASHSEED=0.
>
> By the way, timeit has a very similar issue. I'm quite sure that most
> Python developers run "python -m timeit ..." at least 3 times and take
> the minimum. "python -m timeit" could maybe be modified to also spawn
> child processes to get a better distribution, and to display the
> minimum, the average and the standard deviation (not only the
> minimum)?
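
Armin's range(5 or 10) suggestion above amounts to a sweep over hash
seeds in child processes, which is roughly what perf.py would have to
do transparently. A minimal sketch, assuming a stand-alone benchmark
script; the name "bench.py" and the seed count are placeholders, not
anything perf.py actually provides:

    import os
    import subprocess
    import sys

    # Run the same benchmark once per hash seed instead of relying on a
    # single PYTHONHASHSEED=0 run; each child process gets the seed
    # pinned in its environment.
    for seed in range(5):
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        subprocess.check_call([sys.executable, "bench.py"], env=env)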

Taking the minimum is a terrible idea anyway; none of the statistical
discussion makes sense if you do that.
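
On the reporting side, a sketch of what "display the minimum, the
average and the standard deviation" could look like with plain
timeit.repeat; the timed statement and repeat counts are arbitrary
placeholders, and this by itself does nothing about hash randomization:

    import math
    import timeit

    # Ten repeats of the same timed statement; report min, mean and the
    # sample standard deviation instead of only the minimum.
    runs = timeit.repeat("sorted(range(1000))", repeat=10, number=1000)
    mean = sum(runs) / len(runs)
    var = sum((r - mean) ** 2 for r in runs) / (len(runs) - 1)
    stdev = math.sqrt(var)
    print("min %.6f  mean %.6f  stdev %.6f" % (min(runs), mean, stdev))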

>
> Well, the question is also whether it's a good thing to have such a
> tiny microbenchmark as bm_call_simple in the Python benchmark suite.
> I spent 2 or 3 days analyzing CPython running bm_call_simple with the
> Linux perf tool, callgrind and cachegrind. I'm still unable to
> understand the link between my changes to the C code and the result.
> IMHO this specific benchmark depends on very low-level things like
> the CPU L1 cache.  Maybe bm_call_simple helps in some very specific
> use cases, like trying to make Python function calls faster. But in
> other cases, it can be a source of noise, confusion and frustration...
>
> Victor

Maybe it's just a terrible benchmark (it surely is for PyPy, for
example).

