Re: [Speed] Disable hash randomization to get reliable benchmarks
On Tue, 26 Apr 2016 18:28:32 +0200 Maciej Fijalkowski <fijall@gmail.com> wrote:
> taking the minimum is a terrible idea anyway, none of the statistical discussion makes sense if you do that
The minimum is a reasonable metric for quick throwaway benchmarks as timeit is designed for, as it has a better hope of alleviating the impact of system load (as such throwaway benchmarks are often run on the developer's workstation).
For a persistent benchmarks suite, where we can afford longer benchmark runtimes and are able to keep system noise to a minimum, we might prefer another metric.
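For illustration, this is the pattern the timeit documentation itself suggests for quick measurements (the statement and loop counts below are arbitrary, just a stand-in workload):

    import timeit

    NUMBER = 10000  # loops per repeat; arbitrary workload size
    # timeit.repeat() returns one total time per repeat; taking the min
    # keeps the run least disturbed by other processes on the machine.
    timings = timeit.repeat("sum(range(1000))", repeat=5, number=NUMBER)
    print("best of 5: %.2f usec per loop" % (min(timings) / NUMBER * 1e6))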
Regards
Antoine.
On Tue, Apr 26, 2016 at 6:36 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
> On Tue, 26 Apr 2016 18:28:32 +0200 Maciej Fijalkowski <fijall@gmail.com> wrote:
>> taking the minimum is a terrible idea anyway, none of the statistical discussion makes sense if you do that
>
> The minimum is a reasonable metric for quick throwaway benchmarks as timeit is designed for, as it has a better hope of alleviating the impact of system load (as such throwaway benchmarks are often run on the developer's workstation).
>
> For a persistent benchmarks suite, where we can afford longer benchmark runtimes and are able to keep system noise to a minimum, we might prefer another metric.
>
> Regards
> Antoine.
No, it's not, Antoine. The minimum is not better than one random measurement.
We had this discussion before, but you guys are happily dismissing all the papers written on the subject. It *does* get rid of random system noise, but it *also* gets rid of all the effects related to gc/malloc/caches and the infinite details that do not behave in the same predictable fashion.
2016-04-26 18:36 GMT+02:00 Antoine Pitrou <solipsis@pitrou.net>:
> The minimum is a reasonable metric for quick throwaway benchmarks as timeit is designed for, as it has a better hope of alleviating the impact of system load (as such throwaway benchmarks are often run on the developer's workstation).
IMHO we must at least display the standard deviation. Maybe we can do better and provide 4 numbers:
- Average
- Standard deviation
- Minimum
- Maximum
The maximum helps to detect rare events like Maciej said (something in the OS, GC collection, etc.).
For example, we can use this format:
Average: 293.5 ms +/- 143.2 ms (min: 213.9 ms, max: 629.7 ms)
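For illustration, a minimal sketch producing such a line with the statistics module (the per-run timings below are made up for the example; only the min and max are taken from the output above):

    import statistics

    # Illustrative per-run timings in seconds (made up for the example)
    timings = [0.2139, 0.2253, 0.2297, 0.2241, 0.6297]
    avg = statistics.mean(timings)
    dev = statistics.stdev(timings)
    print("Average: %.1f ms +/- %.1f ms (min: %.1f ms, max: %.1f ms)"
          % (avg * 1e3, dev * 1e3, min(timings) * 1e3, max(timings) * 1e3))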
That line comes from the same microbenchmark as before, bm_call_simple.py, run on my laptop. As you can see, the deviation is large: 143 ms / 293 ms is 49%, so the benchmark is unstable. Maybe we should say explicitly that the result is not significant? Example:
Average: 293.5 ms +/- 143.2 ms (min: 213.9 ms, max: 629.7 ms) -- not significant

The benchmark is unstable; maybe the system is heavily loaded?
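One possible rule for such a flag, sketched here with an arbitrary 10% threshold on the relative standard deviation (the threshold is an assumption, not something measured):

    def is_significant(mean, stdev, max_rel_dev=0.10):
        # Hypothetical rule: the result counts as significant only when
        # the standard deviation stays below 10% of the average.
        return stdev <= max_rel_dev * mean

    is_significant(0.2935, 0.1432)  # -> False: 49% deviation, unstable
    is_significant(0.2195, 0.0016)  # -> True: 0.7% deviation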
By the way, "293.5 ms +/- 143.2 ms" is misleading. Maybe we should display it as "0.3 sec +/- 0.1 sec" so as not to show inaccurate digits?
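One way to do that, sketched below, is to round both numbers to the first significant digit of the standard deviation:

    import math

    def format_timing(mean, stdev):
        # Keep only the digits that the measurement error supports
        # (assumes stdev > 0).
        digits = -int(math.floor(math.log10(stdev)))
        return "%.*f sec +/- %.*f sec" % (digits, mean, digits, stdev)

    format_timing(0.2935, 0.1432)  # -> '0.3 sec +/- 0.1 sec'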
Another example, same laptop but using CPU isolation:
Average: 219.5 ms +/- 1.6 ms (min: 215.9 ms, max: 223.8 ms)
In this example, we can see that "+/- 1.6" is the standard deviation; it is unrelated to the minimum and maximum.
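For reference, a minimal sketch of what CPU isolation means here, assuming a Linux box booted with isolcpus=3 so that core 3 runs no other tasks (the CPU number is an assumption for illustration):

    import os

    # Pin the current process (pid 0) to the isolated core; Linux-only API.
    os.sched_setaffinity(0, {3})
    # ... run the benchmark loop here, free of scheduler noise ...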
Victor
participants (3):
- Antoine Pitrou
- Maciej Fijalkowski
- Victor Stinner