On 6/10/2016 9:20 AM, Steven D'Aprano wrote:
> On Fri, Jun 10, 2016 at 01:13:10PM +0200, Victor Stinner wrote:
>> Hi,
>> In recent weeks, I have been researching how to get stable and reliable benchmarks, especially for the corner case of microbenchmarks. The first result is a series of articles; here are the first three:
> Thank you for this! I am very interested in benchmarking.
>> https://haypo.github.io/journey-to-stable-benchmark-system.html
>> https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
>> https://haypo.github.io/journey-to-stable-benchmark-average.html
> I strongly question your statement in the third:
> [quote] But how can we compare performances if results are random? Take the minimum?
> No! You must never (ever again) use the minimum for benchmarking! Compute the average and some statistics like the standard deviation: [end quote]
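For concreteness, the advice being quoted amounts to something like the following sketch using the stdlib statistics module (the timed snippet and the repeat/number counts are arbitrary choices of mine, not from the articles):

    import statistics
    import timeit

    # 20 samples, each the total time for 1000 runs of the snippet.
    timings = timeit.repeat("sorted(range(1000))", repeat=20, number=1000)

    print("min  :", min(timings))
    print("mean :", statistics.mean(timings))
    print("stdev:", statistics.stdev(timings))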
> While I'm happy to see a real-world use for the statistics module, I disagree with your logic.
> The problem is that random noise can only ever slow the code down, it cannot speed it up. To put it another way, the random errors in the timings are always positive.
> Suppose you micro-benchmark some code snippet and get a series of timings. We can model the measured times as:
>     measured time t = T + ε
> where T is the unknown "true" timing we wish to estimate,
For comparative timings, we do not care about T, so arguments about the best estimate of T miss the point. What we do wish to estimate is the relationship between two Ts, T0 for 'control' and T1 for 'treatment', in particular the ratio T1/T0. I suspect Victor is correct that mean(t1)/mean(t0) is better than min(t1)/min(t0) as an estimate of the true ratio T1/T0 (for a particular machine). But given that we have matched pairs of measurements, taken with the same hash seed and addresses, it may be better yet to estimate T1/T0 from the per-condition ratios t1i/t0i, where i indexes the experimental conditions. It has been a long time since I read about estimation of ratios, but what I remember is that it is a nasty subject.

It is also the case that while an individual with one machine wants the best ratio for that machine, we need to make CPython patch decisions for the universe of machines that run Python.
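A quick sketch of the three candidate estimators side by side (the timings are invented; each index i is meant to be a matched pair measured under identical conditions, e.g. the same hash seed):

    import statistics

    t0 = [1.02, 1.10, 1.05, 1.30, 1.04]  # 'control' timings
    t1 = [1.55, 1.70, 1.58, 1.95, 1.60]  # 'treatment' timings

    print("ratio of mins :", min(t1) / min(t0))
    print("ratio of means:", statistics.mean(t1) / statistics.mean(t0))
    # Matched pairs: average the per-condition ratios t1[i]/t0[i].
    print("paired ratios :", statistics.mean(x1 / x0 for x1, x0 in zip(t1, t0)))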
> and ε is some variable error due to noise in the system. But ε is always positive, never negative,
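That model is easy to play with. Here is a toy simulation (assuming, purely for illustration, exponentially distributed ε; nothing above pins down the distribution). With positive-only noise, min(t) creeps down toward T from above, while mean(t) settles near T plus the average noise:

    import random
    import statistics

    T = 1.0  # hypothetical "true" timing
    # eps > 0 with mean 0.1, drawn from an exponential distribution.
    samples = [T + random.expovariate(10.0) for _ in range(1000)]

    print("min :", min(samples))              # just above T
    print("mean:", statistics.mean(samples))  # about T + 0.1

Which of the two you want depends on whether you care about T itself or about typical measured behavior.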
A lognormal might be a first guess. But what we really have is contributions from multiple factors.

--
Terry Jan Reedy