[Python-Dev] Stop using timeit, use perf.timeit!

Steven D'Aprano steve at pearwood.info
Fri Jun 10 09:20:51 EDT 2016

On Fri, Jun 10, 2016 at 01:13:10PM +0200, Victor Stinner wrote:
> Hi,
> In recent weeks, I did research on how to get stable and reliable
> benchmarks, especially for the corner case of microbenchmarks. The
> first result is a series of articles; here are the first three:

Thank you for this! I am very interested in benchmarking.

> https://haypo.github.io/journey-to-stable-benchmark-system.html
> https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
> https://haypo.github.io/journey-to-stable-benchmark-average.html

I strongly question your statement in the third:

    But how can we compare performances if results are random? 
    Take the minimum?

    No! You must never (ever again) use the minimum for 
    benchmarking! Compute the average and some statistics like
    the standard deviation:
    [end quote]

While I'm happy to see a real-world use for the statistics module, I 
disagree with your logic.

The problem is that random noise can only ever slow the code down, it 
cannot speed it up. To put it another way, the random errors in the 
timings are always positive.

Suppose you micro-benchmark some code snippet and get a series of 
timings. We can model the measured times as:

    measured time t = T + ε

where T is the unknown "true" timing we wish to estimate, and ε is some 
variable error due to noise in the system. But ε is always positive, 
never negative, and we always measure something larger than T.

Let's suppose we somehow (magically) know what the epsilons are:

measurements = [T + 0.01, T + 0.02, T + 0.04, T + 0.01]

The average is (4*T + 0.08)/4 = T + 0.02

But the minimum is T + 0.01, which is a better estimate than the 
average. Taking the average means that *worse* epsilons affect your 
estimate, while the minimum means that only the smallest epsilon affects 
your estimate.
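A quick simulation makes this concrete. The true time T and the noise 
distribution below are invented for illustration (an exponential 
distribution is one plausible model of positive-only noise):

```python
import random

random.seed(1)  # reproducible

T = 1.0  # hypothetical "true" time, in seconds
# Noise only ever slows a run down: every epsilon is positive.
noise = [random.expovariate(1 / 0.02) for _ in range(1000)]
measurements = [T + e for e in noise]

mean_est = sum(measurements) / len(measurements)
min_est = min(measurements)

print(f"true T  = {T:.5f}")
print(f"mean    = {mean_est:.5f}")  # approx T + 0.02 (the average epsilon)
print(f"minimum = {min_est:.5f}")   # approx T + (smallest epsilon observed)
```

The more samples you take, the smaller the smallest epsilon gets, so the 
minimum converges toward T while the mean stays stuck at T plus the 
average noise.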

Taking the average is appropriate if the error terms can be positive 
or negative, e.g. if they are *measurement error* rather than noise:

measurements = [T + 0.01, T - 0.02, T + 0.04, T - 0.01]

The average is (4*T + 0.02)/4 = T + 0.005

The minimum is T - 0.02, which is worse than the average.
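The same simulation with symmetric error flips the conclusion. Again, T 
and the Gaussian error model are made up for illustration:

```python
import random

random.seed(2)  # reproducible

T = 1.0  # hypothetical "true" time, in seconds
# Symmetric measurement error: equally likely to be positive or negative.
errors = [random.gauss(0, 0.02) for _ in range(1000)]
measurements = [T + e for e in errors]

mean_est = sum(measurements) / len(measurements)
min_est = min(measurements)

print(f"true T  = {T:.5f}")
print(f"mean    = {mean_est:.5f}")  # errors cancel: converges on T
print(f"minimum = {min_est:.5f}")   # chases the most negative error
```

Here the mean converges on T as errors cancel out, while the minimum 
tracks the single most extreme negative error and only gets worse with 
more samples.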

Unless you have good reason to think that the timing variation is mostly 
caused by some error which can be both positive and negative, the 
minimum is the right statistic to use, not the average. But ask 
yourself: what sort of error, noise or external influence will cause the 
code snippet to run FASTER than the fastest the CPU can execute it?
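In standard-library terms, this is what combining timeit.repeat with 
min() gives you (the snippet timed here is arbitrary):

```python
import timeit

# repeat() returns one total time per repetition; per the reasoning
# above, report the minimum rather than the average.
times = timeit.repeat("sum(range(100))", repeat=5, number=10_000)
print(min(times))
```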

