[issue45261] Unreliable (?) results from timeit (cache issue?)

STINNER Victor report at bugs.python.org
Wed Sep 22 06:26:07 EDT 2021


STINNER Victor <vstinner at python.org> added the comment:

PyPy emits a warning when the timeit module is used and suggests using pyperf instead.

timeit reports the minimum of the measured values, whereas pyperf reports the average (arithmetic mean).
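To illustrate the difference using only the stdlib (a rough sketch, not pyperf's actual implementation):

    import statistics
    import timeit

    # timeit.repeat() returns the raw values; the choice of aggregate
    # changes the reported result.
    values = timeit.repeat("sum(range(1000))", repeat=5, number=10_000)

    print("min :", min(values))              # what "python -m timeit" reports
    print("mean:", statistics.mean(values))  # what pyperf reports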

timeit runs in a single process, whereas pyperf spawns 21 processes: 1 just for loop calibration and 20 to compute values.

timeit computes 5 values, whereas pyperf computes 60 values.

timeit uses all computed values, whereas pyperf ignores the first value, which is treated as a "warmup value" (the number of warmup values can be configured).
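With the pyperf API, these knobs can be set on the Runner; something like the following sketch (keyword names from memory, check the pyperf docs for the exact Runner options):

    # bench.py -- sketch of a pyperf script; the keyword arguments below
    # only mirror the defaults described above and may need adjusting.
    import pyperf

    runner = pyperf.Runner(processes=20, values=3, warmups=1)
    runner.timeit("sum_range", stmt="sum(range(1000))")

Running "python3 bench.py" then spawns the worker processes and prints the mean with its standard deviation.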

timeit doesn't compute the standard deviation, whereas pyperf does. The standard deviation gives an idea of whether the benchmark looks reliable or not. IMO, results without a standard deviation should not be trusted.
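Again a stdlib-only sketch (the 10% threshold below is an arbitrary example, not pyperf's actual heuristic):

    import statistics
    import timeit

    values = timeit.repeat("sum(range(1000))", repeat=20, number=10_000)
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)

    # Similar in spirit to the "Mean +- std dev" line that pyperf prints.
    print(f"{mean:.6f} +- {stdev:.6f} sec")
    if stdev > mean * 0.10:
        print("warning: values are unstable, the result may not be reliable")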

pyperf also emits a warning when a benchmark doesn't look reliable, for example if the user ran other workloads while the benchmark was running.

pyperf also supports storing results in a JSON file, which contains all values as well as metadata.
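For example (commands from memory, see the pyperf docs for details):

    $ python3 -m pyperf timeit "sum(range(1000))" -o bench.json
    $ python3 -m pyperf stats bench.json   # mean, std dev, min/max, ...
    $ python3 -m pyperf dump bench.json    # every individual value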

I cannot force people to stop using timeit, but there are reasons why pyperf is more reliable than timeit.

Benchmarking is hard. See the pyperf documentation, which gives hints on how to get reproducible benchmark results:
https://pyperf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproducible-benchmark-results
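For example, if I recall correctly, pyperf can tune the system for benchmarking (CPU frequency scaling, etc.) with:

    $ python3 -m pyperf system tune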

Also read this important article ;-)
"Biased Benchmarks (honesty is hard)"
http://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45261>
_______________________________________

