Re: [Speed] New CPython benchmark suite based on perf
On Mon, 4 Jul 2016 22:51:11 +0200 Victor Stinner victor.stinner@gmail.com wrote:
2016-07-04 19:49 GMT+02:00 Antoine Pitrou solipsis@pitrou.net:
Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower
That doesn't sound like a terrific idea. Why do you think the median gives a more interesting figure here?
When the distribution is symmetric, mean and median are the same. In my experience with Python benchmarks, the curve is usually skewed: the right tail is much longer.
When the system noise is high, the skewness is much larger. In that case, the median looks "more correct".
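To illustrate the skew described here, a small sketch with invented timings (the numbers are made up, not from a real benchmark): the median stays near the bulk of the samples, while the mean is pulled toward the long right tail.

```python
import statistics

# Invented timings (seconds): a right-skewed distribution where a few
# noisy runs inflate the right tail.
samples = [0.250, 0.251, 0.252, 0.253, 0.255, 0.258, 0.262, 0.270, 0.310, 0.420]

# The mean is dragged upward by the two slow outliers;
# the median stays close to the typical run time.
print(f"mean:   {statistics.mean(samples):.3f} s")
print(f"median: {statistics.median(samples):.3f} s")
```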
It "looks" more correct?
Let's say your Python implementation has a flaw: it is almost always fast, but every 10 runs, it becomes 3x slower. Taking the mean will reflect the occasional slowness. Taking the median will completely hide it.
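This scenario can be checked directly (the 0.100 s base time is invented for illustration): the mean moves when every 10th run is 3x slower, while the median does not move at all.

```python
import statistics

base = 0.100  # seconds for a "fast" run (invented value)
# 9 fast runs, then one 3x-slower run, repeated 5 times: 50 samples.
samples = ([base] * 9 + [base * 3]) * 5

print(f"mean:   {statistics.mean(samples):.3f}")    # 0.120 -> slowdown shows up
print(f"median: {statistics.median(samples):.3f}")  # 0.100 -> slowdown is invisible
```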
Then of course, since you have several processes and several runs per process, you could try something more convoluted, such as mean-of-medians or mean-of-mins or...
However, if you're concerned by system noise, there may be other ways to avoid it. For example, measure both CPU time and wall time, and if CPU time < 0.9 * wall time (for example), ignore the number and take another measurement.
(this assumes all benchmarks are CPU-bound - which they should be here
- and single-threaded - which they *probably* are, except in a hypothetical parallelizing Python implementation ;-)))
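That filter could be sketched as follows. The function name, the 0.9 threshold and the retry policy are illustrative, not part of perf: the idea is just that a large gap between CPU time and wall time suggests the process was descheduled, so the measurement is discarded.

```python
import time

def timed_call(func, threshold=0.9, max_tries=10):
    """Measure func()'s wall time, retrying when CPU time falls below
    threshold * wall time (a sign of system noise / descheduling).
    Name, threshold and retry policy are illustrative, not the perf API."""
    for _ in range(max_tries):
        wall0 = time.perf_counter()
        cpu0 = time.process_time()
        func()
        wall = time.perf_counter() - wall0
        cpu = time.process_time() - cpu0
        if cpu >= threshold * wall:
            return wall
    # Still too noisy after max_tries: return the last measurement anyway.
    return wall
```

This only makes sense for CPU-bound, single-threaded code, as noted above: an I/O-bound benchmark would always fail the check.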
Regards
Antoine.
2016-07-05 10:08 GMT+02:00 Antoine Pitrou solipsis@pitrou.net:
When the system noise is high, the skewness is much larger. In this case, median looks "more correct".
It "looks" more correct?
My main worry is to get reproducible "stable" benchmark results. I started to work on perf because most results of the CPython benchmark suite just looked like pure noise. It became very hard for me to decide whether it was my fault, or whether my change really made Python slower or faster. I'm not talking about specific benchmarks which are obviously much faster or much slower, but about all small changes between -5% and +5%.
It looks like median helps to reduce the effect of outliers.
Let's say your Python implementation has a flaw: it is almost always fast, but every 10 runs, it becomes 3x slower. Taking the mean will reflect the occasional slowness. Taking the median will completely hide it.
I'm not sure that the median will completely hide such behaviour. Moreover, I modified the benchmark suite to always display the standard deviation just after the median. The standard deviation should help to detect a large variation.
In practice, you almost never get all samples with the same value. There is always a statistical distribution, usually a Gaussian curve. The question is what is the best way to "summarize" a curve with two numbers. I add a constraint: I also want to reduce the system noise.
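As a sanity check on the "every 10th run is 3x slower" scenario from earlier in the thread (times invented): even though the median alone hides the slow runs, the stdev/median ratio is far above a ~10% stability threshold, so the instability would still be flagged.

```python
import statistics

# The "flawed implementation" samples: every 10th run is 3x slower.
samples = ([0.100] * 9 + [0.300]) * 5

median = statistics.median(samples)
stdev = statistics.stdev(samples)
# The ratio is around 60%, far above a ~10% stability threshold,
# so the hidden slow runs still show up as instability.
print(f"median +- stdev: {median:.3f} s +- {stdev:.3f} s")
print(f"stdev / median:  {stdev / median:.0%}")
```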
Then of course, since you have several processes and several runs per process, you could try something more convoluted, such as mean-of-medians or mean-of-mins or...
I don't know these functions. I also prefer to consider each sample individually and only apply a function to the whole series of samples.
However, if you're concerned by system noise, there may be other ways to avoid it. For example, measure both CPU time and wall time, and if CPU time < 0.9 * wall time (for example), ignore the number and take another measurement.
(this assumes all benchmarks are CPU-bound - which they should be here
- and single-threaded - which they *probably* are, except in a hypothetical parallelizing Python implementation ;-)))
CPU isolation helps a lot to reduce the system noise, but it requires "complex" system tuning. I don't expect that users will use it, especially users of timeit.
I don't think that CPU time is generic enough to put it in the perf module. I would prefer to not restrict myself to CPU-bound benchmarks.
But the perf module already warns users when it detects that the benchmark looks too unstable. See the example at the end of: http://perf.readthedocs.io/en/latest/perf.html#runs-samples-warmups-outter-a...
Or try: "python3 -m perf.timeit --loops=10 pass".
Currently, I'm checking the shortest raw sample (must be >= 1 ms) and the standard deviation / median ratio (must be < 10%).
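Those two checks could be sketched like this. The thresholds come from the text above; the function and its signature are illustrative, not the perf API:

```python
import statistics

def is_stable(raw_samples, loops, min_raw=1e-3, max_rel_stdev=0.10):
    """Sketch of the two stability checks described above.
    raw_samples: wall-clock times (seconds) of whole runs,
    each run executing `loops` iterations of the benchmark.
    Function name and signature are illustrative, not the perf API."""
    # Check 1: shortest raw sample must be >= 1 ms,
    # otherwise timer resolution dominates the measurement.
    if min(raw_samples) < min_raw:
        return False
    # Check 2: stdev / median of the per-loop times must be < 10%.
    per_loop = [s / loops for s in raw_samples]
    rel = statistics.stdev(per_loop) / statistics.median(per_loop)
    return rel < max_rel_stdev
```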
Someone suggested that I compare the minimum and the maximum to the median. You can already see that using perf stats:
$ python3 -m perf show --stats perf/tests/telco.json
Number of samples: 250 (50 runs x 5 samples; 1 warmup)
Standard deviation / median: 1%
Shortest raw sample: 264 ms (10 loops)
Minimum: 26.4 ms (-1.8%)
Median +- std dev: 26.9 ms +- 0.2 ms
Maximum: 27.3 ms (+1.7%)
=> the minimum and maximum are -1.8% and +1.7% away from the median
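Those percentages are just (value - median) / median. Recomputing from the rounded figures in the output gives slightly different numbers (-1.9% / +1.5%), presumably because perf computes them from the unrounded samples:

```python
# Recompute min/max deviation from the median, using the rounded
# figures shown in the stats output above.
median = 26.9   # ms
minimum = 26.4  # ms
maximum = 27.3  # ms

for label, value in (("Minimum", minimum), ("Maximum", maximum)):
    pct = (value - median) / median * 100
    print(f"{label}: {value} ms ({pct:+.1f}%)")
# Minimum: 26.4 ms (-1.9%)
# Maximum: 27.3 ms (+1.5%)
```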
When you get outliers, the maximum can be 20% above the median, or much more.
Victor
participants (2):
- Antoine Pitrou
- Victor Stinner