
On Mon, Oct 1, 2012 at 2:35 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Sun, Sep 30, 2012 at 07:12:47PM -0400, Brett Cannon wrote:
>> python3 perf.py -T --basedir ../benchmarks -f -b py3k ../cpython/builds/2.7-wide/bin/python ../cpython/builds/3.3/bin/python3.3
>>
>> ### call_method ###
>> Min: 0.491433 -> 0.414841: 1.18x faster
>> Avg: 0.493640 -> 0.416564: 1.19x faster
>> Significant (t=127.21)
>> Stddev: 0.00170 -> 0.00162: 1.0513x smaller
>
> I'm not sure if this is the right place to discuss this, but what is the justification for recording the average and std deviation of the benchmarks?
>
> If the benchmarks are based on timeit, the timeit docs warn against taking any statistic other than the minimum.

Also, because timeit is wrong to give that recommendation. There are factors - such as garbage collection - that affect operations on average, even though they may not kick in on every run. If you want to know how something will perform as part of a larger system, taking the best possible time and extrapolating from it is a mistake.

As a concrete example, consider an algorithm that creates reference cycles holding several hundred MB of memory. In the best case the RAM is available, nothing swaps, and gc doesn't kick in during the algorithm's execution. However, the larger program still has to deal with those several hundred MB sitting around until gc *does* kick in, has to pay the price of a gc run over a large heap, and has to absorb the impact on the disk read cache. When you do enough runs to see those effects - the ones *that will affect the whole program* - kick in, then you can extrapolate from that basis.

In other words, the question timeit optimises itself to answer isn't the question most folk need most of the time.

-Rob
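
To make that concrete, here is a rough sketch (not perf.py's actual code - the Node and workload names are made up purely for illustration) that times a workload producing reference cycles. timeit turns the collector off by default, so the setup string re-enables it, as the timeit docs describe; the minimum then reflects the lucky runs where collection was cheap, while the mean and standard deviation expose the gc cost the surrounding program actually pays.

import gc
import timeit


class Node(object):
    def __init__(self):
        self.ref = self  # each instance is part of a reference cycle


def workload():
    # Allocate a burst of cyclic garbage; most runs are cheap, but the
    # runs during which the cyclic collector fires are noticeably slower.
    return [Node() for _ in range(10000)]


# timeit disables the garbage collector by default; re-enabling it in the
# setup string keeps gc pauses inside the measured region, which is closer
# to how the code behaves as part of a larger program.
times = timeit.repeat(
    "workload()",
    setup="from __main__ import workload; import gc; gc.enable()",
    repeat=50,
    number=10,
)

mean = sum(times) / len(times)
stddev = (sum((t - mean) ** 2 for t in times) / (len(times) - 1)) ** 0.5

print("min    %.6fs  (what the timeit docs suggest reporting)" % min(times))
print("mean   %.6fs" % mean)
print("stddev %.6fs  (spread driven largely by gc runs)" % stddev)

On a run like that, the minimum stays flat while the mean and stddev move with the collector's behaviour - which is the kind of information an average and standard deviation capture and a bare minimum hides.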