
On Mon, Oct 1, 2012 at 2:35 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Sun, Sep 30, 2012 at 07:12:47PM -0400, Brett Cannon wrote:
>> python3 perf.py -T --basedir ../benchmarks -f -b py3k ../cpython/builds/2.7-wide/bin/python ../cpython/builds/3.3/bin/python3.3
>>
>> ### call_method ###
>> Min: 0.491433 -> 0.414841: 1.18x faster
>> Avg: 0.493640 -> 0.416564: 1.19x faster
>> Significant (t=127.21)
>> Stddev: 0.00170 -> 0.00162: 1.0513x smaller
>
> I'm not sure if this is the right place to discuss this, but what is the justification for recording the average and std deviation of the benchmarks?
>
> If the benchmarks are based on timeit, the timeit docs warn against taking any statistic other than the minimum.

Also, because timeit is wrong to give that recommendation. There are factors - such as garbage collection - that affect operations on average, even though they may not kick in on every run. If you want to know how something will perform as part of a larger system, taking the best possible time and extrapolating from it is a mistake.

As a concrete example, consider an algorithm that creates reference cycles holding several hundred MB of memory. In the best case the RAM is available, nothing swaps, and gc doesn't kick in during the algorithm's execution. However, the larger program still has to deal with those several hundred MB sitting around until gc *does* kick in, has to pay the price of a gc run over a large heap, and has to absorb the impact on the disk read cache. When you do enough runs to see those effects - the ones *that will affect the whole program* - kick in, then you can extrapolate from that basis.

In other words, the question timeit optimises itself to answer isn't the question most folk need most of the time.

-Rob
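
To make that concrete, here is a rough sketch (not perf.py's actual code - the Node and workload names are made up purely for illustration) that times a workload producing reference cycles. timeit turns the collector off by default, so the setup string re-enables it, as the timeit docs describe; the minimum then reflects the lucky runs where collection was cheap, while the mean and standard deviation expose the gc cost the surrounding program actually pays.

import gc
import timeit


class Node(object):
    def __init__(self):
        self.ref = self  # each instance is part of a reference cycle


def workload():
    # Allocate a burst of cyclic garbage; most runs are cheap, but the
    # runs during which the cyclic collector fires are noticeably slower.
    return [Node() for _ in range(10000)]


# timeit disables the garbage collector by default; re-enabling it in the
# setup string keeps gc pauses inside the measured region, which is closer
# to how the code behaves as part of a larger program.
times = timeit.repeat(
    "workload()",
    setup="from __main__ import workload; import gc; gc.enable()",
    repeat=50,
    number=10,
)

mean = sum(times) / len(times)
stddev = (sum((t - mean) ** 2 for t in times) / (len(times) - 1)) ** 0.5

print("min    %.6fs  (what the timeit docs suggest reporting)" % min(times))
print("mean   %.6fs" % mean)
print("stddev %.6fs  (spread driven largely by gc runs)" % stddev)

On a run like that, the minimum stays flat while the mean and stddev move with the collector's behaviour - which is the kind of information an average and standard deviation capture and a bare minimum hides.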