[Python-Dev] Benchmarking Python 3.3 against Python 2.7 (wide build)

Robert Collins robertc at robertcollins.net
Mon Oct 1 07:07:15 CEST 2012


On Mon, Oct 1, 2012 at 2:35 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, Sep 30, 2012 at 07:12:47PM -0400, Brett Cannon wrote:
>
>> > python3 perf.py -T --basedir ../benchmarks -f -b py3k
>> ../cpython/builds/2.7-wide/bin/python ../cpython/builds/3.3/bin/python3.3
>
>> ### call_method ###
>> Min: 0.491433 -> 0.414841: 1.18x faster
>> Avg: 0.493640 -> 0.416564: 1.19x faster
>> Significant (t=127.21)
>> Stddev: 0.00170 -> 0.00162: 1.0513x smaller
>
> I'm not sure if this is the right place to discuss this, but what is the
> justification for recording the average and std deviation of the
> benchmarks?
>
> If the benchmarks are based on timeit, the timeit docs warn against
> taking any statistic other than the minimum.

Also, because timeit is wrong to give that recommendation.

There are factors - such as garbage collection - that affect
operations on average, even though they may not kick in on every run.
If you want to know how something will perform as part of a larger
system, taking the best possible time and extrapolating from it is a
mistake. As a concrete example, consider an algorithm that creates
cycles holding several hundred MB of memory. In the best case the RAM
is available, nothing swaps, and gc doesn't kick in during the
algorithm's execution. However, the larger program still has to deal
with those several hundred MB sitting around until gc *does* kick in,
pay the price of a gc run over a large heap, and absorb the impact on
the disk read cache. Only once you do enough runs for those effects,
the ones *that will affect the whole program*, to show up can you
extrapolate from a sound basis. In other words, the question timeit
optimises itself to answer isn't the question most folk need most of
the time.
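
To make that concrete, here's a rough sketch (mine, not anything from
the benchmark suite; make_cycles and the run counts are just made up
for illustration). The workload leaves reference cycles behind, so
timings that happen to coincide with a more expensive collector pass
(e.g. a full, gen-2 collection) pull the average and stddev up, while
the minimum only ever reports the luckiest runs:

    import gc
    import math
    import timeit

    def make_cycles(n=1000):
        # Build objects that reference each other; the cycles become
        # unreachable when this returns, so only the cyclic collector
        # will ever reclaim them.
        for _ in range(n):
            a, b = [], []
            a.append(b)
            b.append(a)

    gc.enable()  # on by default; just being explicit
    runs = timeit.repeat(make_cycles, number=100, repeat=50)

    mean = sum(runs) / len(runs)
    stddev = math.sqrt(sum((r - mean) ** 2 for r in runs)
                       / (len(runs) - 1))

    # Runs that include a heavier gc pass inflate the average and
    # stddev; the minimum hides that cost entirely.
    print("min: %.6f  avg: %.6f  stddev: %.6f"
          % (min(runs), mean, stddev))

The minimum tells you the cost when everything goes right; the average
and stddev tell you what the surrounding program will actually pay.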

-Rob

