[Python-Dev] Python Benchmarks

Andrew Dalke andrewdalke at gmail.com
Sat Jun 3 19:10:52 CEST 2006

On 6/3/06, john.m.camara at comcast.net <john.m.camara at comcast.net> wrote:
> - I would average the timings of runs instead of taking the minimum value as
> sometimes bench marks could be running code that is not deterministic in its
> calculations (could be using random numbers that effect convergence).

I would rewrite those to be deterministic.  Any benchmark of mine which
uses random numbers initializes the generator with a fixed seed, and does
so in such a way that the order or choice of sub-benchmarks does not
affect the individual results.  Any other way is madness.
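
Concretely, in Python I'd do something like this (the helper name,
seed value, and input size are just for illustration):

```python
import random

def make_input(seed=12345, n=1000):
    # Give the sub-benchmark its own generator seeded with a fixed
    # value, so the input data is reproducible and independent of
    # whatever other benchmarks did to the global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Two runs produce identical input, so any timing difference comes
# from the code under test, not from the data.
assert make_input() == make_input()
```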

> - Before calculating the average number I would throw out samples
> outside 3 sigmas (the outliers).

As I've insisted, talking about sigmas assumes a Gaussian distribution.
It's more likely that the timing variations (at least in stringbench) are
closer to a gamma distribution.

> Here is a passage I found ...
>   Unfortunately, defining an outlier is subjective (as it should be), and the
>   decisions concerning how to identify them must be made on an individual
>   basis (taking into account specific experimental paradigms

The experimental paradigm I've been using is:
  - precise and accurate clock on timescales much smaller than the
       benchmark (hence continuous distributions)
  - rare, random, short and uncorrelated interruptions

This leads to a gamma distribution (plus a constant offset for the
minimum compute time).

Gamma distributions have longer tails than Gaussians and hence
more "outliers".  If you think that averaging is useful then throwing
those outliers away will artificially lower the average value.

To me, using the minimum time, given the paradigm, makes sense.
How fast is the fastest runner in the world?  Do you have him run
a dozen times and get the average, or use the shortest time?
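
That's what timeit's repeat() is for (the statement here is an
arbitrary example):

```python
import timeit

# Run the statement in several independent batches and keep the
# fastest batch -- the one least disturbed by background load.
times = timeit.repeat("''.join(str(i) for i in range(100))",
                      repeat=5, number=1000)
best = min(times)
```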

> I usually use this approach while reviewing data obtained by fairly
> accurate sensors so being conservative using 3 sigmas works
> well for these cases.

That uses a different underlying physical process which is better
modeled by Gaussians.

Consider this.  For a given benchmark there is an absolute minimum time
for it to run on a given machine.  Suppose this is 10 seconds and the
benchmark timing comes out 10.2 seconds.  The +0.2 comes from
background overhead, though you don't know exactly what's due to overhead
and what's real.

If the times were Gaussian then there's as much chance of getting
benchmark times of 10.5 seconds as of 9.9 seconds.  But repeat the
benchmark as many times as you want and you'll never see 9.9 seconds,
though you will see 10.5.
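
You can see this with a quick simulation.  Modeling the overhead as
gamma-distributed noise is an assumption following the paradigm above,
not measured data:

```python
import random

random.seed(42)   # fixed seed, per my earlier point
true_time = 10.0  # absolute minimum run time on this machine

# Overhead from interruptions is non-negative and skewed, so model
# it as gamma noise added on top of the true compute time.
samples = [true_time + random.gammavariate(2.0, 0.1)
           for _ in range(1000)]

mean = sum(samples) / len(samples)
# The mean is biased upward by the overhead; the minimum converges
# toward the true compute time, and no sample ever falls below it.
assert min(samples) >= true_time
assert min(samples) < mean
```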

> - Another improvement to bench marks can be obtained when both
> the old and new code is available to be benched mark together.

That's what stringbench does, comparing unicode and 8-bit strings.
However, how do you benchmark changes which are more complicated
than that?  For example, benchmark changes to the exception
mechanism, or builds under gcc 3.x and 4.x.

                                dalke at dalkescientific.com
