[Python-Dev] Python Benchmarks
M.-A. Lemburg
mal at egenix.com
Fri Jun 2 15:29:14 CEST 2006
Andrew Dalke wrote:
> M.-A. Lemburg:
>> The approach pybench is using is as follows:
> ...
>> The calibration step is run multiple times and is used
>> to calculate an average test overhead time.
>
> One of the changes that occurred during the sprint was to change this
> algorithm to use the best time rather than the average. Using the
> average assumes a Gaussian distribution, but timing results are not
> Gaussian. There is an absolute best time, but it is rarely reached
> due to background noise. The distribution is more like a gamma
> distribution plus the minimum time.
>
> To show the distribution is non-Gaussian I ran the following:
>
> import time
>
> def compute():
>     x = 0
>     for i in range(1000):
>         for j in range(1000):
>             x += 1
>
> def bench():
>     t1 = time.time()
>     compute()
>     t2 = time.time()
>     return t2 - t1
>
> times = []
> for i in range(1000):
>     times.append(bench())
>
> print times
>
> The full distribution is attached as 'plot1.png' and the close up
> (range 0.45-0.65) as 'plot2.png'. Not a clean gamma function, but
> that's a closer match than an exponential.
>
> The gamma distribution looks more like an exponential function when
> the shape parameter is large. This corresponds to a large amount of
> noise in the system, so the run time is not close to the best time.
> This means the averaging approach works better when there is a lot
> of random background activity, which is not the usual case when I
> try to benchmark.
>
> When averaging a gamma distribution you'll end up with a bit of a
> skew, and I think the skew depends on the number of samples,
> reaching a limit point.
>
> Using the minimum time should be more precise because there is a
> definite lower bound and the machine should be stable. In my test
> above the first few results are
>
> 0.472838878632
> 0.473038911819
> 0.473326921463
> 0.473494052887
> 0.473829984665
>
> I'm pretty certain the best time is 0.4725, or very close to that.
> But the average time is 0.58330151391 because of the long tail.
> Here are the last 6 results in my population of 1000:
>
> 1.76353311539
> 1.79937505722
> 1.82750201225
> 2.01710510254
> 2.44861507416
> 2.90868496895
>
> Granted, I hit a couple of web pages while doing this and my spam
> filter processed
> my mailbox in the background...
>
> There's probably some Markov modeling which would look at the number
> and distribution of samples so far and, assuming a gamma distribution,
> determine how many more samples are needed to get a good estimate of
> the absolute minimum time. But min(large enough samples) should work
> fine.
Thanks for the great analysis!
Using the minimum looks like the way to go for calibration.
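Roughly, as a sketch only (the names overhead_test and
calibration_runs are made up here, not pybench's actual API), that
would look like this:

import time

def calibrate(overhead_test, calibration_runs=20):
    # Run the (empty) overhead test several times and keep the
    # minimum, rather than the average, as the overhead estimate.
    best = None
    for i in range(calibration_runs):
        t0 = time.time()
        overhead_test()
        t1 = time.time()
        if best is None or t1 - t0 < best:
            best = t1 - t0
    return best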
I wonder whether the same is true for the actual tests; since
you're looking for the expected run-time, the minimum may
not necessarily be the right choice. Then again, in both cases
you are only looking at a small number of samples (20 for
the calibration, 10 for the number of rounds), so this
may be irrelevant.
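As a purely illustrative simulation (not pybench code; the floor,
shape and scale values are made up), modelling a timing as a fixed
lower bound plus gamma-distributed noise shows how far apart min and
mean can be even for small sample counts:

import random

def simulate(n_samples, floor=0.4725, shape=2.0, scale=0.05):
    # Fake timings: a fixed lower bound plus gamma-distributed noise.
    samples = [floor + random.gammavariate(shape, scale)
               for i in range(n_samples)]
    return min(samples), sum(samples) / len(samples)

for n in (10, 20, 1000):
    best, average = simulate(n)
    print "n=%4d  min=%.4f  mean=%.4f" % (n, best, average)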
BTW, did you run this test on Windows or a Unix machine?
There's also an interesting second peak at around 0.53.
What could be causing this?
>> If the whole suite runs in 50 seconds, the per-test
>> run-times are far too small to be accurate. I usually
>> adjust the warp factor so that each *round* takes
>> 50 seconds.
>
> The stringbench.py I wrote uses the timeit algorithm which
> dynamically adjusts the test to run between 0.2 and 2 seconds.
>
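For reference, that kind of auto-scaling boils down to something like
the following sketch (not stringbench's actual code; the name
find_number_of_runs is made up):

import time

def find_number_of_runs(stmt, min_time=0.2):
    # Grow the repetition count until one timing run takes at least
    # min_time seconds -- roughly what timeit's auto-ranging does.
    number = 1
    while True:
        t0 = time.time()
        for i in range(number):
            stmt()
        if time.time() - t0 >= min_time:
            return number
        number *= 10
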
>> That's why the timers being used by pybench will become a
>> parameter that you can then select to adapt pybench to
>> the OS you're running pybench on.
>
> Wasn't that decision a consequence of the problems found during
> the sprint?
It's a consequence of a discussion I had with Steve Holden
and Tim Peters:

I believe that using wall-clock timers for benchmarking is not
a good approach due to the high noise level. Process time timers
typically have a lower resolution, but give a better picture of
the actual run-time of your code and don't exhibit as much noise
as the wall-clock approach. Of course, you then have to run the
tests somewhat longer to get reasonable timing accuracy.

Tim thinks that it's better to use short-running tests and an
accurate timer, accepting the added noise and counting on the
user to make sure that the noise level is kept to a minimum.

Since I like to give users the option of choosing for
themselves, I'm going to make the choice of timer an
option.
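In rough pseudo-code the option would look something like this (a
sketch only; these are not pybench's actual option names):

import time

# The names 'wall' and 'process' are made up for illustration.
# Note that time.clock() measures process (CPU) time on Unix and a
# high-resolution wall clock on Windows.
TIMERS = {
    'wall': time.time,
    'process': time.clock,
}

def run_round(test, timer='process'):
    # Time one round of a test with the selected timer.
    get_time = TIMERS[timer]
    t0 = get_time()
    test()
    return get_time() - t0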
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Jun 02 2006)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________
2006-07-03: EuroPython 2006, CERN, Switzerland 30 days left
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::