[Python-checkins] r46505 - python/trunk/Tools/pybench/systimes.py

Steve Holden steve at holdenweb.com
Wed Jun 7 12:21:50 CEST 2006


M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
> 
>>M.-A. Lemburg wrote:
[...]
>>>In fact, if you run time.time() vs. resource.getrusage() on
>>>a Linux box, you'll find that both more or less show the same
>>>flux in timings - with an error interval of around 15ms.
>>
>>which is easily explained by "cycle stealers", and is destroying the 
>>benchmark's precision. 
> 
> 
> Right, but there's nothing much you can do about it, I'm afraid.
> 
And therein lies the crux of this discussion. It's pointless to talk 
about "1.56% accuracy" under circumstances like this, where the noise 
is of the same order as the effects being measured. It's also, in my 
opinion, a bit pointless to derive a notional "per operation" figure 
for each class of operation, but let's let that slide.

Benchmarks are useful for discovering whether one system is faster than 
another for a given processing load. The days of deterministic timing 
are long gone, so we have to accept that.
> 
>>and as usual, if you don't have precision, you 
>>don't really have accuracy (unless you have a good statistical model, 
>>and enough data to use it; see Andrew's posts for more on that).
> 
> 
> If you know how big your error interval is, then you are
> already in a very good position. If you can narrow down
> that interval, you're in an even better position. How
> this can be done depends on the method of timing you're
> using and whether you run the benchmark using many short
> runs, a few long ones or many long ones.
> 

But if the benchmark gives radically different results under each of 
those circumstances, then there isn't much point in providing comparison 
features.
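
To make the "many short runs vs. a few long ones" trade-off concrete, 
here is a rough sketch (the workload and the run/repetition counts are 
made up for illustration) that spends the same total work both ways and 
reports the per-operation spread of each strategy:

    import time

    def run_once(work, repetitions):
        # Wall-clock time for `repetitions` back-to-back calls of `work`.
        start = time.time()
        for _ in range(repetitions):
            work()
        return time.time() - start

    def spread(work, runs, repetitions):
        # Per-operation times from `runs` separate timed runs.
        per_op = sorted(run_once(work, repetitions) / repetitions
                        for _ in range(runs))
        return per_op[0], per_op[len(per_op) // 2], per_op[-1]

    def workload():
        sum(i * i for i in range(10000))

    for label, runs, reps in (("many short", 100, 10), ("few long", 5, 200)):
        lo, mid, hi = spread(workload, runs, reps)
        print("%-10s min=%.6f median=%.6f max=%.6f" % (label, lo, mid, hi))

If the two strategies disagree badly, that disagreement is itself a 
measure of how little a single comparison can be trusted.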

This latest conversation all started because we observed, at the Need 
For Speed sprint, that there didn't seem to be any reliable way to 
determine whether a given change to the interpreter resulted in a speed 
increase. When Tim timed changes that affected the pystone benchmark he 
observed, among other things, differences of up to 50% due simply to 
CPU core temperature.

Ultimately I suspect that the answer is to have more benchmarks 
available, to persuade people to run them more frequently, and to place 
less absolute trust in the output of any single run.
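
In that spirit, the standard library already makes the "don't trust one 
run" discipline cheap. A minimal sketch using timeit (the statement and 
the counts are arbitrary):

    import timeit

    # Five separate runs of the same statement. The spread across runs
    # shows how much any single figure could mislead; the minimum is
    # usually the least polluted by cycle stealers.
    times = timeit.repeat("sum(i * i for i in range(1000))",
                          repeat=5, number=1000)
    print("per-run totals:", ["%.4f" % t for t in times])
    print("best: %.4f  worst: %.4f" % (min(times), max(times)))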

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Love me, love my blog  http://holdenweb.blogspot.com
Recent Ramblings     http://del.icio.us/steve.holden

