[Python-Dev] Python Benchmarks

Mon Jun 5 17:20:44 CEST 2006

M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
> 
>>M.-A. Lemburg wrote:
>>
>>
>>>Seriously, I've been using and running pybench for years
>>>and even though tweaks to the interpreter do sometimes
>>>result in speedups or slow-downs where you wouldn't expect
>>>them (due to the interpreter using the Python objects),
>>>they are reproducible and often enough have uncovered
>>>that optimizations in one area may well result in slow-downs
>>>in other areas.
>>
>> > Often enough the results are related to low-level features
>> > of the architecture you're using to run the code such as
>> > cache size, cache lines, number of registers in the CPU or
>> > on the FPU stack, etc. etc.
>>
>>and that observation has never made you stop and think about whether 
>>there might be some problem with the benchmarking approach you're using? 
> 
> 
> The approach pybench is using is as follows:
> 
> * Run a calibration step which does the same as the actual
>   test without the operation being tested (i.e. call the
>   function running the test, set up the for-loop, constant
>   variables, etc.)
> 
>   The calibration step is run multiple times and is used
>   to calculate an average test overhead time.
> 
I believe my recent changes now take the minimum time rather than 
computing an average, since the minimum seems to be the best reflection 
of achievable speed. I assumed that we wanted to measure achievable 
speed rather than average speed as our benchmark of performance.
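
As a rough sketch of the difference (this is not the pybench code itself; 
the names `calibrate` and `empty_loop` and the choice of timer are 
assumptions made purely for illustration), the calibration loop can 
report both figures from the same set of samples:

```python
import time

def empty_loop(n=10000):
    # Same loop scaffolding as a test, minus the operation being tested.
    for _ in range(n):
        pass

def calibrate(func, runs=20, timer=time.perf_counter):
    # Time func() `runs` times; return (minimum, mean) of the samples.
    samples = []
    for _ in range(runs):
        t0 = timer()
        func()
        samples.append(timer() - t0)
    return min(samples), sum(samples) / float(len(samples))

overhead_min, overhead_avg = calibrate(empty_loop)
print("min %.6fs  avg %.6fs" % (overhead_min, overhead_avg))
```

Any run that gets interrupted by another process inflates the mean, 
whereas the minimum only moves if every single run is disturbed.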

> * Run the actual test which runs the operation multiple
>   times.
> 
>   The test is then adjusted to make sure that the
>   test overhead / test run ratio remains within
>   reasonable bounds.
> 
>   If needed, the operation code is repeated verbatim in
>   the for-loop, to decrease the ratio.
> 
> * Repeat the above for each test in the suite
> 
> * Repeat the suite for N rounds
> 
> * Calculate the average run time of all test runs in all rounds.
> 
Again, we are now using the minimum value. The reasons are similar: if 
extraneous processes interfere with timings then we don't want that to 
be reflected in the given timings. That's why we now report "notional 
minimum round time", since it's highly unlikely that any specific test 
round will give the minimum time for all tests.
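
To make the term concrete, here is a small sketch of how such a figure 
could be computed. The function name, the test names and the timings 
are all invented for the example, and the real pybench code may well 
organise its data differently:

```python
def notional_minimum_round_time(rounds):
    # rounds: list of dicts mapping test name -> seconds for that round.
    # Take each test's best time over all rounds, then add those up.
    names = rounds[0].keys()
    return sum(min(r[name] for r in rounds) for name in names)

# Hypothetical timings (seconds) for two tests over three rounds.
rounds = [
    {"TryExcept": 0.50, "UnicodeUpper": 0.25},
    {"TryExcept": 0.25, "UnicodeUpper": 0.75},
    {"TryExcept": 0.75, "UnicodeUpper": 0.50},
]

notional = notional_minimum_round_time(rounds)
best_actual_round = min(sum(r.values()) for r in rounds)
print(notional, best_actual_round)
```

Here the notional minimum is lower than the fastest round actually 
observed, because no single round contains the best time for both tests.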

Even with these changes we still see some disturbing variations in 
timing both on Windows and on Unix-like platforms.
> 
>>  after all, if a change to e.g. the try/except code slows things down 
>>or speeds things up, is it really reasonable to expect that the time it
>>takes to convert Unicode strings to uppercase should suddenly change due 
>>to cache effects or a changing number of registers in the CPU?  real 
>>hardware doesn't work that way...
> 
> 
> Of course, but then changes to try-except logic can interfere
> with the performance of setting up method calls. This is what
> pybench then uncovers.
> 
> The only problem I see in the above approach is the way
> calibration is done. The run-time of the calibration code
> may be too small with respect to the resolution of the timers used.
> 
> Again, please provide the parameters you've used to run the
> test case and the output. Things like warp factor, overhead,
> etc. could hint at the problem you're seeing.
> 
> 
>>is PyBench perhaps using the following approach:
>>
>>     T = set of tests
>>     for N in range(number of test runs):
>>         for t in T:
>>             t0 = get_process_time()
>>             t()
>>             t1 = get_process_time()
>>             assign t1 - t0 to test t
>>             print assigned time
>>
>>where t1 - t0 is very short?
> 
> 
> See above (or the code in pybench.py). t1-t0 is usually
> around 20-50 seconds:
> 
> """
>         The tests must set .rounds to a value high enough to let the
>         test run between 20-50 seconds. This is needed because
>         clock()-timing only gives rather inaccurate values (on Linux,
>         for example, it is accurate to a few hundredths of a
>         second). If you don't want to wait that long, use a warp
>         factor larger than 1.
> """
> 
First, I'm not sure that this is the case for the default test 
parameters on modern machines. On my current laptop, for example, I see 
a round time of roughly four seconds and a notional minimum round time 
of 3.663 seconds.

Secondly, while this recommendation may be very sensible, with 50 
individual tests a decrease in the warp factor to 1 (the default is 
currently 20) isn't sufficient to increase individual test times to your 
recommended value, and decreasing the warp factor tends also to decrease 
reliability and repeatability.

Thirdly, since each round of the suite at warp factor 1 takes between 80 
and 90 seconds, pybench run this way isn't something one can usefully 
use to quickly evaluate the impact of a single change - particularly 
since even continuing development work on the benchmark machine 
potentially affects the benchmark results in unknown ways.
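
The arithmetic behind the second and third points can be checked in a 
few lines. This is only a back-of-the-envelope sketch: it assumes the 
round time is spread evenly over the 50 tests, which is certainly not 
exactly true, and the variable names are made up for the example:

```python
TESTS = 50                   # individual tests in the suite
ROUND_AT_WARP_1 = (80, 90)   # observed seconds per round at warp factor 1
TARGET = (20, 50)            # seconds per test asked for by the docstring

# Average time per test at warp factor 1 (even split assumed).
per_test = tuple(float(t) / TESTS for t in ROUND_AT_WARP_1)

# Round time that would be needed to hit the docstring's target.
round_needed = tuple(t * TESTS for t in TARGET)

print("per test at warp 1: %.1f-%.1f s" % per_test)
print("round time needed:  %d-%d s" % round_needed)
```

So each test averages well under two seconds even at warp factor 1, 
while meeting the 20-50 second target would push a single round to 
somewhere between 1000 and 2500 seconds.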

> 
>>that's not a very good idea, given how get_process_time tends to be 
>>implemented on current-era systems (google for "jiffies")...  but it 
>>definitely explains the bogus subtest results I'm seeing, and the "magic 
>>hardware" behaviour you're seeing.
> 
> 
> That's exactly the reason why tests run for a relatively long
> time - to minimize these effects. Of course, using wall time
> makes this approach vulnerable to other effects such as current
> load of the system, other processes having a higher priority
> interfering with the timed process, etc.
> 
> For this reason, I'm currently looking for ways to measure the
> process time on Windows.
> 
I wish you luck with this search, as we clearly do need to improve 
repeatability of pybench results across all platforms, and particularly 
on Windows.

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Love me, love my blog  http://holdenweb.blogspot.com
Recent Ramblings     http://del.icio.us/steve.holden