[Python-Dev] Python Benchmarks

Sat Jun 3 17:25:12 CEST 2006

Here are my suggestions:

- While running benchmarks, don't listen to music, watch videos, use the keyboard or mouse, or run anything other than the benchmark code.  Seems like common sense to me.

- I would average the timings of the runs instead of taking the minimum value, because a benchmark may be running code that is not deterministic in its calculations (for example, it could be using random numbers that affect convergence).

- Before calculating the average I would throw out samples that lie more than 3 sigmas from the mean (the outliers).  This would eliminate the samples that are out of whack due to events outside our control.  To use this approach it would be necessary to run the benchmark some minimum number of times.  I believe 30-40 samples would be necessary, but I'm no expert in statistics.  I base this on my recollection of a study I did on this some time in the late 90s.  I used to have a better feel for the number of samples required for a given number of sigmas used to determine the outliers, but I have to confess that I normally just use a minimum of 100 samples to play it safe.  I'm sure that with a little experimentation with benchmarks the proper number of samples could be determined.

Here is a related passage I found at http://www.statsoft.com/textbook/stbasic.html#Correlationsf.

'''Quantitative Approach to Outliers. Some researchers use quantitative methods to exclude outliers. For example, they exclude observations that are outside the range of ±2 standard deviations (or even ±1.5 sd's) around the group or design cell mean. In some areas of research, such "cleaning" of the data is absolutely necessary. For example, in cognitive psychology research on reaction times, even if almost all scores in an experiment are in the range of 300-700 milliseconds, just a few "distracted reactions" of 10-15 seconds will completely change the overall picture. Unfortunately, defining an outlier is subjective (as it should be), and the decisions concerning how to identify them must be made on an individual basis (taking into account specific experimental paradigms and/or "accepted practice" and general research experience in the respective area). It should also be noted that in some rare cases, the relative frequency of outliers across a number of groups or cells of a design can be subjected to analysis and provide interpretable results. For example, outliers could be indicative of the occurrence of a phenomenon that is qualitatively different than the typical pattern observed or expected in the sample, thus the relative frequency of outliers could provide evidence of a relative frequency of departure from the process or phenomenon that is typical for the majority of cases in a group.'''

Now I personally feel that a 1.5 or 2 sigma cutoff is too aggressive for benchmarks (it would throw out too many samples), and that the 3 sigmas I suggested might be too lenient.  From experimentation we might find that 2.5 is more appropriate.  I usually use this approach when reviewing data obtained from fairly accurate sensors, so being conservative and using 3 sigmas works well for those cases.
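To make the outlier rejection described above concrete, here is a minimal sketch.  The function name trimmed_mean and the n_sigma parameter are my own placeholders, and it assumes a single pass: the mean and standard deviation are computed once from all samples and the cutoff is not re-applied after trimming.

```python
import statistics

def trimmed_mean(samples, n_sigma=3.0):
    """Average the samples that lie within n_sigma standard deviations
    of the raw mean.  Returns (mean_of_kept_samples, number_dropped)."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (n - 1)
    # Keep only the samples inside the +/- n_sigma band around the mean.
    kept = [s for s in samples if abs(s - mean) <= n_sigma * sd]
    return statistics.mean(kept), len(samples) - len(kept)

# 99 quiet runs and one run disturbed by something else on the machine.
timings = [1.0] * 99 + [100.0]
average, dropped = trimmed_mean(timings, n_sigma=3.0)
```

One reason a minimum sample count matters: a single sample can be at most (n - 1) / sqrt(n) sample standard deviations from the mean of n samples, so with 10 samples nothing can ever fall outside 3 sigmas.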

The last statement in the passage is worth noting, as a high ratio of outliers could be used as an indication that the benchmark results for a particular run are invalid.  This could be used to throw out bad results caused by someone starting to listen to music while the benchmarks are running, antivirus software kicking in, etc.
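A rough sketch of that validity check follows.  The names outlier_fraction and run_is_suspect and the 5% threshold are assumptions for illustration only; the right threshold would have to come from experimentation.

```python
import statistics

def outlier_fraction(samples, n_sigma=3.0):
    """Fraction of samples lying more than n_sigma standard deviations
    from the mean of the whole run."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    if sd == 0:
        return 0.0  # all samples identical, so there are no outliers
    outliers = sum(1 for s in samples if abs(s - mean) > n_sigma * sd)
    return outliers / len(samples)

def run_is_suspect(samples, n_sigma=3.0, max_fraction=0.05):
    """Flag a run whose outlier ratio exceeds max_fraction.
    max_fraction=0.05 is an arbitrary placeholder, not a tested value."""
    return outlier_fraction(samples, n_sigma) > max_fraction
```

A run flagged this way would be repeated rather than averaged.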

- Another improvement can be obtained when both the old and the new code are available to be benchmarked together.  By running the benchmarks of both versions side by side we could eliminate the effects of noise, if we assume that noise at a given point in time affects both versions equally.  Here is a modified version of the code that Andrew wrote previously, which shows this more clearly than my words.

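import time

def compute_old():
    # "Old" implementation: plain assignment.
    x = 0
    for i in range(1000):
        for j in range(1000):
            x = x + 1

def compute_new():
    # "New" implementation: augmented assignment.
    x = 0
    for i in range(1000):
        for j in range(1000):
            x += 1

def bench():
    # Time both versions back to back so that noise at this moment
    # affects both measurements.
    t1 = time.perf_counter()
    compute_old()
    t2 = time.perf_counter()
    compute_new()
    t3 = time.perf_counter()
    return t2-t1, t3-t2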

times_old = []
times_new = []
for i in range(1000):
    time_old, time_new = bench()
    times_old.append(time_old)
    times_new.append(time_new)
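
One way the paired timings could then be compared is sketched below.  The function paired_summary is a hypothetical helper, not part of Andrew's code; it assumes the two lists are the same length and that entry i of each list came from the same call to bench().

```python
import statistics

def paired_summary(times_old, times_new):
    """Summarize per-iteration differences between paired timings.
    Returns (mean_difference, standard_error); a positive mean
    difference means the new code was faster on average."""
    # Differencing pair by pair cancels noise common to both timings.
    diffs = [old - new for old, new in zip(times_old, times_new)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean_diff, std_err
```

Calling paired_summary(times_old, times_new) after the loop above gives the average saving per run along with a rough measure of how noisy that estimate is.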

John