Re: [Speed] Are benchmarks and libraries mutable?
On Sat, Sep 1, 2012 at 10:21 AM, Brett Cannon <brett@python.org> wrote:
Now that I can run benchmarks against Python 2.7 and 3.3 simultaneously, I'm ready to start updating the benchmarks. This involves two parts.
One is moving benchmarks from PyPy over to the unladen repo on hg.python.org/benchmarks. But I wanted to first make sure people don't view the benchmarks as immutable (e.g. as Octane does: https://developers.google.com/octane/faq). Since the benchmarks are always relative between two interpreters, their immutability isn't as critical as it would be if we reported some overall score. But it also means that any changes made would throw off historical comparisons. For instance, if I take PyPy's Mako benchmark (which does a lot more work), should it be named mako_v2, or should we just replace mako wholesale?
I dislike benchmark immutability. The rest of the world, including the local computing environment where the benchmarks run, keeps changing around them, which makes comparing historical benchmark data from a run on an old version against a recent, modern run pointless.
What is needed more is benchmark *rerunnability* and *repeatability*: an old version of a Python implementation should be able to be built and run against the current benchmark suite today, within the exact same environment as a current version of a Python implementation. The key is that both ran the same thing on the same hardware, in the same configuration, at around the same time.
Nothing else is a valid comparison, as too many untracked, unquantified variables have changed in the interim.
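As a rough sketch of what I mean (the interpreter paths and benchmark script name here are made up; this is not the actual perf.py runner): run both builds back-to-back in one session so everything else is held constant.

# A minimal sketch of the "same hardware, same time" idea: both interpreter
# builds run the same benchmark script back-to-back in one session, so the
# environment is identical across the comparison. Paths are hypothetical.
import subprocess
import time

INTERPRETERS = {
    "cpython-2.7": "/opt/builds/cpython-2.7/bin/python",   # assumed build path
    "cpython-3.3": "/opt/builds/cpython-3.3/bin/python3",  # assumed build path
}
BENCHMARK = "bench_mako.py"  # hypothetical benchmark script

def run_once(python, script):
    """Time one run of `script` under the given interpreter binary."""
    start = time.time()
    subprocess.check_call([python, script])
    return time.time() - start

results = {name: run_once(path, BENCHMARK) for name, path in INTERPRETERS.items()}
for name, seconds in sorted(results.items()):
    print("%s: %.3f s" % (name, seconds))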
Where the above clearly fails: creating historical trend graphs. If you want a setup that runs the benchmarks after every commit, or at least as continuously as possible, _that_ benchmark suite needs to be as immutable as possible. The machine it runs on also needs to be locked down: no updates applied and nothing else *ever* running on it. Whenever the benchmark suite, or the OS, distro, or hardware of the trend-running machine, is changed, it needs to be clearly noted so that deltas at that point in the results can be flagged as a discontinuity in the trend data caused by the external change. One way to do this is to always version benchmark names: any time a benchmark is updated, give it a new versioned name so it can't be compared with past results.
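For example, something along these lines (the field names and storage format are only illustrative, not part of the actual suite) would make both the versioned names and the environment discontinuities visible in the recorded data:

# A sketch of "version the name, record the environment": each result is
# stored under a versioned benchmark name (e.g. 'mako_v2') together with
# enough machine metadata to spot discontinuities in the trend later.
import json
import platform
import time

def record_result(benchmark_name, version, seconds):
    """Append one result to a results log; hypothetical file name and schema."""
    entry = {
        "benchmark": "%s_v%d" % (benchmark_name, version),  # never reuse old names
        "seconds": seconds,
        "timestamp": time.time(),
        "hostname": platform.node(),
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    with open("results.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

record_result("mako", 2, 1.234)  # hypothetical timing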
Otherwise, for historical data, periodically rerunning the benchmark suite on older versions (releases and betas) for use in modern comparisons is ideal.
-gps
And the second is the same question for libraries. For instance, the unladen benchmarks have Django 1.1a0 as the version, which is rather ancient, and with 1.5 coming out with provisional Python 3 support I obviously would like to update it. But the same questions as with benchmarks crop up with regard to immutability. Another issue is that 2to3 can't actually be ported using 2to3 (http://bugs.python.org/issue15834), so that benchmark itself will require two versions -- a 2.x version (probably from Python 2.7's stdlib) and a 3.x version (from the 3.2 stdlib) -- which already raises interesting issues for me in terms of comparing performance (e.g. I will probably have to update the 2.7 code to use io.BytesIO instead of StringIO.StringIO to be on more equal footing). A similar thing goes for html5lib, which has developed its Python 3 support separately from its Python 2 code.
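To illustrate the equal-footing point (the payload here is made up): io.BytesIO is available on both 2.7 and 3.x, so both versions of the benchmark code can exercise the same buffer type instead of StringIO.StringIO on one side only.

# A small sketch: io.BytesIO works identically on Python 2.7 and 3.x,
# so the 2.x benchmark code can use it for a fairer comparison.
import io

buf = io.BytesIO()
buf.write(b"<html>rendered output</html>")  # bytes literals work on 2.7 and 3.x
data = buf.getvalue()
assert data.startswith(b"<html>")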
If we can't find a reasonable way to handle all of this, then what I will do is branch the unladen benchmarks for 2.x/3.x benchmarking, and then create another branch of the benchmark suite just for Python 3.x, so that we can start fresh with a new set of benchmarks, which will themselves never change, for benchmarking Python 3 itself. That would also mean we could start off with whatever is needed from PyPy and unladen to have the optimal benchmark runner for speed.python.org.