Re: [Speed] Are benchmarks and libraries mutable?
On Sat, Sep 1, 2012 at 10:21 AM, Brett Cannon <brett@python.org> wrote:
Now that I can run benchmarks against Python 2.7 and 3.3 simultaneously, I'm ready to start updating the benchmarks. This involves two parts.
One is moving benchmarks from PyPy over to the unladen repo on hg.python.org/benchmarks. But I wanted to first make sure people don't view the benchmarks as immutable (e.g. as Octane does: https://developers.google.com/octane/faq). Since the benchmarks are always relative between two interpreters, their immutability isn't as critical as it would be if we reported some overall score. But it also means that any changes made would throw off historical comparisons. For instance, if I take PyPy's Mako benchmark (which does a lot more work), should it be named mako_v2, or should we just replace mako wholesale?
I dislike benchmark immutability. The rest of the world, including the local computing environment where the benchmarks run, keeps changing around them, which makes comparing historical benchmark data from a run on an old version against a recent, modern run pointless.
What is needed more is benchmark *rerunnability* and *repeatability*: an old version of a Python implementation should be able to be built and run against the current benchmark suite today, within the exact same environment as a current version of a Python implementation. The key is that both ran the same thing on the same hardware, in the same configuration, at around the same time.
Nothing else is a valid comparison, as too many untracked, unquantified variables have changed in the interim.
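As a rough sketch of what I mean (the interpreter paths and benchmark script name here are made up; this is not the actual perf.py runner): run both builds back-to-back in one session so everything else is held constant.

# A minimal sketch of the "same hardware, same time" idea: both interpreter
# builds run the same benchmark script back-to-back in one session, so the
# environment is identical across the comparison. Paths are hypothetical.
import subprocess
import time

INTERPRETERS = {
    "cpython-2.7": "/opt/builds/cpython-2.7/bin/python",   # assumed build path
    "cpython-3.3": "/opt/builds/cpython-3.3/bin/python3",  # assumed build path
}
BENCHMARK = "bench_mako.py"  # hypothetical benchmark script

def run_once(python, script):
    """Time one run of `script` under the given interpreter binary."""
    start = time.time()
    subprocess.check_call([python, script])
    return time.time() - start

results = {name: run_once(path, BENCHMARK) for name, path in INTERPRETERS.items()}
for name, seconds in sorted(results.items()):
    print("%s: %.3f s" % (name, seconds))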
Where the above clearly fails: creating historical trend graphs. If you want a setup that runs the benchmarks after every commit, or at least as continuously as possible, _that_ benchmark suite needs to be as immutable as possible. The machine it runs on also needs to be locked down: no updates applied and nothing else *ever* running on it. Whenever the benchmark suite, or the OS, distro, or hardware of the trend-running machine, is changed, it needs to be clearly noted so that deltas at that point in the results can be flagged as a discontinuity in the trend data caused by the external change. One way to do this is to always version benchmark names: any time a benchmark is updated, give it a new versioned name so it can't be compared with past results.
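For example, something along these lines (the field names and storage format are only illustrative, not part of the actual suite) would make both the versioned names and the environment discontinuities visible in the recorded data:

# A sketch of "version the name, record the environment": each result is
# stored under a versioned benchmark name (e.g. 'mako_v2') together with
# enough machine metadata to spot discontinuities in the trend later.
import json
import platform
import time

def record_result(benchmark_name, version, seconds):
    """Append one result to a results log; hypothetical file name and schema."""
    entry = {
        "benchmark": "%s_v%d" % (benchmark_name, version),  # never reuse old names
        "seconds": seconds,
        "timestamp": time.time(),
        "hostname": platform.node(),
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    with open("results.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

record_result("mako", 2, 1.234)  # hypothetical timing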
Otherwise, for historical data, periodically rerunning the benchmark suite on older versions (releases and betas) for use in modern comparisons is ideal.
-gps
And the second is the same question for libraries. For instance, the unladen benchmarks have Django 1.1a0 as the version, which is rather ancient, and with 1.5 coming out with provisional Python 3 support I obviously would like to update it. But the same questions as with benchmarks crop up with regard to immutability. Another issue is that 2to3 can't actually be ported using 2to3 (http://bugs.python.org/issue15834), so that benchmark itself will require two versions -- a 2.x version (probably from Python 2.7's stdlib) and a 3.x version (from the 3.2 stdlib) -- which already raises interesting issues for me in terms of comparing performance (e.g. I will probably have to update the 2.7 code to use io.BytesIO instead of StringIO.StringIO to be on more equal footing). A similar thing goes for html5lib, which has developed its Python 3 support separately from its Python 2 code.
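To illustrate the equal-footing point (the payload here is made up): io.BytesIO is available on both 2.7 and 3.x, so both versions of the benchmark code can exercise the same buffer type instead of StringIO.StringIO on one side only.

# A small sketch: io.BytesIO works identically on Python 2.7 and 3.x,
# so the 2.x benchmark code can use it for a fairer comparison.
import io

buf = io.BytesIO()
buf.write(b"<html>rendered output</html>")  # bytes literals work on 2.7 and 3.x
data = buf.getvalue()
assert data.startswith(b"<html>")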
If we can't find a reasonable way to handle all of this, then what I will do is branch the unladen benchmarks for 2.x/3.x benchmarking, and then create another branch of the benchmark suite just for Python 3.x, so that we can start fresh with a new set of benchmarks, which will themselves never change, for benchmarking Python 3 itself. That would also mean we could start off with whatever is needed from PyPy and unladen to have the optimal benchmark runner for speed.python.org.