On Tue, Jan 31, 2012 at 15:04, Maciej Fijalkowski email@example.com wrote:
On Tue, Jan 31, 2012 at 9:55 PM, Brett Cannon firstname.lastname@example.org wrote:
On Tue, Jan 31, 2012 at 13:44, Maciej Fijalkowski email@example.com wrote:
On Tue, Jan 31, 2012 at 7:40 PM, Brett Cannon firstname.lastname@example.org wrote:
On Tue, Jan 31, 2012 at 11:58, Paul Graydon email@example.com wrote:
And this is a fundamental issue with tying benchmarks to real applications and libraries: if the code the benchmark relies on never changes to Python 3, then the benchmark is dead in the water. As Daniel pointed out, if spitfire simply never converts then either we need to convert it ourselves *just* for the benchmark (yuck), live w/o the
benchmark (ok, but if this happens to a bunch of benchmarks then we are not going to have a lot of data), or we look at making new benchmarks based on apps/libraries that _have_ made the switch to Python 3 (which means trying to agree on some new set of benchmarks to add to the current set).
What were the criteria by which the original benchmark sets were chosen? I'm assuming it was because they're generally popular libraries amongst developers across a variety of purposes, so speed.pypy would show the speed of regular tasks?
That's the reason unladen swallow chose them, yes. PyPy then adopted them and added in the Twisted benchmarks.
If so, presumably it shouldn't be too hard to find appropriate libraries for Python 3?
Perhaps, but someone has to put in the effort to find those benchmarks, code them up, show how they are a reasonable workload, and then get them accepted. Everyone likes the current set because the unladen team put a lot of time and effort into selecting and creating those benchmarks.
I think we also spent a significant amount of time grabbing various benchmarks from various places (we = people who contributed to the speed.pypy.org benchmark suite, which is by far not a group consisting only of pypy devs).
Where does the PyPy benchmark code live, anyway?
You might be surprised, but the criteria we used were mostly "contributed benchmarks showing some sort of real workload". I don't think we ever *rejected* a benchmark, barring one case that was very variable and not very interesting (depending on HD performance). Some benchmarks were developed from "we know pypy is slow on this" scenarios as well.
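[Editor's note: the "too variable" rejection criterion above can be made mechanical. A minimal sketch, with a hypothetical `is_too_variable` helper and an assumed 10% cutoff (neither is from the actual suite):]

```python
import statistics

def is_too_variable(timings, threshold=0.10):
    """Flag a benchmark whose run-to-run timings vary too much.

    timings: wall-clock times (seconds) from repeated runs of one benchmark.
    threshold: maximum acceptable coefficient of variation (assumed cutoff).
    """
    mean = statistics.mean(timings)
    cv = statistics.stdev(timings) / mean  # coefficient of variation
    return cv > threshold

# A steady benchmark passes; a disk-bound one with wild swings does not.
print(is_too_variable([1.00, 1.02, 0.99, 1.01]))  # -> False
print(is_too_variable([1.0, 2.5, 0.7, 3.1]))      # -> True
```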
Yeah, you and Alex have told me that in-person before.
The important part is that we also want "interesting" benchmarks to be included. This mostly means "run by someone somewhere", which includes a very broad category of things but *excludes* fibonacci, richards, pystone and stuff like this. I think it's fine if we have a benchmark that runs the Python 3 version of whatever is there, but this requires work. Is there someone willing to do that work?
Right, I'm not suggesting something as silly as fibonacci.
I think we first need to decide which set of benchmarks we are using;
there is already divergence between what is on hg.python.org and what is measured at speed.pypy.org (e.g. hg.python.org tests 2to3 while speed.pypy.org does not, and the reverse goes for Twisted). Once we know what set of benchmarks we care about (it can be a cross-section), then we need to take a hard look at where we are coming up short for Python 3. But from a python-dev perspective, benchmarks running against Python 2 are not interesting since we are simply no longer developing performance improvements for Python 2.7.
2to3 is essentially an oversight on the pypy side; we'll integrate it back. Other than that I think the pypy benchmarks are mostly a superset (there is also pickle and a bunch of pointless microbenchmarks).
I think pickle was mostly for unladen's pickle performance patches (try saying that three times fast =), so I don't really care about that one.
Would it make sense to change the pypy repo to make the unladen_swallow directory an external repo from hg.python.org/benchmarks? Because as it stands right now there are two mako benchmarks that are not identical. Otherwise we should talk at PyCon and figure this all out before we end up with two divergent benchmark suites that are being independently maintained (since we are all going to be running the same benchmarks on speed.python.org).
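[Editor's note: the "external repo" idea above maps to Mercurial's subrepository feature. A minimal sketch of what a `.hgsub` entry in the pypy benchmarks repo might look like (the path and URL here are assumptions, not the repos' actual layout):]

```
unladen_swallow = http://hg.python.org/benchmarks
```

Committing after adding this would also record a pinned revision in `.hgsubstate`, so both projects would track one copy of each benchmark (e.g. a single mako benchmark) instead of two drifting copies.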