On Tue, Jan 31, 2012 at 13:44, Maciej Fijalkowski <fijall@gmail.com> wrote:

On Tue, Jan 31, 2012 at 7:40 PM, Brett Cannon <brett@python.org> wrote:
>
>
> On Tue, Jan 31, 2012 at 11:58, Paul Graydon <paul@paulgraydon.co.uk> wrote:
>>
>>
>>> And this is a fundamental issue with tying benchmarks to real
>>> applications and libraries; if the code the benchmark relies on never
>>> changes to Python 3, then the benchmark is dead in the water. As Daniel
>>> pointed out, if spitfire simply never converts then either we need to
>>> convert them ourselves *just* for the benchmark (yuck), live w/o the
>>> benchmark (ok, but if this happens to a bunch of benchmarks then we are
>>> going to not have a lot of data), or we look at making new benchmarks based
>>> on apps/libraries that _have_ made the switch to Python 3 (which means
>>> trying to agree on some new set of benchmarks to add to the current set).
>>>
>>>
>> What are the criteria by which the original benchmark sets were chosen?
>> I'm assuming it was because they're generally popular libraries amongst
>> developers across a variety of purposes, so speed.pypy would show the speed
>> of regular tasks?
>
>
> That's the reason Unladen Swallow chose them, yes. PyPy then adopted them
> and added in the Twisted benchmarks.
>
>>
>> If so, presumably it shouldn't be too hard to find appropriate libraries
>> for Python 3?
>
>
> Perhaps, but someone has to put in the effort to find those benchmarks, code
> them up, show how they are a reasonable workload, and then get them
> accepted. Everyone likes the current set because the Unladen Swallow team
> put a lot of time and effort into selecting and creating those benchmarks.

I think we also spent a significant amount of time grabbing various
benchmarks from various places (we = people who contributed to the
speed.pypy.org benchmark suite, which is by no means a group consisting
only of PyPy devs).

Where does the PyPy benchmark code live, anyway?

You might be surprised, but the criteria we used were mostly
"contributed benchmarks showing some sort of real workload". I don't
think we ever *rejected* a benchmark, barring one case that was very
variable and not very interesting (it depended on hard-disk performance).
Some benchmarks were developed from "we know PyPy is slow on this"
scenarios as well.

Yeah, you and Alex have told me that in-person before.

The important part is that we also want "interesting" benchmarks to be
included. This mostly means "run by someone somewhere", which includes
a very broad category of things but *excludes* fibonacci, richards,
pystone and the like. I think it's fine if we have a benchmark
that runs a Python 3 version of whatever is there, but this requires
work. Is there someone willing to do that work?

Right, I'm not suggesting something as silly as fibonacci.

I think we first need to decide which set of benchmarks we are using,
since there is already divergence between what is on hg.python.org and
what is measured at speed.pypy.org (e.g. hg.python.org tests 2to3 while
speed.pypy.org does not; the reverse goes for Twisted). Once we know what
set of benchmarks we care about (it can be a cross-section), we need to
take a hard look at where we are coming up short for Python 3. But from a
python-dev perspective, benchmarks running against Python 2 are not
interesting, since we are simply no longer developing performance
improvements for Python 2.7.