Proposal for a common benchmark suite

Hello, as announced in my GSoC proposal, I'd like to present the benchmarks I will use for the benchmark suite I'll be working on this summer.
As of now there are two benchmark suites (that I know of) which receive some attention: the one developed as part of the PyPy project [1], which is used for http://speed.pypy.org, and the one initially developed for Unladen Swallow, which has since been continued by CPython [2]. The PyPy suite contains a lot of interesting benchmarks, some written explicitly for it; the CPython benchmarks have an extensive set of microbenchmarks in the pybench package, as well as the previously mentioned modifications made to the Unladen Swallow benchmarks.
I'd like to "simply" merge both suites so that no changes are lost. However, I'd like to leave out the waf benchmark, which is part of the PyPy suite; its removal was proposed on pypy-dev because of obvious deficits [3]. It will be easier to add a better benchmark later than to replace this one after the fact.
Unless there is a major issue with this plan, I'd like to go forward with it.
.. [1] https://bitbucket.org/pypy/benchmarks
.. [2] http://hg.python.org/benchmarks
.. [3] http://mailrepository.com/pypy-dev.codespeak.net/msg/3627509/

DasIch, 28.04.2011 20:55:
the CPython benchmarks have an extensive set of microbenchmarks in the pybench package
Try not to care too much about pybench. There is some value in it, but some of its microbenchmarks are also tied to CPython's interpreter behaviour. For example, the benchmarks for literals can easily be considered dead code by other Python implementations so that they may end up optimising the benchmarked code away completely, or at least partially. That makes a comparison of the results somewhat pointless.
Stefan

Stefan Behnel wrote:
Try not to care too much about pybench. There is some value in it, but some of its microbenchmarks are also tied to CPython's interpreter behaviour. [...]
The point of the micro benchmarks in pybench is to be able to compare them one by one, not by looking at the sum of the tests.
If one implementation optimizes away some parts, then the comparison will show this fact very clearly - and that's the whole point.
Taking the sum of the micro benchmarks only has some meaning as a very rough indicator of improvement. That's why I wrote pybench: to get a better, more detailed picture of what's happening, rather than trying to find some way of measuring "average" use.
This "average" is very different depending on where you look: for some applications method calls may be very important, for others arithmetic operations, and yet others may have more need for fast attribute lookups.

M.-A. Lemburg, 28.04.2011 22:23:
The point of the micro benchmarks in pybench is to be able to compare them one by one, not by looking at the sum of the tests. If one implementation optimizes away some parts, then the comparison will show this fact very clearly - and that's the whole point. [...]
I wasn't talking about "averages" or "sums", and I also wasn't trying to put down pybench in general. As it stands, it makes sense as a benchmark for CPython.
However, I'm arguing that a substantial part of it does not make sense as a benchmark for PyPy and others. With Cython, I couldn't get some of the literal arithmetic benchmarks to run at all. The runner script simply bails out with an error when a benchmark accidentally runs faster than the initial empty calibration loop. I imagine that PyPy would eventually even drop the loop itself, thus leaving nothing to compare. Does that tell us that PyPy is faster than Cython for arithmetic? I don't think it does.
When I see that a benchmark shows that one implementation runs in 100% less time than another, I simply go *shrug* and look for a better benchmark to compare the two.
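To make that failure mode concrete, here is a simplified sketch of the calibration idea (my own illustration, not the actual pybench code): the reported time is the test loop minus an empty calibration loop, and an optimising implementation can push that difference down to zero or below.

```python
import time

ROUNDS = 1000000

def calibrate():
    # empty loop whose time is subtracted from the real test below
    start = time.perf_counter()
    for _ in range(ROUNDS):
        pass
    return time.perf_counter() - start

def test_literal_arithmetic():
    # constant folding / dead code elimination can make this loop body free
    start = time.perf_counter()
    for _ in range(ROUNDS):
        2 + 3
        2 + 3
        2 + 3
    return time.perf_counter() - start

if __name__ == "__main__":
    net = test_literal_arithmetic() - calibrate()
    # on an optimising implementation this can come out <= 0, which is
    # exactly the condition that makes the runner bail out
    print("net time: %.6f s" % net)
```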
Stefan

On Thu, Apr 28, 2011 at 11:10 PM, Stefan Behnel stefan_ml@behnel.de wrote:
[...] When I see that a benchmark shows that one implementation runs in 100% less time than another, I simply go *shrug* and look for a better benchmark to compare the two.
I second what Stefan says here. This sort of benchmark might be useful for CPython, but it is not particularly useful for PyPy or for comparisons (or for any other implementation that tries harder to optimize stuff away). For example, a method call in PyPy would be inlined and completely removed if the method is empty, so the benchmark does not measure method call overhead at all. That's why we settled on medium-to-large examples, where the result is more of an average over possible scenarios than a single one.
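To give a concrete example of the kind of loop a tracing JIT collapses (an illustrative sketch, not a benchmark taken from either suite):

```python
import time

class C:
    def method(self):
        # empty body: a JIT can inline the call and then drop it entirely
        pass

def call_method(n=10**7):
    obj = C()
    start = time.perf_counter()
    for _ in range(n):
        obj.method()  # on a tracing JIT the whole loop may optimise away
    return time.perf_counter() - start

if __name__ == "__main__":
    # On CPython this measures call overhead; on PyPy it mostly measures
    # an (almost) empty loop, so comparing the two numbers is misleading.
    print("%.3f s" % call_method())
```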

Maciej Fijalkowski wrote:
I second what Stefan says here. This sort of benchmark might be useful for CPython, but it is not particularly useful for PyPy or for comparisons (or for any other implementation that tries harder to optimize stuff away). [...]
If CPython were to start incorporating any specialising optimisations, pybench wouldn't be much use for CPython. The Unladen Swallow folks didn't like pybench as a benchmark.

Mark Shannon wrote:
If CPython were to start incorporating any specialising optimisations, pybench wouldn't be much use for CPython. The Unladen Swallow folks didn't like pybench as a benchmark.
This is all true, but I think there's a general misunderstanding of what pybench is.
I wrote pybench in 1997 when I was working on optimizing the Python 1.5 implementation for use in a web application server.
At the time we had pystone, and that was a really poor benchmark for determining whether certain optimizations in the Python VM and compiler made sense or not.
pybench was then improved and extended over the course of several years and then added to Python 2.5 in 2006.
The benchmark is written as a framework for micro benchmarks, based on the assumption of a non-optimizing (byte code) compiler.
As such it may or may not work with an optimizing compiler. The calibration part would likely have to be disabled for an optimizing compiler (run with -C 0), and a new set of benchmark tests would have to be added: ones that test the Python implementation at a higher level than the existing tests.
That last part is something people tend to forget: pybench is not a monolithic application with a predefined and fixed set of tests. It's a framework that can be extended as needed.
All you have to do is add a new module with test classes and import it in Setup.py.
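For example, a new test module could look roughly like this (an untested sketch that assumes the usual pybench conventions of version/operations/rounds attributes plus paired test() and calibrate() methods; the class name and values here are made up for illustration):

```python
# Hypothetical extra pybench test module (illustrative sketch only).
from pybench import Test

class SimpleAttributeLookup(Test):

    version = 2.0        # version number of this particular test
    operations = 5 * 3   # attribute lookups performed per loop iteration
    rounds = 100000      # loop iterations run by test() and calibrate()

    def test(self):
        class C:
            a = 1
        o = C()
        for i in range(self.rounds):
            o.a; o.a; o.a
            o.a; o.a; o.a
            o.a; o.a; o.a
            o.a; o.a; o.a
            o.a; o.a; o.a

    def calibrate(self):
        # same loop without the benchmarked lookups; its time is
        # subtracted from test() to isolate the operation being measured
        class C:
            a = 1
        o = C()
        for i in range(self.rounds):
            pass
```

Setup.py would then gain a matching star import of the new module so the runner picks the tests up.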

On 29/04/2011 11:04, M.-A. Lemburg wrote:
This is all true, but I think there's a general misunderstanding of what pybench is.
pybench proved useful for IronPython. It certainly highlighted some performance problems with some of the basic operations it measures.
All the best,
Michael Foord

Given those facts I think including pybench is a mistake. It does not allow for a fair or meaningful comparison between implementations, which is one of the things the suite is supposed to be used for in the future.
This easily leads to misinterpretation of the results from this particular benchmark, and it negatively affects the performance data as a whole.
The same applies to several Unladen Swallow microbenchmarks, such as bm_call_method_*, bm_call_simple and bm_unpack_sequence.

DasIch wrote:
Given those facts I think including pybench is a mistake. It does not allow for a fair or meaningful comparison between implementations, which is one of the things the suite is supposed to be used for in the future. [...]
I don't think we should exclude any implementation-specific benchmarks from a common suite.
They will not necessarily allow for comparisons between implementations, but will provide important information about the progress made in optimizing a particular implementation.

On Fri, 29 Apr 2011 14:29:46 +0200 DasIch dasdasich@googlemail.com wrote:
Given those facts I think including pybench is a mistake. [...]
"Including" is quite vague. pybench is "included" in the suite of benchmarks at hg.python.org, but that doesn't mean it is given any particular importance: you can select whichever benchmarks you want to run when "perf.py" is executed (there are even several predefined benchmark groups, none of which pybench is a member IIRC).
Regards
Antoine.
Participants (7):
- Antoine Pitrou
- DasIch
- M.-A. Lemburg
- Maciej Fijalkowski
- Mark Shannon
- Michael Foord
- Stefan Behnel