
As some of you already know, there is a new performance site online: speed.pypy.org. Last autumn I offered my help with a benchmarking website, and eventually with a new main site for the project. What you can see now is a work in progress, which I hope to improve over the next months. I called it Codespeed, and in time it could become a framework for other open source projects to use (for example JavaScript implementation development, connected to a forge, etc.). Right now, speed has two goals:
- To give the devs a powerful tool for analysing performance and detecting regressions in the different PyPy interpreters.
- To improve the visibility of the project by letting Python developers (users) look at current performance themselves and compare PyPy to other implementations, without having to wait for a dev to manually put the graphs together and post them on the blog.

To that end, there are currently two views. The Overview does what its name says: it gives a developer a broad view of how recent revisions affect performance, so regressions can be spotted immediately. If you want to examine the performance of a given benchmark more closely, clicking on it takes you to the timeline for that benchmark. Both views will get more features, and I have further plans down the pipeline:
- A Comparison tab: bar graphs directly comparing the performance of different releases (PyPy 1.2, trunk, CPython 2.x, Unladen Swallow, Jython, etc.)
- A Statistics tab: go through the regression-testing literature to see which metrics are worth showing.
- RSS feed/email alerts (for big regressions)
- svn info integration
- etc.

In the end it is a site for the PyPy devs, so I hope to get a lot of feedback from you so that it caters exactly to your needs. Have a nice day! Miquel

PS: When it is a bit more complete I'll write a post for the PyPy status blog so it becomes more widely known.

Hi Miquel, On 02/25/2010 04:10 PM, Miquel Torres wrote:
I'm quite impressed, this is very cool work and a good improvement over the current plots. Thanks for doing this! One thing that I would really find useful are error bars in the timeline view. This would help us judge whether up-and-down movements are within the margin of randomness, or whether it is a real effect. I don't know how annoying they are to implement though, no clue whether the plot lib supports that. There should be enough information about errors in the json files, as far as I remember. Cheers, Carl Friedrich

Hi Carl, error bars are certainly doable and supported by the plot lib. The only problem is that my backend model doesn't include such a field, so at the moment the data (which is certainly included in the json files) is not being saved. I will need to change the DB models in order to accommodate that kind of information. I'll add this to the wish list. Cheers, Miquel 2010/2/25 Carl Friedrich Bolz <cfbolz@gmx.de>:
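As an illustration of the missing piece, a minimal sketch of the kind of record the site would need to store. The field names are hypothetical, not the real Codespeed schema; the point is only that the deviation already present in the runner's json output needs a place of its own.

# Hypothetical result record; names are illustrative, not the Codespeed schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkResult:
    benchmark: str
    revision: str
    value: float                     # measured benchmark time in seconds
    std_dev: Optional[float] = None  # new field: deviation reported by the runner


# Example record as it might be built while importing the runner's json output.
result = BenchmarkResult(benchmark="richards", revision="r71500",
                         value=1.90, std_dev=0.03)
print(result)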

On 02/25/2010 11:39 PM, Jacob Hallén wrote:
Great, thanks.
How do you actually determine the error? Multiple runs or sliding average of historic data?
We use the Unladen Swallow benchmark runner, which runs each benchmark multiple times and computes the standard deviation. It then computes a 95% confidence interval using, I think, Student's t distribution. Cheers, Carl Friedrich
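For reference, a minimal sketch of that scheme (not the actual Unladen Swallow runner code): run the benchmark several times, compute the sample standard deviation, and derive a 95% confidence interval for the mean from Student's t distribution. The timings are made up, and scipy is assumed for the t quantile.

# Sketch of the multiple-runs + Student's t confidence interval approach.
import statistics
from scipy.stats import t


def confidence_interval(times, confidence=0.95):
    """Return (mean, half_width) of the confidence interval for the mean."""
    n = len(times)
    mean = statistics.mean(times)
    std_dev = statistics.stdev(times)            # sample standard deviation (n - 1)
    t_crit = t.ppf((1 + confidence) / 2, n - 1)  # two-sided critical value
    half_width = t_crit * std_dev / n ** 0.5
    return mean, half_width


# Example: five runs of one benchmark, in seconds.
runs = [1.92, 1.87, 1.95, 1.90, 1.89]
mean, err = confidence_interval(runs)
print(f"{mean:.3f}s +/- {err:.3f}s (95% CI)")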

You may also consider that a benchmark that varies greatly between runs may be a flawed benchmark. I think it should be considered, but only on the running side, acting accordingly (too high a deviation: discard the run, reconsider the benchmark, reconsider the environment, or whatever). Miquel 2010/2/26 Stefan Behnel <stefan_ml@behnel.de>:

Miquel Torres, 26.02.2010 11:05:
Right, there might even have been a cron job running at the same time. There are various reasons why benchmark numbers can vary. Especially in a JIT environment, you'd normally expect the benchmark numbers to decrease over a certain time, or to stay constantly high for a while, then show a peak when the compiler kicks in, and then continue at a lower level (e.g. with the Sun JVM's hotspot JIT or incremental JIT compilers in general). I assume that the benchmarking machinery handles this, but it's yet another reason why highly differing timings can occur within a single run, and why it's only the best run that really matters. You could even go one step further: ignore deviating results in the history graph and only present them when they are still reproducible (preferably with the same source revision) an hour later. Stefan
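A small sketch of how per-run timings make this warm-up behaviour visible when run under a JIT such as PyPy's: the workload below is a stand-in, not one of the suite's benchmarks. Early iterations should be slower, with a possible spike when compilation kicks in, before the times settle at a lower level.

# Time each repetition separately so warm-up effects show up, instead of
# reporting only a single total.
import time


def workload():
    total = 0
    for i in range(200_000):
        total += i * i
    return total


timings = []
for _ in range(20):
    start = time.time()
    workload()
    timings.append(time.time() - start)

for i, elapsed in enumerate(timings):
    print(f"run {i:2d}: {elapsed * 1000:6.2f} ms")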

On 02/26/2010 10:59 AM, Stefan Behnel wrote:
I disagree with this, please read the following paper: http://buytaert.net/files/oopsla07-georges.pdf Cheers, Carl Friedrich

Carl Friedrich Bolz wrote:
I disagree with this, please read the following paper:
+1 for this paper. I was about to link it as well, but cf has been faster. I looked at the unladen swallow runner when I was writing my thesis, and I can confirm that it does the "statistically right" thing as described in the paper to compute error bars. ciao, Anto

The paper is right, and the unladen swallow runner does the right thing. What I meant was: use the statistically right method (like we are doing now!), but don't show deviation bars if the deviation is acceptable. Check after the run whether the deviation is "acceptable"; if it isn't, rerun later, check that nothing in the background is affecting performance, re-evaluate the reproducibility of the given benchmark, etc. But it doesn't change the fact that speed could save the deviation data for later use. What do you think? Miquel 2010/2/26 Antonio Cuni <anto.cuni@gmail.com>:
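A rough sketch of that kind of post-run check (the function and the 5% threshold are hypothetical, not anything speed.pypy.org implements): keep the statistically computed deviation, but flag a run for re-running or investigation when its relative deviation is too large to count as "acceptable".

# Flag runs whose relative deviation exceeds a chosen threshold.
def check_run(mean, std_dev, max_relative_dev=0.05):
    """Return True if the run looks stable enough to publish as-is."""
    relative = std_dev / mean if mean else float("inf")
    if relative > max_relative_dev:
        print(f"deviation {relative:.1%} > {max_relative_dev:.0%}: "
              "rerun later, check background load, reconsider the benchmark")
        return False
    return True


check_run(mean=1.90, std_dev=0.25)   # noisy: flagged
check_run(mean=1.90, std_dev=0.02)   # stable: accepted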

On 02/26/2010 12:30 PM, Miquel Torres wrote:
I think an important point is that the deviations don't need to come from anything in the background. We don't use threads in the benchmarks (yet), which would obviously insert non-determinism, but even currently there is enough randomness in the interpreter itself. The GC can start at different points, the JIT could decide (late in the process) that something else should be compiled, there are cache effects, etc. This randomness is not a bad thing, but I think we should try to at least evaluate it, by showing the error bars. We should do that even if the errors are small, because that is a good result worth mentioning. I guess around 20 or even 10 years ago you could attribute a "correct" running time to a program, but nowadays there is noise on all levels of the system and it is not really possible to ignore that. Also, there are a lot more levels now, too :-).
But it doesn't change the fact that speed could save the deviation data for later use.
Yip. Cheers, Carl Friedrich

Carl Friedrich Bolz, 26.02.2010 11:25:
It's sad that the paper doesn't try to understand *why* others use different ways to benchmark. They even admit at the end that their statistical approach is only really interesting when the differences are small enough, not mentioning at that point that the system must be complex enough also, such as the Sun JVM. However, if the differences are small and the benchmarked system is complex, it's best to question the benchmark in the first place, rather than the statistics that lead to its results. Anyway, I agree that, given the complexity of at least some of the benchmarks in the suite, and given the requirement to do continuous benchmarking to find both small and large differences, taking statistically relevant run lines makes sense. Stefan

On 02/26/2010 01:30 PM, Stefan Behnel wrote:
I guess those others should write their own papers (or blog posts or whatever) :-). If you know any well-written ones, I would be very interested.
In my opinion there are probably not really many non-complex systems around nowadays, at least if we are talking about typical "desktop" systems. There is also a lot of noise on the CPU level, with caches, out of order execution, not even talking about the OS. And while PyPy is not quite as complex as a JVM, it is certainly moving in this direction. So even if your benchmark itself is a simple piece of Python code, the whole system that you invoke is still complex. Cheers, Carl Friedrich

On Fri, Feb 26, 2010 at 9:59 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
hi, For some sets of problems, the first run is very important. For example, where you only want to process the particular data once. For others the Worst performance is the most important. For example 'real time' like programs which require the runtime only takes at maximum N where N is measured. Running something 10 times and removing outliers fails for these two important cases. cya.

On Mon, Mar 1, 2010 at 10:15 AM, René Dudfield <renesd@gmail.com> wrote:
um, this made no sense at all... oops! Sorry, let me try again... For others the Worst performance is the most important. For example 'real time' like programs which require the runtime only takes at maximum N. Where N is an allocated time budget for that code.
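A tiny illustration of that point, with made-up numbers: if one run out of five blows the time budget N, reporting only the best run (or discarding the outlier) hides exactly the case a 'real time' style program cares about.

# Worst-case view of a set of runs against an allocated time budget N.
runs = [0.010, 0.011, 0.010, 0.052, 0.011]   # seconds; one slow outlier
budget = 0.020                                # allocated time budget N

best = min(runs)
worst = max(runs)
print(f"best {best * 1000:.1f} ms, worst {worst * 1000:.1f} ms, "
      f"budget {budget * 1000:.1f} ms")
print("within budget on every run:", worst <= budget)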

participants (7)
- Antonio Cuni
- Carl Friedrich Bolz
- Jacob Hallén
- Miquel Torres
- René Dudfield
- Stefan Behnel
- Stephen Thorne