
Hi all! I want to announce a new version of the benchmarks site speed.pypy.org. After about six months, it finally shows the vision I had for such a website: useful for PyPy developers, but also for the general public following PyPy's or even other Python implementations' development. On to the changes.

There are now three views: "Changes", "Timeline" and "Comparison". The Overview was renamed to Changes, and its inline plot bars were removed because you can get the exact same plot in the Comparison view now (and then some). The Timeline got a selectable baseline and "humanized" date labels for the x axis. The new Comparison view allows, well, comparing "competing" interpreters, which will also be of interest to the wider Python community (especially if we can add Unladen Swallow, IronPython and Jython results). Two examples of interesting comparisons are:

- relative bars (http://speed.pypy.org/comparison/?bas=2%2B35&chart=relative+bars): here we see that the JIT is faster than Psyco in all cases except spambayes and slowspitfire, where the JIT cannot make up for pypy-c's abysmal performance. Interestingly, in the only other case where the JIT is slower than CPython, the ai benchmark, Psyco performs even worse.

- stacked bars, horizontal (http://speed.pypy.org/comparison/?hor=true&bas=2%2B35&chart=stacked+bars): this is not meant to "demonstrate" that overall the JIT is over two times faster than CPython. It is just another way for a developer to picture how long a program would take to complete if it were composed of 21 such tasks. You can see that CPython's benchmarks (the normalization chosen) all take 1 "relative" second. pypy-c needs more or less the same time, some "tasks" being slower and some faster. Psyco shows an interesting picture:
I hope you find the new version useful, and as always any feedback is welcome. Cheers! Miquel

Hi Miquel, On 25/06/10 13:08, Miquel Torres wrote:
Hi all!,
[cut]
I hope you find the new version useful, and as always any feedback is welcome.
well... what to say? I simply like it *a lot*. Thank you :-) I especially think that the "Changes" view is very useful for us developers, in particular the fact that you can see the log of all the revisions that affected the change: it is something that we did tons of times manually, so it's nice to see it automated :-). ciao, Anto

Hi! First, I want to restate the obvious, before pointing out what I think is a mistake: your work on this website is great and very useful! On Fri, Jun 25, 2010 at 13:08, Miquel Torres <tobami@googlemail.com> wrote:
Please, have a look at the short paper: "How not to lie with statistics: the correct way to summarize benchmark results" http://scholar.google.com/scholar?cluster=1051144955483053492&hl=en&as_sdt=2000 I downloaded it from the ACM library, please tell me if you can't find it.
You are not summing up absolute times, so your claim is incorrect, and the error is significant, given the above paper. A sum of absolute times would provide what you claim. You also write that pypy-c needs more or less the same time as CPython, which surprises me (since the PyPy interpreter was known to be slower than CPython); but given that the result is invalid, it may well be an artifact of your statistics.
This may still be true, at least in part, but you have to do this reasoning on absolute times. Best regards, and keep up the good work! -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/

Hey Paolo. While in general I agree with you - this is not exactly science - I still think it gives outsiders some impression of what's going on. Inside, we still look mostly at particular benchmarks. I'm not sure whether a metric that looks convoluted (at least to normal people) would help when summarizing; maybe it would. Speaking a bit on Miquel's behalf: feel free to implement this as a feature on codespeed (it's an open source project after all), you can fork it on github http://github.com/tobami/codespeed. Cheers, fijal On Fri, Jun 25, 2010 at 6:07 AM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:

On Fri, Jun 25, 2010 at 17:53, Maciej Fijalkowski <fijall@gmail.com> wrote:
Inside, we still look mostly at particular benchmarks.
But if a change improves one benchmark and worsens another, you need some summary.
The geometric mean is just a mean. And it's the only way to get an average performance ratio: how much faster is PyPy than CPython on these benchmarks, all considered with equal weight? If you want, put just this on the graph, and the term "geometric mean" in a note.
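[Editor's aside: a minimal Python sketch of the geometric mean of per-benchmark performance ratios. The function name and the ratios below are invented for illustration; they are not speed.pypy.org data.]

import math

def geometric_mean(ratios):
    # ratios are candidate_time / baseline_time, one per benchmark;
    # values below 1.0 mean the candidate is faster than the baseline.
    # Computing it in log space avoids overflow when there are many factors.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

ratios = [0.25, 0.5, 2.0, 0.8]   # hypothetical per-benchmark ratios
print(geometric_mean(ratios))    # ~0.67: on average ~1.5x faster, every benchmark weighted equally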
Of course I can, so your answer is valid, but that's not my plan, sorry - the difference between the effort needed from me and from him is huge. If I had time to spend, I'd hack on PyPy itself instead. Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/

Hi Paolo, I am aware of the problem with calculating benchmark means, but let me explain my point of view. You are correct that it would be preferable to have absolute times. Well, you actually can, but see what happens: http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars

Absolute values would only work if we had carefully chosen benchmark runtimes to be very similar (for our cpython baseline). As it is, html5lib, spitfire and spitfire_cstringio completely dominate the cummulative time, and not because the interpreter is faster or slower, but because the benchmark was arbitrarily designed to run that long. Any improvement in the long-running benchmarks will carry much more weight than one in the short-running ones.

What is more useful is to have comparable slices of time so that the improvements can be seen relatively over time. Normalizing does that, I think. It just says: we have 21 tasks which take 1 second to run each on interpreter X (cpython in the default case). Then we see how other executables compare to that. What would the geometric mean achieve here, exactly, for the end user?

I am not really calculating any mean. You can see that I carefully avoided displaying any kind of total bar, which would indeed run into the problem you mention. That a stacked chart implicitly displays a total is something you cannot avoid, and for that kind of chart I still think normalized results are visually the best option.

Still, I would very much like to read the paper you cite, but you need a login for it. Cheers, Miquel 2010/6/25 Paolo Giarrusso <p.giarrusso@gmail.com>
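[Editor's aside: a minimal sketch, with invented timings, of the normalization described above - every benchmark is rescaled so that the baseline interpreter takes exactly 1 "relative" second, and the other executables are shown against that.]

# Hypothetical timings in seconds; none of these are real speed.pypy.org results.
times = {
    "cpython":  {"ai": 0.5, "html5lib": 12.0, "spambayes": 0.3},
    "pypy-jit": {"ai": 0.6, "html5lib": 4.0, "spambayes": 0.4},
}

baseline = times["cpython"]
normalized = {
    exe: {bench: t / baseline[bench] for bench, t in results.items()}
    for exe, results in times.items()
}

# normalized["cpython"] is 1.0 for every benchmark ("each task takes 1 relative
# second"); normalized["pypy-jit"] shows how long the same tasks take on PyPy,
# e.g. roughly {'ai': 1.2, 'html5lib': 0.33, 'spambayes': 1.33} here.
print(normalized["pypy-jit"])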

On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tobami@googlemail.com> wrote:
Ahah! I didn't notice that I could skip normalization! This does not fully invalidate my point, however.
I acknowledge that (btw, it should be cumulative time, with one 'm', both here and in the website).
What is more useful is to have comparable slices of time so that the improvements can be seen relatively over time.
If you want to sum up times (but at this point, I see no reason for it), you should rather have externally derived weights, as suggested by the paper (in Rule 3). As soon as you take the weights from the data itself, a lot of the maths you need stops working - that's generally true in statistics. And the only sensible way to get external weights is to gather them from real-world programs. Since that's not going to happen easily, just stick with the geometric mean. Or set an arbitrarily low weight, manually, without any maths, so that the long-running benchmarks stop dominating the result. It's no fraud, since the current graph is less valid anyway.
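[Editor's aside: a sketch of what externally derived weights could look like. The weights below are invented, not taken from the paper or from any measurement; in practice they would have to reflect how often such workloads occur in real programs.]

# Hypothetical hand-chosen weights, fixed *before* looking at the measured data.
external_weights = {"ai": 1.0, "html5lib": 0.1, "spambayes": 1.0}

def weighted_total(abs_times, weights):
    # Weighted sum of absolute runtimes: long-running benchmarks no longer
    # dominate the total just because they were designed to run longer.
    return sum(weights[b] * t for b, t in abs_times.items())

print(weighted_total({"ai": 0.5, "html5lib": 12.0, "spambayes": 0.3}, external_weights))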
You actually need the geomean to do that. Don't forget that the geomean is still a mean: it's a mean performance ratio which averages the individual performance ratios. If PyPy's geomean is 0.5, it means that PyPy is going to run that task in 10.5 seconds instead of 21. To me, this sounds exactly like what you want to achieve. Moreover, it actually works, unlike what you use.

For instance, ignore PyPy-JIT and look only at CPython and pypy-c (no JIT). Then change the normalization between the two: http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars With the current data, you get that in one case cpython is faster, in the other pypy-c is faster. That can't happen with the geomean; this is the point of the paper. I could even construct a normalization baseline $base such that CPython seems faster than PyPy-JIT. Such a base would have to be very fast on, say, ai (where the JIT is slower than CPython), so that $cpython.ai/$base.ai becomes 100 and $pypyjit.ai/$base.ai becomes 200, and very slow on the other benchmarks (so that they disappear in the sum). So, the only difference I see is that the geomean works and the arithmetic mean doesn't. That's why Real Benchmarkers use the geomean.

Moreover, you are making a mistake quite common among non-physicists. What you say makes sense under the implicit assumption that dividing two times gives something you can use as a time. When you say "PyPy's runtime for a 1 second task", you actually want to talk about a performance ratio, not about a time. In the same way, when you say "this bird runs 3 meters in one second", a physicist would sum that up as "3 m/s" rather than "3 m".
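[Editor's aside: a small Python sketch of the flipping effect just described. The two executables and their times are invented, not the actual speed.pypy.org data; the point is that the sum of normalized times changes ranking depending on which executable you normalize to, while the geometric mean of the ratios does not.]

import math

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical absolute times (seconds) for two executables on two benchmarks.
a = {"ai": 1.0, "html5lib": 10.0}
b = {"ai": 2.0, "html5lib": 5.0}

for label, base in (("normalized to a", a), ("normalized to b", b)):
    na = [a[k] / base[k] for k in a]   # a's normalized times
    nb = [b[k] / base[k] for k in b]   # b's normalized times
    # The sums say "a is faster" with one baseline and "b is faster" with the
    # other; the geomeans agree (here: a tie) regardless of the baseline.
    print(label, "sums:", sum(na), sum(nb),
          "geomeans:", round(geomean(na), 3), round(geomean(nb), 3))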
But on a stacked-bars graph I'm not going to look at the individual bars at all, just at the total: it's actually less convenient than "normal bars" for looking at the result of a particular benchmark. I hope I can find guidelines against stacked plots; I have a PhD colleague reading up on how to make graphs. Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/

Hi Paolo, well, you are right of course. I had forgotten about the real problem, which you actually demonstrate quite well with your CPython and pypy-c case: depending on the normalization you can make any stacked series look faster than the others. I will have a look at the literature and modify normalized stacked plots accordingly. Thanks for taking the time to explain things in such detail. Regards, Miquel 2010/6/25 Paolo Giarrusso <p.giarrusso@gmail.com>

Hey. A somewhat more important problem: the results seem to be messed up. I think there is something wrong with the baselines. Look here: http://buildbot.pypy.org/builders/jit-benchmark-linux-x86-32/builds/358/step... on twisted_tcp vs http://speed.pypy.org/timeline/?exe=1,3&base=2%2B35&ben=twisted_tcp&env=tannit&revs=200 On Fri, Jun 25, 2010 at 5:08 AM, Miquel Torres <tobami@googlemail.com> wrote:

Hey fijal, the baseline problem you mention only happens with some benchmarks, so I will risk the guess that the cpython results currently present are not from the latest run, and that in the case you point out (only for twisted_tcp) it changed quite a bit. If the next results overwrite cpython's results, we'll see whether that has been the case. 2010/6/25 Maciej Fijalkowski <fijall@gmail.com>

participants (4):
- Antonio Cuni
- Maciej Fijalkowski
- Miquel Torres
- Paolo Giarrusso