Re: [pypy-dev] New speed.pypy.org version

Hi Miquel,

I'm quite busy (because of a paper deadline next Tuesday), sorry for not answering earlier.

I was just struck by an idea: there is a stacked bar plot whose total bar is related to the geometric mean, so that it is normalization-invariant. But this graph _is_ complicated. What if it is a stacked plot of _logarithms_ of performance ratios? That way, the complete stacked bar shows the logarithm of the product of the ratios rather than their sum, i.e. the log of (geometric mean)^N rather than N times their arithmetic mean, and log of (geometric mean)^N = N * log of the geometric mean.

Some simple maths (I didn't write it out, so please recheck!) seems to show that plotting a + b*log(ratio) instead of log(ratio) still gives a fair comparison: the total becomes N*a + b*N*log(geomean) = \Theta(log(geomean)). You need a and b because when the ratio is 1, log(1) is zero and the segment would vanish (b is the representation scale, which is always there anyway).

About your workaround: I would like a table with the geometric mean of the ratios, where we get the real global performance ratio among the interpreters. As long as the results of your solution do not contradict that _real_ table, it should be a reasonable workaround (but I would embed the check in the code - otherwise other projects _will_ be bitten by it). I would probably also want the website to offer such a table to users, and a graph of the overall performance ratio over time (actually, over revisions).

Finally, the docs of your web application should at the very least reference the paper and this conversation (if there is a public archive of the mailing list, as I think there is), and ideally explain the issue.

Sorry if this is too dense - if I was unclear, please tell me and I'll answer next week.

Best regards,
Paolo
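To make the idea concrete, here is a minimal sketch with made-up times and a = b = 1 (nothing below is Codespeed code; interpreter names and numbers are purely illustrative). Each stacked segment is a + b*log(time/baseline_time), so the difference between two bars' totals, and hence their ordering, does not depend on the chosen baseline:

    import math

    # Made-up times (seconds) per benchmark and interpreter.
    times = {
        "cpython":  {"ai": 2.0, "html5lib": 12.0, "spitfire": 9.0},
        "pypy-jit": {"ai": 2.5, "html5lib": 4.0,  "spitfire": 3.0},
    }

    def segments(exe, baseline, a=1.0, b=1.0):
        # Height of each stacked segment: a + b*log(performance ratio).
        return {bench: a + b * math.log(times[exe][bench] / times[baseline][bench])
                for bench in times[exe]}

    def total(exe, baseline, a=1.0, b=1.0):
        # Total bar height = N*a + b*N*log(geometric mean of the ratios).
        return sum(segments(exe, baseline, a, b).values())

    def geomean_ratio(exe, baseline):
        ratios = [times[exe][b] / times[baseline][b] for b in times[exe]]
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    for base in ("cpython", "pypy-jit"):
        diff = total("cpython", base) - total("pypy-jit", base)
        # The difference between the two totals is the same for any baseline,
        # and depends only on the geomean of the cpython/pypy-jit ratios.
        print(base, round(diff, 3), round(geomean_ratio("cpython", "pypy-jit"), 3))

Note that with a = 1 and b = 1, any benchmark more than e times faster than the baseline still produces a negative segment, which is why a and b have to be chosen with the actual spread of ratios in mind.

On Mon, Jun 28, 2010 at 11:21, Miquel Torres <tobami@googlemail.com> wrote: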
Hi Paolo,
I read the paper, very interesting. It is perfectly clear that to calculate a normalized total only the geometric mean makes sense.
However, a stacked bars plot shows the individual benchmarks, so its total is implicitly an arithmetic mean. The only solution (apart from removing the stacked charts and offering only total bars) is the weighted approach.
External weights are not very practical, though. Codespeed is used by other projects, so an extra option would need to be added to the settings to allow introducing arbitrary weights for benchmarks. A bit cumbersome. I have an idea that may work: take the weights from a defined baseline so that the run times become equal, which is the same as normalizing to a baseline. It would be the same as now, only that you can't choose the normalization; results will be weighted (normalized) according to the default baseline (which you can already configure in the settings).
You may say that it is still an arithmetic mean, but there won't be conflicting results because there is only a single normalization. For PyPy that would be CPython, and everything would make sense. I know it is a workaround, not a solution. If you think it is a bad idea, the only other possibility is not to have stacked bars (as in "showing individual benchmarks"). But I find them useful. Yes, you can see the individual benchmark results better in the normal bars chart, but there you don't see visually which benchmarks take the biggest part of the pie, which helps show what parts of your program most need improving.
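For what it's worth, a minimal sketch of what the workaround amounts to, with made-up numbers (this is not actual Codespeed code): normalizing every result to the default baseline is the same as weighting each benchmark by 1/baseline_time, so each benchmark contributes exactly one unit to the baseline's bar.

    # Made-up CPython (baseline) and pypy-c times in seconds.
    baseline_times  = {"ai": 2.0, "html5lib": 12.0, "spitfire": 9.0}
    candidate_times = {"ai": 2.5, "html5lib": 4.0,  "spitfire": 3.0}

    # Each stacked segment is a performance ratio; the baseline bar is all 1.0s.
    normalized = {bench: candidate_times[bench] / baseline_times[bench]
                  for bench in baseline_times}

    print(normalized)                # the individual segments
    print(sum(normalized.values()))  # the implicit total: an arithmetic sum of ratios

The total is still an arithmetic sum of ratios, which is the part that needs the cross-check against a geometric-mean table mentioned above.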
What do you think?
Regards, Miquel
2010/6/25 Paolo Giarrusso <p.giarrusso@gmail.com>:
On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tobami@googlemail.com> wrote:
Hi Paolo,
I am aware of the problem with calculating benchmark means, but let me explain my point of view.
You are correct in that it would be preferable to have absolute times. Well, you actually can, but see what happens: http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
Ahah! I didn't notice that I could skip normalization! This does not fully invalidate my point, however.
Absolute values would only work if we had carefully chosen benchmark runtimes to be very similar (for our cpython baseline). As it is, html5lib, spitfire and spitfire_cstringio completely dominate the cummulative time.
I acknowledge that (btw, it should be cumulative time, with one 'm', both here and in the website).
And not because the interpreter is faster or slower, but because the benchmark was arbitrarily designed to run that long. Any improvement in the long-running benchmarks will carry much more weight than in the short-running ones.
What is more useful is to have comparable slices of time so that the improvements can be seen relatively over time.
If you want to sum up times (but at this point, I see no reason for it), you should rather have externally derived weights, as suggested by the paper (in Rule 3). As soon as you take weights from the data, a lot of the maths you need stops working - that's generally true in many cases in statistics. And the only sensible way to get external weights is to gather them from real-world programs. Since that's not going to happen easily, just stick with the geometric mean. Or set an arbitrarily low weight, manually, without any math, so that the long-running benchmarks stop dominating the result. It's no fraud, since the current graph is less valid anyway.
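As a tiny illustration (the weights and times below are invented, not measured from any real workload), "externally derived weights" means the mix comes from outside the benchmark data:

    # Hypothetical workload mix, estimated from real-world usage, not from the measurements.
    weights = {"ai": 0.5, "html5lib": 0.3, "spitfire": 0.2}
    times   = {"ai": 2.0, "html5lib": 12.0, "spitfire": 9.0}   # measured times (made up here)

    # With external weights, the weighted sum is a meaningful "time for that workload".
    weighted_total = sum(weights[b] * times[b] for b in times)
    print(weighted_total)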
Normalizing does that, I think.
Not really.
It just says: we have 21 tasks which take 1 second to run each on interpreter X (cpython in the default case). Then we see how other executables compare to that. What would the geometric mean achieve here, exactly, for the end user?
You actually need the geomean to do that. Don't forget that the geomean is still a mean: it's a mean performance ratio which averages individual performance ratios. If PyPy's geomean is 0.5, it means that PyPy is going to run that task in 10.5 seconds instead of 21. To me, this sounds exactly like what you want to achieve. Moreover, it actually works, unlike what you use.
For instance, ignore PyPy-JIT, and look only at CPython and pypy-c (no JIT). Then, switch the normalization between the two:
http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
With the current data, you get that in one case CPython is faster and in the other pypy-c is faster. That cannot happen with the geomean. This is the point of the paper.
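With invented numbers the flip is easy to reproduce (again, just a sketch, not data from speed.pypy.org):

    import math

    times = {
        "cpython": {"ai": 1.0, "spitfire": 10.0},
        "pypy-c":  {"ai": 2.0, "spitfire": 5.0},
    }

    def arith_total(exe, base):
        # Sum of times normalized to the chosen baseline (what the stacked bar totals).
        return sum(times[exe][b] / times[base][b] for b in times[exe])

    def geomean(exe, base):
        ratios = [times[exe][b] / times[base][b] for b in times[exe]]
        return math.exp(sum(map(math.log, ratios)) / len(ratios))

    print(arith_total("cpython", "cpython"), arith_total("pypy-c", "cpython"))
    # -> 2.0 2.5 : CPython "wins" when CPython is the baseline
    print(arith_total("cpython", "pypy-c"), arith_total("pypy-c", "pypy-c"))
    # -> 2.5 2.0 : pypy-c "wins" when pypy-c is the baseline
    print(geomean("pypy-c", "cpython"), geomean("cpython", "pypy-c"))
    # -> 1.0 1.0 : the geomean calls them equivalent, whichever way you normalize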
I could even construct a normalization baseline $base such that CPython seems faster than PyPy-JIT. Such a base should be very fast on, say, ai (where PyPy-JIT happens to be slower), so that $cpython.ai/$base.ai becomes 100 and $pypyjit.ai/$base.ai becomes 200, and very slow on the other benchmarks (so that they disappear in the sum).
So, the only difference I see is that the geomean works and the arithmetic mean doesn't. That's why Real Benchmarkers use the geomean.
Moreover, you are making a mistake quite common among non-physicists. What you say makes sense under the implicit assumption that dividing two times gives something you can still use as a time. When you say "PyPy's runtime for a 1 second task", you actually want to talk about a performance ratio, not about a time. In the same way, when you say "this bird runs 3 meters in one second", a physicist would sum that up as "3 m/s" rather than "3 m".
I am not really calculating any mean. You can see that I carefully avoided displaying any kind of total bar, which would indeed run into the problem you mention. That a stacked chart implicitly displays a total is something you cannot avoid, and for that kind of chart I still think normalized results are visually the best option.
But on a stacked bars graph, I'm not going to look at individual bars at all, just at the total: it's actually less convenient than in "normal bars" to look at the result of a particular benchmark.
I hope I can find guidelines against stacked plots; I have a PhD colleague who is reading up on how to make graphs.
Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/

Hi Paolo,

Hey! I think it is a great idea. With logs you get both: correct normalized totals AND the ability to display the individual stacked series, which necessarily add arithmetically. But it strikes me: hasn't anyone written a paper about that method already, or at least documented it?

Anyway, I need to check that the math is right (hopefully today), and then I would go and implement it. I'll tell you how it goes.

Cheers,
Miquel

On Fri, Jul 2, 2010 at 09:27, Miquel Torres <tobami@googlemail.com> wrote:
Hi Paolo,
hey! I think it is a great idea. With logs you get both: correct normalized totals AND the ability to display the individual stacked series, which necessarily add arithmetically. But it strikes me, hasn't anyone written a paper about that method already? or at least documented it?
I guess the problem is that the graph is weird enough, and that you need arbitrary a and b to make it work, since the logarithm can become negative and arbitrarily large in magnitude (log 0 = -inf). I still think it's fair and makes sense, but it's somewhat hard to sell.
Anyway I need to check that the math is right (hopefully today), and then I would go and implement it. I'll tell you how it goes.
Cheers, Miquel
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/