Hi Paolo,<br><br> I am aware of the problem with calculating benchmark means, but let me 

explain my point of view.<br><br>You are correct that absolute times would be preferable. You can actually have them, but see what happens: <a href="http://speed.pypy.org/comparison/?hor=true&amp;bas=none&amp;chart=stacked+bars">http://speed.pypy.org/comparison/?hor=true&amp;bas=none&amp;chart=stacked+bars</a><br>

<br>Absolute values would only work if we had carefully chosen the benchmark runtimes to be very similar (for our cpython baseline). As it is, html5lib, spitfire and spitfire_cstringio completely dominate the cumulative time, not because the interpreter is faster or slower on them, but because those benchmarks were arbitrarily designed to run that long. Any improvement in the long-running benchmarks will carry much more weight than one in the short-running ones.<br>
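<br>To make the weighting issue concrete, here is a minimal sketch. The runtimes are made-up numbers, not measurements from speed.pypy.org, and the helper <code>total_after_speedup</code> is a hypothetical name used only for this illustration. It compares how much the summed absolute time shrinks when a long benchmark or a short one becomes twice as fast.<br>

```python
# Hypothetical runtimes in seconds; NOT measurements from speed.pypy.org.
baseline = {"html5lib": 12.0, "spitfire": 8.0, "float": 0.5, "nbody": 0.5}


def total_after_speedup(times, name, factor):
    """Return the summed runtime if benchmark `name` runs `factor` times faster."""
    return sum(t / factor if n == name else t for n, t in times.items())


total = sum(baseline.values())

# Relative reduction of the summed time for the same 2x speedup,
# applied once to a long benchmark and once to a short one.
long_gain = 1 - total_after_speedup(baseline, "html5lib", 2) / total
short_gain = 1 - total_after_speedup(baseline, "float", 2) / total

print(f"2x faster html5lib: total drops by {long_gain:.1%}")
print(f"2x faster float:    total drops by {short_gain:.1%}")
```

With numbers like these, the same 2x speedup moves the total far more when it lands on the long benchmark, which is the distortion described above.<br>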

<br>What is more useful is to have comparable slices of time, so that improvements can be seen in relative terms over time. Normalizing does that, I think. It just says: we have 21 tasks which each take 1 second to run on interpreter X (cpython in the default case). Then we see how other executables compare to that. What would the geometric mean achieve here, exactly, for the end user?<br>
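<br>As a rough sketch of what the normalization does, assuming made-up runtimes and only three benchmarks instead of 21: every baseline slice becomes 1.0, and each other executable gets one slice per benchmark relative to it. The function names <code>normalize</code> and <code>geometric_mean</code> are hypothetical and are not taken from the site's code; the geometric mean is computed only to show the single number it would produce next to the per-benchmark slices.<br>

```python
import math

# Hypothetical runtimes in seconds; illustrative only, not real results.
runtimes = {
    "cpython": {"ai": 0.4, "html5lib": 12.0, "nbody": 0.5},
    "pypy-c-jit": {"ai": 0.2, "html5lib": 9.0, "nbody": 0.05},
}


def normalize(runtimes, baseline):
    """Divide every runtime by the baseline executable's runtime for that benchmark."""
    base = runtimes[baseline]
    return {
        exe: {bench: t / base[bench] for bench, t in times.items()}
        for exe, times in runtimes.items()
    }


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


slices = normalize(runtimes, "cpython")

for exe, per_bench in slices.items():
    stacked = sum(per_bench.values())          # height of the stacked bar
    gmean = geometric_mean(per_bench.values())  # one summary ratio
    print(exe, per_bench, f"stacked={stacked:.2f}", f"geomean={gmean:.3f}")
```

The stacked bar keeps one visible slice per benchmark, while the geometric mean collapses them into one ratio against the baseline.<br>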

<br>I am not really calculating any mean. You can see that I carefully avoided displaying any kind of total bar, which would indeed incur the problem you mention. That a stacked chart implicitly displays a total is something you cannot avoid, and for that kind of chart I still think normalized results are visually the best option.<br>

<br>Still, I would very much like to read the paper you cite, but it requires a login.<br><br>Cheers,<br>Miquel<br><br><br><div class="gmail_quote">2010/6/25 Paolo Giarrusso <span dir="ltr">&lt;<a href="mailto:p.giarrusso@gmail.com">p.giarrusso@gmail.com</a>&gt;</span><br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Hi!<br>

First, I want to restate the obvious, before pointing out what I think<br>

is a mistake: your work on this website is great and very useful!<br>

<br>

On Fri, Jun 25, 2010 at 13:08, Miquel Torres &lt;<a href="mailto:tobami@googlemail.com">tobami@googlemail.com</a>&gt; wrote:<br>

&gt; - stacked bars<br>

Here you are summing up normalized times, which is more or less like<br>

taking their arithmetic average. And that doesn&#39;t work at all: in many<br>

cases you can &quot;show&quot; completely different results by normalizing<br>

relatively to another item. Even the simple question &quot;who is faster?&quot;<br>

can be answered in different ways.

So you should use the geometric mean, even if this is not so widely<br>

known. Or better, it is known by benchmarking experts, but it&#39;s<br>

difficult to become so.<br>

<br>

Please, have a look at the short paper:<br>

&quot;How not to lie with statistics: the correct way to summarize benchmark results&quot;<br>

<a href="http://scholar.google.com/scholar?cluster=1051144955483053492&amp;hl=en&amp;as_sdt=2000" target="_blank">http://scholar.google.com/scholar?cluster=1051144955483053492&amp;hl=en&amp;as_sdt=2000</a><br>

I downloaded it from the ACM library, please tell me if you can&#39;t find it.<br>

<div class="im"><br>

&gt; horizontal(<a href="http://speed.pypy.org/comparison/?hor=true&amp;bas=2%2B35&amp;chart=stacked+bars" target="_blank">http://speed.pypy.org/comparison/?hor=true&amp;bas=2%2B35&amp;chart=stacked+bars</a>):<br>

&gt; This is not meant to &quot;demonstrate&quot; that overall the jit is over two times<br>

&gt; faster than cpython. It is just another way for a developer to picture how<br>

&gt; long a programme would take to complete if it were composed of 21 such<br>

&gt; tasks.<br>

<br>

</div>You are not summing up absolute times, so your claim is incorrect. And<br>

the error is significant, given the above paper.<br>

A sum of absolute times would provide what you claim.<br>

<div class="im"><br>

&gt; You can see that cpython&#39;s (the normalization chosen) benchmarks all<br>

&gt; take 1 &quot;relative&quot; second.<br>

</div>Here, for instance, I see that CPython and pypy-c take more or less<br>

the same time, which surprises me (since the PyPy interpreter was<br>

known to be slower than CPython). But given that the result is<br>

invalid, it may well be an artifact of your statistics.<br>

<div class="im"><br>

&gt; pypy-c needs more or less the same time, some<br>

&gt; &quot;tasks&quot; being slower and some faster. Psyco shows an interesting picture:<br>

&gt; From meteor-contest downwards (fortuitously), all benchmarks are extremely<br>

&gt; &quot;compressed&quot;, which means they are speeded up by psyco quite a lot. But any<br>

&gt; further speed up wouldn&#39;t make overall time much shorter because the first<br>

&gt; group of benchmarks now takes most of the time to complete. pypy-c-jit is a<br>

&gt; more extreme case of this: If the jit accelerated all &quot;fast&quot; benchmarks to 0<br>

&gt; seconds (infinitely fast), it would only get about twice as fast as now<br>

&gt; because ai, slowspitfire, spambayes and twisted_tcp now need half the entire<br>

&gt; execution time. A good demonstration of &quot;you are only as fast as your<br>

&gt; slowest part&quot;. Of course the aggregate of all benchmarks is not a real app,<br>

&gt; but it is still fun.<br>

<br>

</div>This could maybe be still true, at least in part, but you have to do<br>

this reasoning on absolute times.<br>

<br>

Best regards, and keep up the good work!<br>

<font color="#888888">--<br>

Paolo Giarrusso - Ph.D. Student<br>

<a href="http://www.informatik.uni-marburg.de/%7Epgiarrusso/" target="_blank">http://www.informatik.uni-marburg.de/~pgiarrusso/</a><br>

</font></blockquote></div><br>