[pypy-dev] New speed.pypy.org version
p.giarrusso at gmail.com
Wed Jun 30 10:11:33 CEST 2010
I'm quite busy (because of a paper deadline next Tuesday), sorry for
not answering earlier.
I was just struck by an idea: there is a stacked bar plot where the
total bar is related to the geometric mean, such that it is
normalization-invariant. But this graph _is_ complicated.
It is a stacked plot of _logarithms_ of performance ratios? This way,
the complete stacked bar shows the logarithm of the product, rather
than their sum, i.e. the log of the (geometric mean)^N rather than
their arithmetic mean. log of the (geometric mean)^N = N*log of the
Some simple maths (I didn't write it out, so please recheck!) seems to
show that showing (a+b*log (ratio)), instead of log(ratio), gives
still a fair comparison, obtaining N*a+b*N*log(geomean) =
\Theta(log(geomean)). You need to put a and b because showing if the
ratio is 1, log(1) is zero (b is the representation scale which is
About your workaround: I would like a table with the geometric mean of
the ratios, where we get the real global performance ratio among the
interpreters. As far as the results of your solution do not contradict
that _real_ table, it should be a reasonable workaround (but I would
embed the check in the code - otherwise other projects _will be_
bitten by that). Probably, I would like the website to offer such a
table to users, and I would like a graph of the overall performance
ratio over time (actually revisions).
Finally, the docs of your web application should at the very least
reference the paper and this conversation (if there's a public archive
of the ML, as I think), and ideally explain the issue.
Sorry for being too dense, maybe - if I was unclear, please tell me
and I'll answer next week.
On Mon, Jun 28, 2010 at 11:21, Miquel Torres <tobami at googlemail.com> wrote:
> Hi Paolo,
> I read the paper, very interesting. It is perfectly clear that to
> calculate a normalized total only the geometric mean makes sense.
> However, a stacked bars plot shows the individual benchmarks so it
> implicitly is an arithmetic mean. The only solution (apart from
> removing the stacked charts and only offering total bars) is the
> weighted approach.
> External weights are not very practical though. Codespeed is used by
> other projects so an extra option would need to be added to the
> settings to allow the introducing of arbitrary weights to benchmarks.
> A bit cumbersome. I have an idea that may work. Take the weights from
> a defined baseline so that the run times are equal, which is the same
> as normalizing to a baseline. It would be the same as now, only that
> you can't choose the normalization, it will be weighted (normalized)
> according the default baseline (which you already can already
> configure in the settings).
> You may say that it is still an arithmetic mean, but there won't be
> conflicting results because there is only a single normalization. For
> PyPy that would be cpython, and everything would make sense.
> I know it is a work around, not a solution. If you think it is a bad
> idea, the only other possibility is not to have stacked bars (as in
> "showing individual benchmarks"). But I find them useful. Yes you can
> see the individual benchmark results better in the normal bars chart,
> but there you don't see visually which benchmarks take the biggest
> part of the pie, which helps visualize what parts of your program need
> most improving.
> What do you think?
> 2010/6/25 Paolo Giarrusso <p.giarrusso at gmail.com>:
>> On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tobami at googlemail.com> wrote:
>>> Hi Paolo,
>>> I am aware of the problem with calculating benchmark means, but let me
>>> explain my point of view.
>>> You are correct in that it would be preferable to have absolute times. Well,
>>> you actually can, but see what it happens:
>> Ahah! I didn't notice that I could skip normalization! This does not
>> fully invalidate my point, however.
>>> Absolute values would only work if we had carefully chosen benchmaks
>>> runtimes to be very similar (for our cpython baseline). As it is, html5lib,
>>> spitfire and spitfire_cstringio completely dominate the cummulative time.
>> I acknowledge that (btw, it should be cumulative time, with one 'm',
>> both here and in the website).
>>> And not because the interpreter is faster or slower but because the
>>> benchmark was arbitrarily designed to run that long. Any improvement in the
>>> long running benchmarks will carry much more weight than in the short
>>> What is more useful is to have comparable slices of time so that the
>>> improvements can be seen relatively over time.
>> If you want to sum up times (but at this point, I see no reason for
>> it), you should rather have externally derived weights, as suggested
>> by the paper (in Rule 3).
>> As soon as you take weights from the data, lots of maths that you need
>> is not going to work any more - that's generally true in many cases in
>> And the only way making sense to have external weights is to gather
>> them from real world programs. Since that's not going to happen
>> easily, just stick with the geometric mean. Or set an arbitrarily low
>> weight, manually, without any math, so that the long-running
>> benchmarks stop dominating the res. It's no fraud, since the current
>> graph is less valid anyway.
>>> Normalizing does that i
>> Not really.
>>> It just says: we have 21 tasks which take 1 second to run each on
>>> interpreter X (cpython in the default case). Then we see how other
>>> executables compare to that. What would the geometric mean achieve here,
>>> exactly, for the end user?
>> You actually need the geomean to do that. Don't forget that the
>> geomean is still a mean: it's a mean performance ratio which averages
>> individual performance ratios.
>> If PyPy's geomean is 0.5, it means that PyPy is going to run that task
>> in 11.5 seconds instead of 21. To me, this sounds exactly like what
>> you want to achieve. Moreover, it actually works, unlike what you use.
>> For instance, ignore PyPy-JIT, and look only CPython and pypy-c (no
>> JIT). Then, change the normalization among the two:
>> with the current data, you get that in one case cpython is faster, in
>> the other pypy-c is faster.
>> It can't happen with the geomean. This is the point of the paper.
>> I could even construct a normalization baseline $base such that
>> CPython seems faster than PyPy-JIT. Such a base should be very fast
>> on, say, ai (where CPython is slower), so that $cpython.ai/$base.ai
>> becomes 100 and $pypyjit.ai/$base.ai becomes 200, and be very slow on
>> other benchmarks (so that they disappear in the sum).
>> So, the only difference I see is that geomean works, arithm. mean
>> doesn't. That's why Real Benchmarkers use geomean.
>> Moreover, you are making a mistake quite common among non-physicists.
>> What you say makes sense under the implicit assumption that dividing
>> two times gives something you can use as a time. When you say "Pypy's
>> runtime for a 1 second task", you actually want to talk about a
>> performance ratio, not about the time. In the same way as when you say
>> "this bird runs 3 meters long in one second", a physicist would sum
>> that up as "3 m/s" rather than "3 m".
>>> I am not really calculating any mean. You can see that I carefully avoided
>>> to display any kind of total bar which would indeed incur in the problem you
>>> mention. That a stacked chart implicitly displays a total is something you
>>> can not avoid, and for that kind of chart I still think normalized results
>>> is visually the best option.
>> But on a stacked bars graph, I'm not going to look at individual bars
>> at all, just at the total: it's actually less convenient than in
>> "normal bars" to look at the result of a particular benchmark.
>> I hope I can find guidelines against stacked plots, I have a PhD
>> colleague reading on how to make graphs.
>> Best regards
>> Paolo Giarrusso - Ph.D. Student
Paolo Giarrusso - Ph.D. Student
More information about the Pypy-dev