Re: [pypy-dev] New speed.pypy.org version

Hi Miquel,

I'm quite busy (because of a paper deadline next Tuesday), sorry for not answering earlier.

I was just struck by an idea: there is a stacked bar plot whose total bar is related to the geometric mean, so that it is normalization-invariant. But this graph _is_ complicated. What if it is a stacked plot of _logarithms_ of performance ratios? That way, the complete stacked bar shows the logarithm of the product of the ratios rather than their sum, i.e. the log of (geometric mean)^N rather than N times their arithmetic mean, and log of (geometric mean)^N = N * log of the geometric mean.

Some simple maths (I didn't write it out, so please recheck!) seems to show that plotting a + b*log(ratio) instead of log(ratio) still gives a fair comparison: the total becomes N*a + b*N*log(geomean) = \Theta(log(geomean)). You need a and b because when the ratio is 1, log(1) is zero and the segment would vanish (b is the representation scale, which is always there anyway).

About your workaround: I would like a table with the geometric mean of the ratios, where we get the real global performance ratio among the interpreters. As long as the results of your solution do not contradict that _real_ table, it should be a reasonable workaround (but I would embed the check in the code - otherwise other projects _will_ be bitten by it). I would probably also want the website to offer such a table to users, and a graph of the overall performance ratio over time (actually, over revisions).

Finally, the docs of your web application should at the very least reference the paper and this conversation (if there is a public archive of the mailing list, as I think there is), and ideally explain the issue.

Sorry if this is too dense - if I was unclear, please tell me and I'll answer next week.

Best regards,
Paolo
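To make the idea concrete, here is a minimal sketch with made-up times and a = b = 1 (nothing below is Codespeed code; interpreter names and numbers are purely illustrative). Each stacked segment is a + b*log(time/baseline_time), so the difference between two bars' totals, and hence their ordering, does not depend on the chosen baseline:

    import math

    # Made-up times (seconds) per benchmark and interpreter.
    times = {
        "cpython":  {"ai": 2.0, "html5lib": 12.0, "spitfire": 9.0},
        "pypy-jit": {"ai": 2.5, "html5lib": 4.0,  "spitfire": 3.0},
    }

    def segments(exe, baseline, a=1.0, b=1.0):
        # Height of each stacked segment: a + b*log(performance ratio).
        return {bench: a + b * math.log(times[exe][bench] / times[baseline][bench])
                for bench in times[exe]}

    def total(exe, baseline, a=1.0, b=1.0):
        # Total bar height = N*a + b*N*log(geometric mean of the ratios).
        return sum(segments(exe, baseline, a, b).values())

    def geomean_ratio(exe, baseline):
        ratios = [times[exe][b] / times[baseline][b] for b in times[exe]]
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    for base in ("cpython", "pypy-jit"):
        diff = total("cpython", base) - total("pypy-jit", base)
        # The difference between the two totals is the same for any baseline,
        # and depends only on the geomean of the cpython/pypy-jit ratios.
        print(base, round(diff, 3), round(geomean_ratio("cpython", "pypy-jit"), 3))

Note that with a = 1 and b = 1, any benchmark more than e times faster than the baseline still produces a negative segment, which is why a and b have to be chosen with the actual spread of ratios in mind.

On Mon, Jun 28, 2010 at 11:21, Miquel Torres <tobami@googlemail.com> wrote: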
Hi Paolo,
I read the paper, very interesting. It is perfectly clear that to calculate a normalized total only the geometric mean makes sense.
However, a stacked bars plot shows the individual benchmarks, so its total is implicitly an arithmetic mean. The only solution (apart from removing the stacked charts and offering only total bars) is the weighted approach.
External weights are not very practical, though. Codespeed is used by other projects, so an extra option would need to be added to the settings to allow introducing arbitrary weights for benchmarks. A bit cumbersome. I have an idea that may work: take the weights from a defined baseline so that the run times become equal, which is the same as normalizing to a baseline. It would be the same as now, only that you can't choose the normalization; results will be weighted (normalized) according to the default baseline (which you can already configure in the settings).
You may say that it is still an arithmetic mean, but there won't be conflicting results because there is only a single normalization. For PyPy that would be CPython, and everything would make sense. I know it is a workaround, not a solution. If you think it is a bad idea, the only other possibility is not to have stacked bars (as in "showing individual benchmarks"). But I find them useful. Yes, you can see the individual benchmark results better in the normal bars chart, but there you don't see visually which benchmarks take the biggest part of the pie, which helps show what parts of your program most need improving.
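For what it's worth, a minimal sketch of what the workaround amounts to, with made-up numbers (this is not actual Codespeed code): normalizing every result to the default baseline is the same as weighting each benchmark by 1/baseline_time, so each benchmark contributes exactly one unit to the baseline's bar.

    # Made-up CPython (baseline) and pypy-c times in seconds.
    baseline_times  = {"ai": 2.0, "html5lib": 12.0, "spitfire": 9.0}
    candidate_times = {"ai": 2.5, "html5lib": 4.0,  "spitfire": 3.0}

    # Each stacked segment is a performance ratio; the baseline bar is all 1.0s.
    normalized = {bench: candidate_times[bench] / baseline_times[bench]
                  for bench in baseline_times}

    print(normalized)                # the individual segments
    print(sum(normalized.values()))  # the implicit total: an arithmetic sum of ratios

The total is still an arithmetic sum of ratios, which is the part that needs the cross-check against a geometric-mean table mentioned above.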
What do you think?
Regards, Miquel
2010/6/25 Paolo Giarrusso <p.giarrusso@gmail.com>:
On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tobami@googlemail.com> wrote:
Hi Paolo,
I am aware of the problem with calculating benchmark means, but let me explain my point of view.
You are correct in that it would be preferable to have absolute times. Well, you actually can, but see what happens: http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
Ahah! I didn't notice that I could skip normalization! This does not fully invalidate my point, however.
Absolute values would only work if we had carefully chosen benchmark runtimes to be very similar (for our cpython baseline). As it is, html5lib, spitfire and spitfire_cstringio completely dominate the cummulative time.
I acknowledge that (btw, it should be cumulative time, with one 'm', both here and in the website).
And not because the interpreter is faster or slower, but because the benchmark was arbitrarily designed to run that long. Any improvement in the long-running benchmarks will carry much more weight than in the short-running ones.
What is more useful is to have comparable slices of time so that the improvements can be seen relatively over time.
If you want to sum up times (but at this point, I see no reason for it), you should rather have externally derived weights, as suggested by the paper (in Rule 3). As soon as you take weights from the data, a lot of the maths you need stops working - that's generally true in many cases in statistics. And the only sensible way to get external weights is to gather them from real-world programs. Since that's not going to happen easily, just stick with the geometric mean. Or set an arbitrarily low weight, manually, without any math, so that the long-running benchmarks stop dominating the result. It's no fraud, since the current graph is less valid anyway.
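As a tiny illustration (the weights and times below are invented, not measured from any real workload), "externally derived weights" means the mix comes from outside the benchmark data:

    # Hypothetical workload mix, estimated from real-world usage, not from the measurements.
    weights = {"ai": 0.5, "html5lib": 0.3, "spitfire": 0.2}
    times   = {"ai": 2.0, "html5lib": 12.0, "spitfire": 9.0}   # measured times (made up here)

    # With external weights, the weighted sum is a meaningful "time for that workload".
    weighted_total = sum(weights[b] * times[b] for b in times)
    print(weighted_total)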
Normalizing does that, I think.
Not really.
It just says: we have 21 tasks which take 1 second to run each on interpreter X (cpython in the default case). Then we see how other executables compare to that. What would the geometric mean achieve here, exactly, for the end user?
You actually need the geomean to do that. Don't forget that the geomean is still a mean: it's a mean performance ratio which averages individual performance ratios. If PyPy's geomean is 0.5, it means that PyPy is going to run that task in 10.5 seconds instead of 21. To me, this sounds exactly like what you want to achieve. Moreover, it actually works, unlike what you use.
For instance, ignore PyPy-JIT, and look only at CPython and pypy-c (no JIT). Then, switch the normalization between the two:
http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
With the current data, you get that in one case CPython is faster and in the other pypy-c is faster. That cannot happen with the geomean. This is the point of the paper.
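With invented numbers the flip is easy to reproduce (again, just a sketch, not data from speed.pypy.org):

    import math

    times = {
        "cpython": {"ai": 1.0, "spitfire": 10.0},
        "pypy-c":  {"ai": 2.0, "spitfire": 5.0},
    }

    def arith_total(exe, base):
        # Sum of times normalized to the chosen baseline (what the stacked bar totals).
        return sum(times[exe][b] / times[base][b] for b in times[exe])

    def geomean(exe, base):
        ratios = [times[exe][b] / times[base][b] for b in times[exe]]
        return math.exp(sum(map(math.log, ratios)) / len(ratios))

    print(arith_total("cpython", "cpython"), arith_total("pypy-c", "cpython"))
    # -> 2.0 2.5 : CPython "wins" when CPython is the baseline
    print(arith_total("cpython", "pypy-c"), arith_total("pypy-c", "pypy-c"))
    # -> 2.5 2.0 : pypy-c "wins" when pypy-c is the baseline
    print(geomean("pypy-c", "cpython"), geomean("cpython", "pypy-c"))
    # -> 1.0 1.0 : the geomean calls them equivalent, whichever way you normalize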
I could even construct a normalization baseline $base such that CPython seems faster than PyPy-JIT. Such a base should be very fast on, say, ai (where PyPy-JIT happens to be slower), so that $cpython.ai/$base.ai becomes 100 and $pypyjit.ai/$base.ai becomes 200, and very slow on the other benchmarks (so that they disappear in the sum).
So, the only difference I see is that the geomean works and the arithmetic mean doesn't. That's why Real Benchmarkers use the geomean.
Moreover, you are making a mistake quite common among non-physicists. What you say makes sense under the implicit assumption that dividing two times gives something you can still use as a time. When you say "PyPy's runtime for a 1 second task", you actually want to talk about a performance ratio, not about a time. In the same way, when you say "this bird runs 3 meters in one second", a physicist would sum that up as "3 m/s" rather than "3 m".
I am not really calculating any mean. You can see that I carefully avoided displaying any kind of total bar, which would indeed run into the problem you mention. That a stacked chart implicitly displays a total is something you cannot avoid, and for that kind of chart I still think normalized results are visually the best option.
But on a stacked bars graph, I'm not going to look at individual bars at all, just at the total: it's actually less convenient than in "normal bars" to look at the result of a particular benchmark.
I hope I can find guidelines against stacked plots; I have a PhD colleague who is reading up on how to make graphs.
Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/

Hi Paolo,

Hey! I think it is a great idea. With logs you get both: correct normalized totals AND the ability to display the individual stacked series, which necessarily add arithmetically. But it strikes me: hasn't anyone written a paper about that method already, or at least documented it?

Anyway, I need to check that the math is right (hopefully today), and then I would go and implement it. I'll tell you how it goes.

Cheers,
Miquel

On Fri, Jul 2, 2010 at 09:27, Miquel Torres <tobami@googlemail.com> wrote:
Hi Paolo,
hey! I think it is a great idea. With logs you get both: correct normalized totals AND the ability to display the individual stacked series, which necessarily add arithmetically. But it strikes me, hasn't anyone written a paper about that method already? or at least documented it?
I guess the problem is that the graph is weird enough, and that you need arbitrary a and b to make it work, since the logarithm can become negative and arbitrarily large in magnitude (log 0 = -inf). I still think it's fair and makes sense, but it's somewhat hard to sell.
Anyway I need to check that the math is right (hopefully today), and then I would go and implement it. I'll tell you how it goes.
Cheers, Miquel
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/