[pypy-dev] New speed.pypy.org version

Miquel Torres tobami at googlemail.com
Fri Jun 25 19:08:23 CEST 2010


Hi Paolo,

I am aware of the problem with calculating benchmark means, but let me
explain my point of view.

You are correct that it would be preferable to have absolute times. You
actually can, but see what happens:
http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars

Absolute values would only work if we had carefully chosen benchmark
runtimes to be very similar (for our cpython baseline). As it is, html5lib,
spitfire and spitfire_cstringio completely dominate the cumulative time,
not because the interpreter is faster or slower, but because those
benchmarks were arbitrarily designed to run that long. Any improvement in
the long-running benchmarks will carry much more weight than in the
short-running ones.
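The effect described above can be sketched with a few invented numbers (the
benchmark names are real, but the runtimes below are illustrative, not actual
speed.pypy.org data):

```python
# Hypothetical absolute runtimes in seconds; the benchmark names are real,
# but these numbers are invented for illustration.
baseline = {"ai": 0.5, "meteor-contest": 0.5, "html5lib": 12.0, "spitfire": 9.0}

def total_after_speedup(times, bench, factor):
    """Total runtime if a single benchmark is sped up by the given factor."""
    improved = dict(times)
    improved[bench] = times[bench] / factor
    return sum(improved.values())

total = sum(baseline.values())  # 22.0 s

# A 2x speedup on a short benchmark barely moves the cumulative total...
print(total - total_after_speedup(baseline, "ai", 2))        # 0.25
# ...while the same 2x speedup on a long one dominates the change.
print(total - total_after_speedup(baseline, "html5lib", 2))  # 6.0
```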

What is more useful is to have comparable slices of time so that the
improvements can be seen relatively over time. Normalizing does that, I
think. It just says: we have 21 tasks which take 1 second each to run on
interpreter X (cpython in the default case). Then we see how other
executables compare to that. What would the geometric mean achieve here,
exactly, for the end user?
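As a small sketch of what this normalization means in practice (all runtimes
below are invented for illustration):

```python
# Invented absolute runtimes (seconds) per benchmark and interpreter.
cpython = {"ai": 4.0, "html5lib": 12.0, "spitfire": 9.0}
pypy_c_jit = {"ai": 2.0, "html5lib": 3.0, "spitfire": 4.5}

# Normalizing to cpython turns every cpython bar into exactly 1 "relative
# second", and every other interpreter's bar into a ratio against it.
normalized = {b: pypy_c_jit[b] / cpython[b] for b in cpython}
print(normalized)  # {'ai': 0.5, 'html5lib': 0.25, 'spitfire': 0.5}

# The baseline normalized against itself is 1 on every benchmark.
print({b: cpython[b] / cpython[b] for b in cpython})
```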

I am not really calculating any mean. You can see that I carefully avoided
displaying any kind of total bar, which would indeed run into the problem
you mention. That a stacked chart implicitly displays a total is something
you cannot avoid, and for that kind of chart I still think normalized
results are visually the best option.

Still, I would very much like to read the paper you cite, but you need a
login for it.

Cheers,
Miquel


2010/6/25 Paolo Giarrusso <p.giarrusso at gmail.com>

> Hi!
> First, I want to restate the obvious, before pointing out what I think
> is a mistake: your work on this website is great and very useful!
>
> On Fri, Jun 25, 2010 at 13:08, Miquel Torres <tobami at googlemail.com>
> wrote:
> > - stacked bars
> Here you are summing up normalized times, which is more or less like
> taking their arithmetic average. And that doesn't work at all: in many
> cases you can "show" completely different results by normalizing
> relative to a different item. Even the simple question "who is faster?"
> can be answered in different ways.
> So you should use the geometric mean, even if this is not so widely
> known. Or rather, it is known to benchmarking experts, but it's
> difficult to become one.
>
> Please, have a look at the short paper:
> "How not to lie with statistics: the correct way to summarize benchmark
> results"
>
> http://scholar.google.com/scholar?cluster=1051144955483053492&hl=en&as_sdt=2000
> I downloaded it from the ACM library, please tell me if you can't find it.
>
> > horizontal (
> > http://speed.pypy.org/comparison/?hor=true&bas=2%2B35&chart=stacked+bars):
> > This is not meant to "demonstrate" that overall the jit is over two times
> > faster than cpython. It is just another way for a developer to picture how
> > long a programme would take to complete if it were composed of 21 such
> > tasks.
>
> You are not summing up absolute times, so your claim is incorrect. And
> the error is significant, given the above paper.
> A sum of absolute times would provide what you claim.
>
> > You can see that cpython's (the normalization chosen) benchmarks all
> > take 1 "relative" second.
> Here, for instance, I see that CPython and pypy-c take more or less
> the same time, which surprises me (since the PyPy interpreter was
> known to be slower than CPython). But given that the result is
> invalid, it may well be an artifact of your statistics.
>
> > pypy-c needs more or less the same time, some
> > "tasks" being slower and some faster. Psyco shows an interesting picture:
> > From meteor-contest downwards (fortuitously), all benchmarks are extremely
> > "compressed", which means they are sped up by psyco quite a lot. But any
> > further speed-up wouldn't make overall time much shorter, because the first
> > group of benchmarks now takes most of the time to complete. pypy-c-jit is a
> > more extreme case of this: if the jit accelerated all "fast" benchmarks to 0
> > seconds (infinitely fast), it would only get about twice as fast as now,
> > because ai, slowspitfire, spambayes and twisted_tcp now need half the entire
> > execution time. A good demonstration of "you are only as fast as your
> > slowest part". Of course the aggregate of all benchmarks is not a real app,
> > but it is still fun.
>
> This could maybe be still true, at least in part, but you have to do
> this reasoning on absolute times.
>
> Best regards, and keep up the good work!
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
>
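Paolo's warning about averaging normalized times can be illustrated with a
small sketch (the two interpreters and all runtimes below are invented): the
arithmetic mean of normalized times flips its verdict depending on which
system is chosen as the baseline, while the geometric mean does not.

```python
from math import prod  # Python 3.8+

def amean(xs):
    return sum(xs) / len(xs)

def gmean(xs):
    return prod(xs) ** (1 / len(xs))

# Invented runtimes of two interpreters A and B on two benchmarks.
a = [2.0, 8.0]
b = [4.0, 4.0]

# Normalize each system's times against the other as baseline.
b_over_a = [bi / ai for ai, bi in zip(a, b)]  # [2.0, 0.5]
a_over_b = [ai / bi for ai, bi in zip(a, b)]  # [0.5, 2.0]

# The arithmetic mean contradicts itself: each choice of baseline makes
# the *other* interpreter look 25% slower.
print(amean(b_over_a))  # 1.25
print(amean(a_over_b))  # 1.25

# The geometric mean gives the same verdict under either baseline.
print(gmean(b_over_a))  # 1.0
print(gmean(a_over_b))  # 1.0
```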
