[pypy-dev] New speed.pypy.org version

Fri Jun 25 14:07:44 CEST 2010

Hi!
First, I want to restate the obvious, before pointing out what I think
is a mistake: your work on this website is great and very useful!

On Fri, Jun 25, 2010 at 13:08, Miquel Torres <tobami at googlemail.com> wrote:
> - stacked bars
Here you are summing up normalized times, which is more or less like
taking their arithmetic mean. And that doesn't work at all: in many
cases you can "show" completely different results by normalizing
relative to another item. Even the simple question "who is faster?"
can be answered in different ways.
So you should use the geometric mean, even if it is not so widely
known. Or rather, it is known to benchmarking experts, but it is
difficult to become one.

Please, have a look at the short paper:
"How not to lie with statistics: the correct way to summarize benchmark results"
http://scholar.google.com/scholar?cluster=1051144955483053492&hl=en&as_sdt=2000
I downloaded it from the ACM library, please tell me if you can't find it.

> horizontal(http://speed.pypy.org/comparison/?hor=true&bas=2%2B35&chart=stacked+bars):
> This is not meant to "demonstrate" that overall the jit is over two times
> faster than cpython. It is just another way for a developer to picture how
> long a programme would take to complete if it were composed of 21 such
> tasks.

You are not summing up absolute times, so your claim is incorrect. And
the error is significant, given the above paper.
A sum of absolute times would provide what you claim.
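
A small sketch of the difference, again in Python and again with invented
timings and hypothetical function names: a stacked bar of normalized times
weights every benchmark equally, while the total running time of a
programme made of these tasks is dominated by the long-running ones.

```python
# Hypothetical absolute timings in seconds (invented data):
# one short benchmark and one long benchmark.
baseline = [1.0, 100.0]
candidate = [0.1, 90.0]

def stacked_normalized(candidate, baseline):
    """Height of the stacked bar: sum of per-benchmark normalized times."""
    return sum(c / b for c, b in zip(candidate, baseline))

def apparent_speedup(candidate, baseline):
    """Speedup suggested by the stacked bars (the baseline bar has height n)."""
    return len(baseline) / stacked_normalized(candidate, baseline)

def actual_speedup(candidate, baseline):
    """Speedup of a programme that runs every task once, in absolute time."""
    return sum(baseline) / sum(candidate)

print(f"apparent: {apparent_speedup(candidate, baseline):.2f}x")
print(f"actual:   {actual_speedup(candidate, baseline):.2f}x")
```

Here the stacked bars suggest a speedup of about 2x, while the sum of
absolute times gives only about 1.12x, since the tenfold gain on the short
benchmark barely affects the total.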

> You can see that cpython's (the normalization chosen) benchmarks all
> take 1 "relative" second.
Here, for instance, I see that CPython and pypy-c take more or less
the same time, which surprises me (since the PyPy interpreter was
known to be slower than CPython). But given that the result is
invalid, it may well be an artifact of your statistics.

> pypy-c needs more or less the same time, some
> "tasks" being slower and some faster. Psyco shows an interesting picture:
> From meteor-contest downwards (fortuitously) , all benchmarks are extremely
> "compressed", which means they are speeded up by psyco quite a lot. But any
> further speed up wouldn't make overall time much shorter because the first
> group of benchmarks now takes most of the time to complete. pypy-c-jit is a
> more extreme case of this: If the jit accelerated all "fast" benchmarks to 0
> seconds (infinitely fast), it would only get about twice as fast as now
> because ai, slowspitfire, spambayes and twisted_tcp now need half the entire
> execution time. A good demonstration of "you are only as fast as your
> slowest part". Of course the aggregate of all benchmarks is not a real app,
> but it is still fun.

This may still be true, at least in part, but you have to do
this reasoning on absolute times.

Best regards, and keep up the good work!
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/