[pypy-dev] New speed.pypy.org version

Fri Jul 2 09:27:10 CEST 2010

Hi Paolo,

hey! I think it is a great idea. With logs you get both: correct
normalized totals AND the ability to display the individual stacked
series, which necessarily add arithmetically. But it strikes me,
hasn't anyone written a paper about that method already? or at least
documented it?

Anyway I need to check that the math is right (hopefully today), and
then I would go and implement it.
I'll tell you how it goes.

Cheers,
Miquel

2010/6/30 Paolo Giarrusso <p.giarrusso at gmail.com>:
> Hi Miquel,
> I'm quite busy (because of a paper deadline next Tuesday), sorry for
> not answering earlier.
>
> I was just struck by an idea: there is a stacked bar plot where the
> total bar is related to the geometric mean, such that it is
> normalization-invariant. But this graph _is_ complicated.
>
> It is a stacked plot of _logarithms_ of performance ratios? This way,
> the complete stacked bar shows the logarithm of the product, rather
> than their sum, i.e. the log of the (geometric mean)^N rather than
> their arithmetic mean. log of the (geometric mean)^N = N*log of the
> (geometric mean).
>
> Some simple maths (I didn't write it out, so please recheck!) seems to
> show that showing (a+b*log (ratio)), instead of log(ratio), gives
> still a fair comparison, obtaining N*a+b*N*log(geomean) =
> \Theta(log(geomean)). You need to put a and b because showing if the
> ratio is 1, log(1) is zero (b is the representation scale which is
> always there).
>
> About your workaround: I would like a table with the geometric mean of
> the ratios, where we get the real global performance ratio among the
> interpreters. As far as the results of your solution do not contradict
> that _real_ table, it should be a reasonable workaround (but I would
> embed the check in the code - otherwise other projects _will be_
> bitten by that). Probably, I would like the website to offer such a
> table to users, and I would like a graph of the overall performance
> ratio over time (actually revisions).
>
> Finally, the docs of your web application should at the very least
> reference the paper and this conversation (if there's a public archive
> of the ML, as I think), and ideally explain the issue.
>
> Sorry for being too dense, maybe - if I was unclear, please tell me
> and I'll answer next week.
>
> Best regards,
> Paolo
>
> On Mon, Jun 28, 2010 at 11:21, Miquel Torres <tobami at googlemail.com> wrote:
>> Hi Paolo,
>>
>> I read the paper, very interesting. It is perfectly clear that to
>> calculate a normalized total only the geometric mean makes sense.
>>
>> However, a stacked bars plot shows the individual benchmarks so it
>> implicitly is an arithmetic mean. The only solution (apart from
>> removing the stacked charts and only offering total bars) is the
>> weighted approach.
>>
>> External weights are not very practical though. Codespeed is used by
>> other projects so an extra option would need to be added to the
>> settings to allow the introducing of arbitrary weights to benchmarks.
>> A bit cumbersome. I have an idea that may work. Take the weights from
>> a defined baseline so that the run times are equal, which is the same
>> as normalizing to a baseline. It would be the same as now, only that
>> you can't choose the normalization, it will be weighted (normalized)
>> according the default baseline (which you already can already
>> configure in the settings).
>>
>> You may say that it is still an arithmetic mean, but there won't be
>> conflicting results because there is only a single normalization. For
>> PyPy that would be cpython, and everything would make sense.
>> I know it is a work around, not a solution. If you think it is a bad
>> idea, the only other possibility is not to have stacked bars (as in
>> "showing individual benchmarks"). But I find them useful. Yes you can
>> see the individual benchmark results better in the normal bars chart,
>> but there you don't see visually which benchmarks take the biggest
>> part of the pie, which helps visualize what parts of your program need
>> most improving.
>>
>> What do you think?
>>
>> Regards,
>> Miquel
>>
>>
>> 2010/6/25 Paolo Giarrusso <p.giarrusso at gmail.com>:
>>> On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tobami at googlemail.com> wrote:
>>>> Hi Paolo,
>>>>
>>>> I am aware of the problem with calculating benchmark means, but let me
>>>> explain my point of view.
>>>>
>>>> You are correct in that it would be preferable to have absolute times. Well,
>>>> you actually can, but see what it happens:
>>>> http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
>>>
>>> Ahah! I didn't notice that I could skip normalization! This does not
>>> fully invalidate my point, however.
>>>
>>>> Absolute values would only work if we had carefully chosen benchmaks
>>>> runtimes to be very similar (for our cpython baseline). As it is, html5lib,
>>>> spitfire and spitfire_cstringio completely dominate the cummulative time.
>>>
>>> I acknowledge that (btw, it should be cumulative time, with one 'm',
>>> both here and in the website).
>>>
>>>> And not because the interpreter is faster or slower but because the
>>>> benchmark was arbitrarily designed to run that long. Any improvement in the
>>>> long running benchmarks will carry much more weight than in the short
>>>> running.
>>>
>>>> What is more useful is to have comparable slices of time so that the
>>>> improvements can be seen relatively over time.
>>>
>>> If you want to sum up times (but at this point, I see no reason for
>>> it), you should rather have externally derived weights, as suggested
>>> by the paper (in Rule 3).
>>> As soon as you take weights from the data, lots of maths that you need
>>> is not going to work any more - that's generally true in many cases in
>>> statistics.
>>> And the only way making sense to have external weights is to gather
>>> them from real world programs. Since that's not going to happen
>>> easily, just stick with the geometric mean. Or set an arbitrarily low
>>> weight, manually, without any math, so that the long-running
>>> benchmarks stop dominating the res. It's no fraud, since the current
>>> graph is less valid anyway.
>>>
>>>> Normalizing does that i
>>>> think.
>>> Not really.
>>>
>>>> It just says: we have 21 tasks which take 1 second to run each on
>>>> interpreter X (cpython in the default case). Then we see how other
>>>> executables compare to that. What would the geometric mean achieve here,
>>>> exactly, for the end user?
>>>
>>> You actually need the geomean to do that. Don't forget that the
>>> geomean is still a mean: it's a mean performance ratio which averages
>>> individual performance ratios.
>>> If PyPy's geomean is 0.5, it means that PyPy is going to run that task
>>> in 11.5 seconds instead of 21. To me, this sounds exactly like what
>>> you want to achieve. Moreover, it actually works, unlike what you use.
>>>
>>> For instance, ignore PyPy-JIT, and look only CPython and pypy-c (no
>>> JIT). Then, change the normalization among the two:
>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
>>> with the current data, you get that in one case cpython is faster, in
>>> the other pypy-c is faster.
>>> It can't happen with the geomean. This is the point of the paper.
>>>
>>> I could even construct a normalization baseline $base such that
>>> CPython seems faster than PyPy-JIT. Such a base should be very fast
>>> on, say, ai (where CPython is slower), so that $cpython.ai/$base.ai
>>> becomes 100 and $pypyjit.ai/$base.ai becomes 200, and be very slow on
>>> other benchmarks (so that they disappear in the sum).
>>>
>>> So, the only difference I see is that geomean works, arithm. mean
>>> doesn't. That's why Real Benchmarkers use geomean.
>>>
>>> Moreover, you are making a mistake quite common among non-physicists.
>>> What you say makes sense under the implicit assumption that dividing
>>> two times gives something you can use as a time. When you say "Pypy's
>>> runtime for a 1 second task", you actually want to talk about a
>>> performance ratio, not about the time. In the same way as when you say
>>> "this bird runs 3 meters long in one second", a physicist would sum
>>> that up as "3 m/s" rather than "3 m".
>>>
>>>> I am not really calculating any mean. You can see that I carefully avoided
>>>> to display any kind of total bar which would indeed incur in the problem you
>>>> mention. That a stacked chart implicitly displays a total is something you
>>>> can not avoid, and for that kind of chart I still think normalized results
>>>> is visually the best option.
>>>
>>> But on a stacked bars graph, I'm not going to look at individual bars
>>> at all, just at the total: it's actually less convenient than in
>>> "normal bars" to look at the result of a particular benchmark.
>>>
>>> I hope I can find guidelines against stacked plots, I have a PhD
>>> colleague reading on how to make graphs.
>>>
>>> Best regards
>>> --
>>> Paolo Giarrusso - Ph.D. Student
>>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>>>
>>
>
>
>
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
>