
Hi Armin,

Thanks for your feedback. We ran one of the programs you suggested as an example for evaluation:

  cd rpython/jit/tl
  non-pgo-pypy ../../bin/rpython -O2 --source targettlr
  pgo-pypy ../../bin/rpython -O2 --source targettlr

We got the following results:

Non-PGO pypy:

  [Timer] Timings:
  [Timer] annotate                   --- 7.5 s
  [Timer] rtype_lltype               --- 5.8 s
  [Timer] backendopt_lltype          --- 3.6 s
  [Timer] stackcheckinsertion_lltype --- 0.1 s
  [Timer] database_c                 --- 19.6 s
  [Timer] source_c                   --- 2.6 s
  [Timer] =========================================
  [Timer] Total:                     --- 39.2 s

PGO pypy:

  [Timer] Timings:
  [Timer] annotate                   --- 7.6 s
  [Timer] rtype_lltype               --- 5.1 s
  [Timer] backendopt_lltype          --- 3.1 s
  [Timer] stackcheckinsertion_lltype --- 0.0 s
  [Timer] database_c                 --- 18.5 s
  [Timer] source_c                   --- 2.3 s
  [Timer] =========================================
  [Timer] Total:                     --- 36.6 s

The delta between the two is about 7% (36.6 s vs. 39.2 s total). We are working on getting the data to identify the percentage of interpreted vs. JIT-compiled code for both binaries. We are also working on creating a pull request to get better feedback on the change.

Regards,
Yash

________________________________________
From: Armin Rigo [armin.rigo@gmail.com]
Sent: Wednesday, November 02, 2016 2:18 AM
To: Singh, Yashwardhan
Cc: pypy-dev@python.org
Subject: Re: [pypy-dev] PGO Optimized Binary

Hi,

On 31 October 2016 at 22:28, Singh, Yashwardhan <yashwardhan.singh@intel.com> wrote:
We applied a compiler-assisted optimization technique called PGO (Profile-Guided Optimization) while building PyPy, and found that performance improved by up to 22.4% on the Grand Unified Python Benchmark (GUPB) from “hg clone https://hg.python.org/benchmarks”. The result table below shows that the majority of the 51 micro-benchmarks got a performance boost, while 8 showed a performance regression.
The kind of performance improvement you are measuring involves only short- or very short-running programs. A few years ago we'd have shrugged it off as irrelevant ("please modify the benchmarks so that they run for at least 10 seconds, more if they are larger") because the JIT compiler doesn't get a chance to warm up. But we'd also have shrugged off your whole attempt: "PGO optimization cannot change anything about the speed of JIT-produced machine code." Nowadays we tend to look more seriously at cold or warming-up performance too, or at least we know that we should. There are (stalled) plans to set up a second benchmark suite for PyPy that focuses on this.

You can get an estimate of whether you're looking at cold or hot code: compare the timings with CPython. Also, you can set the environment variable ``PYPYLOG=jit-summary:-`` and look at the first two lines of the summary to see how much time was spent warming up the JIT (or attempting to).

Note that we did enable PGO long ago, with modest benefits. We gave up when our JIT compiler became good enough. Maybe now is the time to try again (and also, PGO itself might have improved in the meantime).
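Concretely, that check might look like the following sketch, where "yourprog.py" is a placeholder for whatever workload you measured:

  # CPython baseline vs. PyPy: if the two timings are close, the PyPy
  # run is dominated by cold or warming-up code rather than hot JITted code
  time python yourprog.py
  time pypy yourprog.py

  # dump the jit-summary section to stderr on exit (the "-" means stderr);
  # its first two lines show how much time went into warming up the JIT
  PYPYLOG=jit-summary:- pypy yourprog.py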
We’d like to get some input on how to contribute our optimization recipe to the PyPy dev tree, perhaps by creating an item on the PyPy issue tracker?
The best would be to create a pull request so that we can look at your changes more easily.
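For reference, a minimal Mercurial workflow for that could look like the sketch below; the branch name is hypothetical, <your-user> is a placeholder, and this assumes the Bitbucket hosting that PyPy used at the time:

  hg clone https://bitbucket.org/pypy/pypy   # main PyPy repository (2016)
  cd pypy
  hg branch pgo-build                        # hypothetical branch name
  # ... apply the PGO changes to the build recipe ...
  hg commit -m "Build pypy-c with profile-guided optimization"
  hg push ssh://hg@bitbucket.org/<your-user>/pypy   # push to your own fork
  # then open the pull request from the Bitbucket web interface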
In addition, we would appreciate suggestions for any other benchmarks or real-world workloads we could use as alternatives to evaluate this.
You can take any Python program that either runs only very briefly or runs no faster than CPython. For a larger example (with Python 2.7):

  cd rpython/jit/tl
  python ../../bin/rpython -O2 --source targettlr   # 24 secs
  pypy ../../bin/rpython -O2 --source targettlr     # 39 secs

A bientôt,

Armin.