On Mon, Nov 21, 2016 at 03:26:19PM -0800, Kevin Modzelewski wrote:
> Oh sorry I was unclear, yes this is for the pyston binary itself, and yes
> PGO does a better job and I definitely think it should be used.
That raises a second question: do you collect branch/hotness info while
running lower-tier jitted code, so as to improve the performance of the
higher tiers?
> Separately, we often use non-pgo builds for quick checks, so we also have
> the system I described that makes our non-pgo build more reliable by using
> the function ordering from the pgo build.
OK. Are you just "putting hot stuff in the hot section", or did you try to
specify an explicit ordering to further improve locality? (I don't know if
that's possible; it's mentioned in one of the papers.)
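For concreteness, here is the kind of thing I imagine, as a rough sketch with
made-up file and function names, assuming a GCC toolchain where the gold
linker and its --section-ordering-file option are available:

    # Coarse placement: functions marked hot (via PGO or __attribute__((hot)))
    # are simply grouped together in the .text.hot section.
    #
    # Explicit ordering: emit one ELF section per function, then give the gold
    # linker an order file; hot_order.txt would list section names such as
    # ".text.eval_frame", hottest first.
    gcc -O2 -ffunction-sections -c interp.c
    gcc interp.o -fuse-ld=gold -Wl,--section-ordering-file=hot_order.txt -o interp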
Thanks,
Serge
On Sat, Nov 19, 2016 at 05:58:19PM -0800, Kevin Modzelewski wrote:
> I think it's safe to not reinvent the wheel here. Some searching gives:
> http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf
> http://www.cs.utexas.edu/users/mckinley/papers/dcm-vee-2006.pdf
> https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort
Thanks Kevin for the pointers! I'm new to this area of optimization...
another source of fun and weirdness :-$
> Pyston takes a different approach where we pull the list of hot functions
> from the PGO build, ie defer all the hard work to the C compiler.
You're talking about the build of Pyston itself, not the JIT-generated
code, right? In that case, how is it different from a regular
-fprofile-generate build followed by several runs and then -fprofile-use?
A PGO build should perform better than just marking some functions as hot,
since it also includes information for better branch prediction, right?
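For reference, my mental model of the PGO workflow is the classic two-step
build sketched below; the file and workload names are only placeholders:

    # Instrumented build: each run writes .gcda profile data next to the objects.
    gcc -O2 -fprofile-generate app.c -o app
    ./app representative-workload-1
    ./app representative-workload-2
    # Optimised rebuild: GCC uses the collected profiles for branch
    # probabilities, basic-block layout, inlining decisions and hot/cold
    # function placement.
    gcc -O2 -fprofile-use app.c -o app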
On Sat, Nov 19, 2016 at 02:32:26AM +0100, Victor Stinner wrote:
> Hi,
>
> I'm happy because I just finished an article putting the most
> important things that I learnt this year on the most silly issue with
> Python performance: code placement.
>
> https://haypo.github.io/analysis-python-performance-issue.html
>
> I explain how to debug such issue and my attempt to fix it in CPython.
>
> I hate code placement issues :-) I hate performance slowdowns caused
> by random unrelated changes...
>
> Victor
Thanks *a lot*, Victor, for this great article. You not only describe very
accurately the method you used to track down the performance bug, but
also give very convincing results.
I still wonder what the conclusion should be:
- (this) Microbenchmarks are not relevant at all; they are sensitive to minor
factors that do not matter for bigger applications.
- Is there a generally good code layout that favors most applications?
Maybe some core functions of the interpreter? Why does PGO fail to
"find" them?
Serge
Hi,
I'm happy because I just finished an article putting the most
important things that I learnt this year on the most silly issue with
Python performance: code placement.
https://haypo.github.io/analysis-python-performance-issue.html
I explain how to debug such issue and my attempt to fix it in CPython.
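For readers who want to try the same kind of investigation, the rough recipe
looks like the sketch below (the binary names and the benchmark script are
placeholders, not the exact commands from the article):

    # Locate the hot native functions while the benchmark runs.
    perf record ./python bm_call_method.py
    perf report --sort symbol

    # Compare where those symbols ended up in two otherwise similar builds.
    nm --numeric-sort ./python-fast | grep call_function
    nm --numeric-sort ./python-slow | grep call_function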
I hate code placement issues :-) I hate performance slowdowns caused
by random unrelated changes...
Victor
On Tue, Nov 15, 2016 at 12:20:18AM +1000, Nick Coghlan wrote:
> On 11 November 2016 at 03:01, Paul Graydon <paul(a)paulgraydon.co.uk> wrote:
> > I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm
> > completely failing to find it in my emails.
>
> You may be thinking of the PGO-related issue that Victor found on
> *14*.04: https://mail.python.org/pipermail/speed/2016-November/000471.html
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan | ncoghlan(a)gmail.com | Brisbane, Australia
I think you might be right there. Too many bugs bouncing around at work and on other projects; I guess I'm losing track :D
Paul
I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm
completely failing to find it in my emails.
The OpenStack-Ansible project has noticed that performance on Ubuntu 16.04 is significantly worse than on 14.04.
At the moment it's looking like *possibly* a GCC-related bug.
https://bugs.launchpad.net/ubuntu/+source/python2.7/+bug/1638695
Code layout matters a lot and you can get lucky or unlucky with it. I
wasn't able to make it to this talk but the slides look quite interesting:
https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes-of-performance-instability-due-to-code-placement-in-x86
I'm not sure how much we mere mortals can debug this sort of thing, but I
know the Intel folks have at one point expressed interest in making sure
that Python runs quickly on their processors, so they might be willing to
give advice (the deck even says "if all else fails, ask Intel").
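One cheap first step anyone can try is to run perf stat on the fast and slow
builds and see which front-end counters move; something along these lines
(the exact event names vary between CPUs and perf versions):

    perf stat -e instructions,branches,branch-misses,L1-icache-load-misses,iTLB-load-misses \
        ./python bm_call_method.py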
On Fri, Nov 4, 2016 at 3:35 PM, Victor Stinner <victor.stinner(a)gmail.com>
wrote:
> Hi,
>
> I noticed a temporary performance peak in the call_method:
>
> https://speed.python.org/timeline/#/?exe=4&ben=call_method&env=1&revs=50&equid=off&quarts=on&extr=on
>
> The difference is major: 17 ms => 29 ms, 70% slower!
>
> I expected a temporary issue on the server used to run benchmarks,
> but... I reproduced the result on the server.
>
> Recently, the performance of call_method() in the CPython default branch
> changed from 17 ms to 28 ms (well, the exact value varies: 25 ms, 28 ms,
> 29 ms, ...) and then back to 17 ms:
>
> (1) ce85a1f129e3: 17 ms => 83877018ef97 (Oct 18): 25 ms
>
> https://hg.python.org/cpython/rev/83877018ef97
>
> (2) 3e073e7b4460: 28 ms => 204a43c452cc (Oct 22): 17 ms
>
> https://hg.python.org/cpython/rev/204a43c452cc
>
> None of these revisions modify code used in the call_method()
> benchmark, so I guess that it's yet another compiler joke.
>
>
> On my laptop and my desktop PC, I'm unable to reproduce the issue: the
> performance is the same (I tested ce85a1f129e3, 83877018ef97,
> 204a43c452cc). Both PCs run Fedora 24 with GCC 6.2.1. CPUs:
>
> * laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
> * desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
>
>
> The speed-python server runs Ubuntu 14.04 with GCC 4.8.4-2ubuntu1~14.04. CPU:
> "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz".
>
>
> The call_method() benchmark is a microbenchmark which seems to depend a
> lot on very low-level stuff like the CPU's L1 cache. Maybe the impact of
> the compiler is bigger on speed-python, which has an older CPU, than on
> my more recent hardware. Maybe GCC 6.2 produces more efficient machine
> code than GCC 4.8.
>
>
> I expect that PGO would "fix" the call_method() performance issue, but
> PGO compilation fails on Ubuntu 14.04 with a compiler error :-p A
> solution would be to upgrade the OS of this server.
>
> Victor
On 4 November 2016 at 22:12, Victor Stinner <victor.stinner(a)gmail.com> wrote:
> I don't know the hardware of the speed-python server well yet. The CPU is an
> "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz":
This is still the system HP contributed a few years back, so the full
system specs can be found at https://speed.python.org/about/
Once you get the benchmark suite up and running reliably there, it
could be interesting to get it running under Beaker [1] and then let
it loose as an automated job in Red Hat's hardware compatibility
testing environment :)
Cheers,
Nick.
[1] https://beaker-project.org/
--
Nick Coghlan | ncoghlan(a)gmail.com | Brisbane, Australia
2016-11-05 15:56 GMT+01:00 Nick Coghlan <ncoghlan(a)gmail.com>:
> Since the use case for --duplicate is to reduce the relative overhead
> of the outer loop when testing a micro-optimisation within a *given*
> interpreter, perhaps the error should be for combining --duplicate and
> --compare-to at all? And then it would just be up to developers of a
> *particular* implementation to know if "--duplicate" is relevant to
> them.
Hum, I think that using "timeit --compare-to=python --duplicate=1000"
makes sense when you compare two versions of CPython.
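Concretely, I mean something like this, with both paths pointing at CPython
builds (the paths below are placeholders):

    ./cpython-patched/python -m perf timeit '[1,2]*1000' \
        --duplicate=1000 --compare-to=./cpython-baseline/python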
If I understood Armin correctly, using --duplicate on a Python
implementation with a JIT should fail with an error.
It's in my (long) TODO list ;-)
Victor
On 3 November 2016 at 02:03, Armin Rigo <armin.rigo(a)gmail.com> wrote:
> Hi Victor,
>
> On 2 November 2016 at 16:53, Victor Stinner <victor.stinner(a)gmail.com> wrote:
>> 2016-11-02 15:20 GMT+01:00 Armin Rigo <armin.rigo(a)gmail.com>:
>>> Is that really the kind of examples you want to put forward?
>>
>> I am not a big fan of timeit, but we must use it sometimes for
>> micro-optimizations in CPython, to check whether an optimization really
>> makes CPython faster or not. I am only trying to enhance timeit.
>> Understanding the results requires understanding how the statements
>> are executed.
>
> Don't get me wrong, I understand the point of the following usage of timeit:
>
> python2 -m perf timeit '[1,2]*1000' --duplicate=1000
>
> What I'm criticizing here is this instead:
>
> python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy
>
> because you're very unlikely to get any relevant information from such
> a comparison. I stand by my original remark: I would say it should be
> an error or at least a big fat warning to use --duplicate and PyPy in
> the same invocation. This is as opposed to silently ignoring
> --duplicate for PyPy, which is just adding more confusion imho.
Since the use case for --duplicate is to reduce the relative overhead
of the outer loop when testing a micro-optimisation within a *given*
interpreter, perhaps the error should be for combining --duplicate and
--compare-to at all? And then it would just be up to developers of a
*particular* implementation to know if "--duplicate" is relevant to
them.
Regards,
Nick.
--
Nick Coghlan | ncoghlan(a)gmail.com | Brisbane, Australia