[Python-Dev] Python 3 optimizations continued...

stefan brunthaler s.brunthaler at uci.edu
Fri Sep 2 17:55:03 CEST 2011


> 1) The SFC optimisation is purely based on static code analysis, right? I
> assume it takes loops into account (and just multiplies scores for inner
> loops)? Is that what you mean with "nesting level"? Obviously, static
> analysis can sometimes be misleading, e.g. when there's a rare special case
> with lots of loops that needs to adapt input data in some way, but in
> general, I'd expect that this heuristic would tend to hit the important
> cases, especially for well structured code with short functions.
>
Yes, currently I only use the heuristic to statically estimate the
utility of assigning an optimized slot to a local variable. And,
another yes, nested blocks (like for-statements) are what I have in
mind when using "nesting level". I was told that the algorithm itself
is very similar to linear scan register allocation, modulo the ability
to spill values, of course.
From my benchmarks and in-depth analysis of several programs, I found
this to work very well. In fact, the only problematic situation I
found is (unfortunately) in one of the most frequently executed
functions in US' bm_django.py: there is one loop that is almost never
executed, but this loop gives precedence to the local variables used
inside it. Because of this, I already have an idea for a better
approach: first, use the static heuristic to compute the stack slot
score, then count back-branches (I would need this anyway, as
_Py_CheckInterval is gone and OSR/hot-swapping is in general a good
idea) and record their frequency. Next, just replace the current
static weight of 100 with the dynamically recorded weight.
Consequently, you should get better allocations. (Please note that I
did some quantitative analysis of Python functions to determine that
using 4 SFC slots covers a substantial fraction of functions [IIRC
>95%] in the trivial scenario where there are at most 4 local
variables.)
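To make the heuristic concrete, here is a minimal sketch in Python of
the kind of scoring I mean (the names and the AST-level formulation
are illustrative only; the actual implementation works on bytecode,
not source):

    import ast
    from collections import Counter

    LOOP_WEIGHT = 100     # the static weight mentioned above
    NUM_SFC_SLOTS = 4     # covers >95% of functions (see above)

    def score_locals(source):
        """Score each name by LOOP_WEIGHT ** nesting_level per use."""
        scores = Counter()

        def visit(node, depth):
            if isinstance(node, ast.Name):
                scores[node.id] += LOOP_WEIGHT ** depth
            deeper = depth + isinstance(node, (ast.For, ast.While))
            for child in ast.iter_child_nodes(node):
                visit(child, deeper)

        visit(ast.parse(source), 0)
        # the highest-scoring names get the optimized slots
        return [name for name, _ in scores.most_common(NUM_SFC_SLOTS)]

Replacing LOOP_WEIGHT ** depth with the recorded back-branch frequency
is exactly the dynamic refinement described above.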


> 2) The RC elimination is tricky to get right and thus somewhat dangerous,
> but sounds worthwhile and should work particularly well on a stack based
> byte code interpreter like CPython.
>
Well, it was very tricky to get right when I first implemented it
around Christmas 2009. The current implementation is reasonably simple
to understand; however, it depends on the function refcount_effect
giving me correct information at all times. I got the biggest
performance improvement on one benchmark on the PowerPC, and I think
that RISC architectures in general benefit more from this optimization
(it eliminates the load, add, and store instructions) than x86 CISCs
do (an INCREF is just an add on the memory location without data
dependencies, so it is fairly cheap). In any case, however, you get
the replication effect of improving CPU branch prediction by having
these additional instruction derivatives. It would be interesting
(research-wise, too) to measure whether the reduction in memory
operations makes Python programs use less energy, and if so, how large
the difference is.
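To illustrate the idea (in Python rather than at the C level where it
actually happens, and with a toy stand-in for refcount_effect): a
peephole pass can cancel a reference-count increment that is
immediately undone by a decrement on the same value:

    def refcount_effect(op):
        # toy stand-in for the analysis function mentioned above
        return {"INCREF": +1, "DECREF": -1}.get(op, 0)

    def eliminate_redundant_rc(instructions):
        """Drop INCREF/DECREF pairs that cancel out.

        Simplifying assumption: adjacent refcount operations act on
        the same top-of-stack object.  Only INCREF-then-DECREF is
        removed; the reverse order is unsafe, because the DECREF
        could free the object before the INCREF runs.
        """
        out = []
        for op in instructions:
            if out and refcount_effect(out[-1]) > 0 > refcount_effect(op):
                out.pop()          # the INCREF is immediately undone
            else:
                out.append(op)
        return out

    print(eliminate_redundant_rc(["LOAD", "INCREF", "DECREF", "STORE"]))
    # -> ['LOAD', 'STORE']

In the real implementation this is done by emitting instruction
derivatives that simply omit the refcount operations, rather than by
rewriting an explicit instruction stream as above.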


> 3) Inline caching also sounds worthwhile, although I wonder how large the
> savings will be here. You'd save a couple of indirect jumps at the C-API
> level, sure, but apart from that, my guess is that it would highly depend on
> the type of instruction. Certain (repeated) calls to C implemented functions
> would likely benefit quite a bit, for example, which would be a nice
> optimisation by itself, e.g. for builtins. I would expect that the same
> applies to iterators, even a couple of percent faster iteration can make a
> great deal of a difference, and a substantial set of iterators are
> implemented in C, e.g. itertools, range, zip and friends.
>
> I'm not so sure about arithmetic operations. In Cython, we (currently?) do
> not optimistically replace these with more specific code (unless we know the
> types at compile time), because it complicates the generated C code and
> indirect jumps aren't all that slow that the benefit would be important.
> Savings are *much* higher when data can be unboxed, so much that the slight
> improvement for optimistic type guesses is totally dwarfed in Cython. I
> would expect that the return of investment is better when the types are
> actually known at runtime, as in your case.
>
Well, in my thesis I already hint at another improvement of the
existing design that can work on unboxed data as well (while still
being an interpreter). I am eager to try this, but I don't know how
much time I can spend on it (because there are several other research
projects I am actively involved in). In my experience, this works very
well, and you cannot actually report good speedups without
inline-caching arithmetic operations, simply because that's where all
JITs shine and because most benchmarks don't reflect real-world
scenarios but are mathematics-inclined microbenchmarks. Also, if
compilers (gcc and clang) become able to inline the invoked functions
in the future, higher speedups will be possible.
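For readers unfamiliar with the technique, the core of inline caching
via quickening fits in a few lines of Python (all names here are made
up for the illustration; the real derivatives live in the C
interpreter loop):

    def binary_add(code, pc, a, b):
        # generic case: observe the operand types and quicken
        if type(a) is float and type(b) is float:
            code[pc] = binary_add_float    # rewrite this instruction
        return a + b

    def binary_add_float(code, pc, a, b):
        # specialized derivative with a type guard
        if type(a) is float and type(b) is float:
            return a + b                   # fast path
        code[pc] = binary_add              # guard failed: de-specialize
        return binary_add(code, pc, a, b)

    code = [binary_add]
    code[0](code, 0, 1.5, 2.5)   # generic dispatch, patches the slot
    code[0](code, 0, 1.5, 2.5)   # now hits the specialized fast path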



> 4) Regarding inlined object references, I would expect that it's much more
> worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST. I guess
> that this would be best helped by watching the module dict and the builtin
> dict internally and invalidating the interpreter state after changes (e.g.
> by providing a change counter in those dicts and checking that in the
> instructions that access them), and otherwise keeping the objects cached.
> Simply watching the dedicated instructions that change that state isn't
> enough as Python allows code to change these dicts directly through their
> dict interface.
>
Ok, I thought about something along these lines, too, but in the end I
decided to go with the current design, as it is easy and
language-neutral (for my research I primarily chose Python as a
demonstration vehicle, and none of these techniques is specific to
Python). LOAD_GLOBAL pays off handsomely, and I think that I could
easily make it correct for all cases if I knew the places that need to
call "invalidate_cache". Most of the LOAD_CONST instructions can be
replaced with the inlined version (INCA_LOAD_CONST), and while I did
not benchmark this in isolation, these instructions are executed so
frequently that even small optimizations pay off nicely. Another point
is that you can slim down the activation record of
PyEval_EvalFrameEx, because you don't need the "consts" field anymore
(similarly, you could probably eliminate the "names" and "fastlocals"
fields, if you find that most of the frequent and fast cases are
covered by the optimized instructions).
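For the invalidation question, the change-counter scheme you describe
could look roughly like this (a hypothetical sketch, not CPython's
actual dict internals; update, pop, etc. would need the same
treatment):

    class VersionedDict(dict):
        """A dict that bumps a version tag on every mutation, so
        cached lookups can be validated with one integer compare."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0

        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1

        def __delitem__(self, key):
            super().__delitem__(key)
            self.version += 1

    def load_global(globals_dict, name, cache):
        # fast path: the dict has not changed since the last lookup
        if cache.get("version") == globals_dict.version:
            return cache["value"]
        # slow path: real lookup, then refill the cache
        value = globals_dict[name]
        cache["version"] = globals_dict.version
        cache["value"] = value
        return value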


> All in all, your list does sound like an interesting set of changes that are
> both understandable and worthwhile.
>
Thanks, I think so, too, which is why I wanted to integrate the
optimizations with CPython in the first place.


Thanks for the pointers to the dict stuff; I will take a look (IIRC,
Antoine pointed me in the same direction last year, but I think the
design was slightly different then).
--stefan

