[Python-Dev] Python 3 optimizations continued...

Fri Sep 2 11:13:19 CEST 2011

stefan brunthaler, 02.09.2011 06:37:
> as promised, I created a publicly available preview of an
> implementation with my optimizations, which is available under the
> following location:
> https://bitbucket.org/py3_pio/preview/wiki/Home
>
> I followed Nick's advice and added some valuable advice and
> overview/introduction at the wiki page the link points to, I am
> positive that spending 10mins reading this will provide you with a
> valuable information regarding what's happening.

It does, thanks.

A couple of remarks:

1) The SFC optimisation is purely based on static code analysis, right? I 
assume it takes loops into account (and just multiplies scores for inner 
loops)? Is that what you mean with "nesting level"? Obviously, static 
analysis can sometimes be misleading, e.g. when there's a rare special case 
with lots of loops that needs to adapt input data in some way, but in 
general, I'd expect that this heuristic would tend to hit the important 
cases, especially for well structured code with short functions.

2) The RC elimination is tricky to get right and thus somewhat dangerous, 
but sounds worthwhile and should work particularly well on a stack based 
byte code interpreter like CPython.

3) Inline caching also sounds worthwhile, although I wonder how large the 
savings will be here. You'd save a couple of indirect jumps at the C-API 
level, sure, but apart from that, my guess is that it would highly depend 
on the type of instruction. Certain (repeated) calls to C implemented 
functions would likely benefit quite a bit, for example, which would be a 
nice optimisation by itself, e.g. for builtins. I would expect that the 
same applies to iterators, even a couple of percent faster iteration can 
make a great deal of a difference, and a substantial set of iterators are 
implemented in C, e.g. itertools, range, zip and friends.

I'm not so sure about arithmetic operations. In Cython, we (currently?) do 
not optimistically replace these with more specific code (unless we know 
the types at compile time), because it complicates the generated C code and 
indirect jumps aren't all that slow that the benefit would be important. 
Savings are *much* higher when data can be unboxed, so much that the slight 
improvement for optimistic type guesses is totally dwarfed in Cython. I 
would expect that the return of investment is better when the types are 
actually known at runtime, as in your case.

4) Regarding inlined object references, I would expect that it's much more 
worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST. I guess 
that this would be best helped by watching the module dict and the builtin 
dict internally and invalidating the interpreter state after changes (e.g. 
by providing a change counter in those dicts and checking that in the 
instructions that access them), and otherwise keeping the objects cached. 
Simply watching the dedicated instructions that change that state isn't 
enough as Python allows code to change these dicts directly through their 
dict interface.

All in all, your list does sound like an interesting set of changes that 
are both understandable and worthwhile.

Stefan