[Python-Dev] Python 3 optimizations...

stefan brunthaler stefan at brunthaler.net
Fri Jul 23 08:48:26 CEST 2010


Hi,

I guess it would be a good idea to quickly outline my inline caching
approach, so that we all have a basic understanding of how it works.
If we take for instance the BINARY_ADD instruction, the interpreter
evaluates the actual operand types and chooses the matching operation
implementation at runtime, i.e., operands that are unicode strings
will be concatenated via unicode_concatenate, for float operands on
the other hand, the interpreter would end up invoking float_add via
binary_op1. Now, a very efficient way to achieve purely interpretative
inline caching is to quicken the type-generic BINARY_ADD instruction
to a type-dependent FLOAT_ADD instruction (this technique, i.e.,
inline caching via quickening, is the primary contribution of my ECOOP
paper). Hence, I have a very simple code generator that generates
type-dependent interpreter instructions in a pre-compilation step of
the interpreter and uses runtime type information to quicken/rewrite
instructions.
Aside from the operators, I have implemented this quickening
technique for the FOR_ITER, COMPARE_OP, and CALL_FUNCTION
instructions.
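To make the idea concrete, here is a hedged toy sketch (this is not
CPython's actual ceval loop; opcode names and the run() helper are
invented for illustration): a generic BINARY_ADD rewrites itself in
place to a type-specialized FLOAT_ADD once float operands are
observed, so subsequent executions skip the type dispatch, and a
cache miss falls back to the generic form.

```python
# Toy quickening interpreter: instructions are rewritten in place
# based on runtime type feedback. Not CPython code.

BINARY_ADD, FLOAT_ADD = 0, 1

def run(code, stack):
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == BINARY_ADD:
            b, a = stack.pop(), stack.pop()
            if type(a) is float and type(b) is float:
                code[pc] = FLOAT_ADD  # quicken: rewrite instruction in place
            stack.append(a + b)       # generic dispatch this time
        elif op == FLOAT_ADD:
            b, a = stack.pop(), stack.pop()
            if type(a) is float and type(b) is float:
                stack.append(a + b)   # fast path: operand types already known
            else:
                code[pc] = BINARY_ADD # cache miss: de-quicken to generic form
                stack.append(a + b)
        pc += 1
    return stack

code = [BINARY_ADD]
print(run(code, [1.5, 2.5]))  # [4.0]; code[0] is now FLOAT_ADD
```

In the real interpreter the rewrite happens on the bytecode array
itself, so the fast path is taken by the dispatch loop with no extra
checks.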


> I'm absolutely interested, although not for the CPython project but for
> Cython. I wonder how you do inline caching in Python if the methods of a
> type can be replaced by whatever at runtime. Could you elaborate on that?
>
Currently, I only provide optimized derivatives for several distinct
kinds of call targets, i.e., depending on whether a call target is a
C function with varargs or a Python function/method--this already
eliminates a lot of the overhead of invoking call_function. Based on
further quantitative analysis, I can easily provide inline-cached
derivatives of frequently called functions, such as some builtin
primitives.
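A hedged sketch of that call-target specialization (opcode names and
the classify()/call() helpers are invented, and Python type checks
stand in for the interpreter's C-level checks): the dispatch on the
callee kind happens once, when the instruction is quickened, rather
than on every call.

```python
# Toy CALL_FUNCTION quickening by call-target kind. Not CPython code.
import types

CALL_FUNCTION, CALL_C_FUNCTION, CALL_PY_FUNCTION = range(3)

def classify(callee):
    # One-time dispatch on the kind of the call target.
    if isinstance(callee, types.BuiltinFunctionType):
        return CALL_C_FUNCTION      # C function (varargs)
    if isinstance(callee, types.FunctionType):
        return CALL_PY_FUNCTION     # Python function/method
    return CALL_FUNCTION            # stay generic for anything else

def call(code, i, callee, args):
    if code[i] == CALL_FUNCTION:
        code[i] = classify(callee)  # quicken for subsequent calls
    # A specialized derivative would now skip the kind checks entirely.
    return callee(*args)

code = [CALL_FUNCTION]
call(code, 0, len, (["a", "b"],))
print(code[0] == CALL_C_FUNCTION)  # True: quickened to the C-function case
```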

> Based on what information do you switch between inlining states?
>
I have instrumented versions of some functions, such as binary_op1,
do_richcompare, and call_function, that allow me to make quickening
decisions, i.e., to rewrite instructions to optimized, inline-cached
derivatives.
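One plausible shape for such instrumentation (all names here are
invented; the real decisions happen inside the C implementations of
the functions named above): record the operand-type pair observed at
each instruction site and trigger quickening once a site looks
monomorphic.

```python
# Toy instrumented binary operation collecting type feedback per site.
from collections import Counter

THRESHOLD = 3
site_types = {}   # instruction index -> Counter of observed type pairs
quickened = {}    # instruction index -> type pair it was specialized for

def instrumented_binary_op(site, a, b):
    pair = (type(a), type(b))
    counts = site_types.setdefault(site, Counter())
    counts[pair] += 1
    if counts[pair] >= THRESHOLD and site not in quickened:
        quickened[site] = pair   # here the bytecode would be rewritten
    return a + b

for _ in range(3):
    instrumented_binary_op(0, 1.0, 2.0)
print(quickened)  # site 0 is now marked for the (float, float) derivative
```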


> Or do you restrict yourself to builtin types?
>
Currently, my approach provides optimized derivative instructions for
the standard built-in types, e.g., unicode strings, numerical
objects, containers, and iterators.


> That might be worth it
> already, just think of list.append(). We have an optimistic optimisation for
> object.append() in Cython that gives us massive speed-ups in loops that
> build lists, even if we don't know at compile time that we are dealing with
> lists.
>
Yes, that sounds like a reasonable thing to do. I could provide much
more optimized derivatives based on application profiles, too. Since I
use a simple code generator for generating the derivatives, it would
also be possible to provide end-users with the means to analyze their
apps and generate optimized instruction derivatives matching their
profile.
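The code-generator idea can be sketched like this (a toy, with an
invented template and profile format; the actual generator emits C
interpreter instructions in a pre-compilation step): given an
application profile of hot (opcode, type) pairs, emit one specialized
handler per pair, analogous to deriving FLOAT_ADD from BINARY_ADD.

```python
# Toy profile-driven generator of specialized instruction handlers.

PROFILE = [("BINARY_ADD", "float"), ("BINARY_ADD", "str")]

TEMPLATE = """\
def {opcode}_{typ}(a, b):
    # specialized for {typ}; a real generator would emit C code here
    return a + b
"""

def generate(profile):
    return "\n".join(TEMPLATE.format(opcode=op, typ=t) for op, t in profile)

namespace = {}
exec(generate(PROFILE), namespace)
print(namespace["BINARY_ADD_float"](1.5, 2.5))  # 4.0
```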


Regards,
--stefan

