Python 3 optimizations...
Hello, during the last year, I have developed a couple of quickening-based optimizations for the Python 3.1 interpreter. As part of my PhD programme, I have published a first technique that combines quickening with inline caching at this year's ECOOP, and subsequently extended this technique to optimize several load instructions as well as eliminate redundant reference counting operations from instructions, which has been accepted for publication for an upcoming conference. I have a working prototype combining all of these optimizations that achieves a maximum speedup of 2.18 on the spectralnorm benchmark of the computer language benchmarks game. Early measurements on the Unladen Swallow django benchmark (with Martin von Loewis' patch for Python 3) achieve a speedup of about 1.3. Both speedups were measured on an i7 920 when combined with the threaded code/computed goto optimization enabled, and normalized by the standard Python 3.1 interpreter with all optimizations disabled. Since all of these optimizations are purely interpretative, they have next-to-no additional memory requirements and do not incur extensive warm-up costs. I wonder whether you would be interested in integrating these optimizations with the Python 3 distribution, hence this mail. I could send copies of the papers, as well as provide my prototype source code to interested members of the python development community. Have a nice day, --stefan
On Thu, 22 Jul 2010 13:22:48 +0200 stefan brunthaler <sbrunthaler@gmail.com> wrote:
I wonder whether you would be interested in integrating these optimizations with the Python 3 distribution, hence this mail. I could send copies of the papers, as well as provide my prototype source code to interested members of the python development community.
Is the source code under an open source non-copyleft license? Have you checked that the whole regression test suite passes? Can you publish all this online (papers and source code)? Regards Antoine.
Is the source code under an open source non-copyleft license?
I am (unfortunately) not employed or funded by anybody, so I think that I can license/release the code as I see fit.
Have you checked that the whole regression test suite passes?
Currently, I am sure my prototype will not pass the regression test suite, because I used it for benchmarking only. However, I think it is probably not a too much work to get this done.
Can you publish all this online (papers and source code)?
I am not sure whether I can publish the ECOOP paper without violating any Springer copyrights, but will check if that is any problem. I could send you the second paper in private; until it is going to be published in the corresponding proceedings, I fear that I might not be able to put this online, though. Publishing the source code should not be a problem at all, though I think some polishing might not be a bad idea... Best, --stefan
On Thu, Jul 22, 2010 at 10:08 PM, stefan brunthaler <sbrunthaler@gmail.com> wrote:
Is the source code under an open source non-copyleft license?
I am (unfortunately) not employed or funded by anybody, so I think that I can license/release the code as I see fit.
If you did this work under your PhD program, you may be more restricted than you think. You may want to check with your adviser first, cheers, David
On Fri, Jul 23, 2010 at 1:00 AM, David Cournapeau <cournape@gmail.com> wrote:
On Thu, Jul 22, 2010 at 10:08 PM, stefan brunthaler <sbrunthaler@gmail.com> wrote:
Is the source code under an open source non-copyleft license?
I am (unfortunately) not employed or funded by anybody, so I think that I can license/release the code as I see fit.
If you did this work under your PhD program, you may be more restricted than you think. You may want to check with your adviser first,
This is actually a good point - some universities have (IMO) weird ideas regarding the ownership of thesis material. Best to get that squared away before you publish the code online. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, Jul 22, 2010 at 9:22 PM, stefan brunthaler <sbrunthaler@gmail.com> wrote:
I wonder whether you would be interested in integrating these optimizations with the Python 3 distribution, hence this mail. I could send copies of the papers, as well as provide my prototype source code to interested members of the python development community.
The Springer link [1] at least shows the front page to give more of an idea as to what this is about. The idea does sound potentially interesting, although I'm not sure how applicable it will be with a full-blown LLVM-based JIT on the way for 3.3 (via Unladen Swallow). Depending on the size and complexity of the patches, it may still be worth exploring for 3.2. Regards, Nick. [1] http://www.springerlink.com/content/p4u0851w34180n74/ -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
The Springer link [1] at least shows the front page to give more of an idea as to what this is about.
Thanks, I forgot to mention the link.
The idea does sound potentially interesting, although I'm not sure how applicable it will be with a full-blown LLVM-based JIT on the way for 3.3 (via Unladen Swallow).
Yeah, I see that. However, I think that a JIT compiler could very well re-use the information gathered by using my techniques to generate better code in the first run. There certainly are a couple of other benefits for combining both approaches, i.e., they are not mutually exclusive.
Depending on the size and complexity of the patches, it may still be worth exploring for 3.2.
I am currently not aware of the planned release schedule, but I think that some of the techniques could very well be released in 3.2 -- with a conditional compilation target similar to the computed-goto feature. Regards, --stefan
On 7/22/2010 9:36 AM, stefan brunthaler wrote:
Depending on the size and complexity of the patches, it may still be worth exploring for 3.2.
If your work speeds CPython, U.S. would have to be even better to knock it out.
I am currently not aware of the planned release schedule, but I think that some of the techniques could very well be released in 3.2 -- with a conditional compilation target similar to the computed-goto feature.
2.6.6 is scheduled for August. 3.2 releases should start thereafter with a December target. On licenses. Look for the contributor agreement on the site. It says that you license your code on a non-exclusive basis to the PSF under one of two licenses. Both allow PSF to relicense and redistribute your code as part of a Python release. I recommend you print it out and, as someone else suggested, check with your advisor/university as to whether there is any problem with you contributing the software under one of*those particular licenses*. -- Terry Jan Reedy
Hi, stefan brunthaler, 22.07.2010 13:22:
during the last year, I have developed a couple of quickening-based optimizations for the Python 3.1 interpreter. As part of my PhD programme, I have published a first technique that combines quickening with inline caching at this year's ECOOP, and subsequently extended this technique to optimize several load instructions as well as eliminate redundant reference counting operations from instructions, which has been accepted for publication for an upcoming conference. [...] I wonder whether you would be interested in integrating these optimizations with the Python 3 distribution, hence this mail. I could send copies of the papers, as well as provide my prototype source code to interested members of the python development community.
I'm absolutely interested, although not for the CPython project but for Cython. I wonder how you do inline caching in Python if the methods of a type can be replaced by whatever at runtime. Could you elaborate on that? Based on what information do you switch between inlining states? Or do you restrict yourself to builtin types? That might be worth it already, just think of list.append(). We have an optimistic optimisation for object.append() in Cython that gives us massive speed-ups in loops that build lists, even if we don't know at compile time that we are dealing with lists. Stefan
Hi, I guess it would be a good idea to quickly outline my inline caching approach, so that we all have a basic understanding of how it works. If we take for instance the BINARY_ADD instruction, the interpreter evaluates the actual operand types and chooses the matching operation implementation at runtime, i.e., operands that are unicode strings will be concatenated via unicode_concatenate, for float operands on the other hand, the interpreter would end up invoking float_add via binary_op1. Now, a very efficient way to achieve purely interpretative inline caching is to quicken the type-generic BINARY_ADD instruction to a type-dependent FLOAT_ADD instruction (this technique, i.e., inline caching via quickening, is the primary contribution of my ECOOP paper). Hence, I have a very simple code generator, that generates type-dependent interpreter instructions in a pre-compile step of the interpreter, and uses runtime type information to quicken/rewrite instructions. Aside of the operators, I have implemented this quickening technique for FOR_ITER, COMPARE_OP and CALL_FUNCTION instructions.
I'm absolutely interested, although not for the CPython project but for Cython. I wonder how you do inline caching in Python if the methods of a type can be replaced by whatever at runtime. Could you elaborate on that?
Currently, I only provide optimized derivatives for several separate call targets, i.e., whether a call target is a C function with varargs, or a Python function/method--this already eliminates a lot of overhead from invoking call_function. Based on further quantitative analysis, I can easily provide inline cached derivatives of frequently called functions, such as some builtin primitives.
Based on what information do you switch between inlining states?
I have instrumented versions of some functions that allow me to make quickening decisions, such as binary_op1, do_richcompare, or call_function, where I can quicken instructions to an optimized, inline cached, instruction derivative.
Or do you restrict yourself to builtin types?
Currently, my approach provides optimized derivative instructions for the standard library, e.g., unicode strings, numerical objects, containers, and iterators.
That might be worth it already, just think of list.append(). We have an optimistic optimisation for object.append() in Cython that gives us massive speed-ups in loops that build lists, even if we don't know at compile time that we are dealing with lists.
Yes, that sounds like a reasonable thing to do. I could provide much more optimized derivatives based on application profiles, too. Since I use a simple code generator for generating the derivatives, it would also be possible to provide end-users with the means to analyze their apps and generate optimized instruction derivatives matching their profile. Regards, --stefan
stefan brunthaler, 23.07.2010 08:48:
I guess it would be a good idea to quickly outline my inline caching approach, so that we all have a basic understanding of how it works.
Yes, that certainly makes it easier to discuss.
If we take for instance the BINARY_ADD instruction, the interpreter evaluates the actual operand types and chooses the matching operation implementation at runtime, i.e., operands that are unicode strings will be concatenated via unicode_concatenate, for float operands on the other hand, the interpreter would end up invoking float_add via binary_op1. Now, a very efficient way to achieve purely interpretative inline caching is to quicken the type-generic BINARY_ADD instruction to a type-dependent FLOAT_ADD instruction (this technique, i.e., inline caching via quickening, is the primary contribution of my ECOOP paper). Hence, I have a very simple code generator, that generates type-dependent interpreter instructions in a pre-compile step of the interpreter, and uses runtime type information to quicken/rewrite instructions. Aside of the operators, I have implemented this quickening technique for FOR_ITER, COMPARE_OP and CALL_FUNCTION instructions.
This sounds like wpython (a CPython derivative with a wider set of byte code commands) could benefit from it. Do I understand correctly that you modify the byte code of modules/functions at runtime?
I'm absolutely interested, although not for the CPython project but for Cython. I wonder how you do inline caching in Python if the methods of a type can be replaced by whatever at runtime. Could you elaborate on that?
Currently, I only provide optimized derivatives for several separate call targets, i.e., whether a call target is a C function with varargs, or a Python function/method--this already eliminates a lot of overhead from invoking call_function.
Ah, yes, that makes good sense. So you basically add an intermediate step to calls that provides faster dispatch for known C functions.
Or do you restrict yourself to builtin types?
Currently, my approach provides optimized derivative instructions for the standard library, e.g., unicode strings, numerical objects, containers, and iterators.
I'm interested in the code that determines what can be optimised in what way. I read that Jython recently received a contribution that provides type information for lots of modules and builtins, but having something like that for CPython would be cool.
That might be worth it already, just think of list.append(). We have an optimistic optimisation for object.append() in Cython that gives us massive speed-ups in loops that build lists, even if we don't know at compile time that we are dealing with lists.
Yes, that sounds like a reasonable thing to do. I could provide much more optimized derivatives based on application profiles, too. Since I use a simple code generator for generating the derivatives, it would also be possible to provide end-users with the means to analyze their apps and generate optimized instruction derivatives matching their profile.
Such an approach would also be very useful for Cython. Think of a profiler that runs a program in CPython and tells you exactly what static type annotations to put where in your Python code to make it compile to a fast binary with Cython. Or, even better, it could just spit out a .pxd file that you drop next to your .py file and that provides the static type information for you. Stefan
This sounds like wpython (a CPython derivative with a wider set of byte code commands) could benefit from it.
I am aware of the wpython project of Cesare di Mauro. I change the instruction format from bytecode to wordcode, too (because it allows for more efficient instruction decoding). Contrary to his approach, however, I do not change the instruction encoding to pack in additional optimizations. (I hope to have put that correctly; I have seen his slides about a year ago.)
Do I understand correctly that you modify the byte code of modules/functions at runtime?
Yes. Quickening is runtime only optimization technique that rewrites instructions from a generic instruction to an optimized derivative (orignally for the Java virtual machine). It is completely hidden from the compiler and has no other dependencies than the interpreter dispatch routine itself.
Ah, yes, that makes good sense. So you basically add an intermediate step to calls that provides faster dispatch for known C functions.
Exactly. I also contemplated to provide optimized derivatives for all builtin functions, but never implemented it (lack of time). Based on quantitative analysis of usage frequency one could very well decide to, e.g., provide an optimized CALL_FUNCTION derivative for the "len" function. Another benefit of using my technique is that a compiler could decide to inline all of the functions of the optimized derivatives (e.g., the float_add function call inside my FLOAT_ADD interpreter instruction). Unfortunately, however, gcc currently does not allow for cross-module inlining (AFAIR). (Preliminary tests with manually changing the default inlining size for ceval.c resulted in speedups of up to 1.3 on my machine, so I think inlinling of function bodies for the optimized derivatives would boost performance noticeably.)
I'm interested in the code that determines what can be optimised in what way. I read that Jython recently received a contribution that provides type information for lots of modules and builtins, but having something like that for CPython would be cool.
Ok. For this year's PPPJ I wanted to submit a paper realizing my optimization in Jython. Because of bytecode-rewriting tools, the interpreter could decide at runtime which optimized derivatives to generate and add rewriting code that supports the changing instruction set. Either way (static pre-compiling or dynamic bytecode rewriting that is), I think that Jython and IronPython would greatly benefit from applying this optimization technique, because their JIT compilers would inline the function calls with a high likelihood.
Such an approach would also be very useful for Cython. Think of a profiler that runs a program in CPython and tells you exactly what static type annotations to put where in your Python code to make it compile to a fast binary with Cython. Or, even better, it could just spit out a .pxd file that you drop next to your .py file and that provides the static type information for you.
Hm, I think you could very easily save away the optimized bytecode sequence for function calls that would allow you to do that (e.g., you could save something similar to: LOAD_FAST LOAD_CONST LONG_ADD or LOAD_GLOBAL CALL_C_ZERO ) Cheers, --stefan
On Fri, Jul 23, 2010 at 1:58 AM, stefan brunthaler <stefan@brunthaler.net> wrote:
Do I understand correctly that you modify the byte code of modules/functions at runtime?
Yes. Quickening is runtime only optimization technique that rewrites instructions from a generic instruction to an optimized derivative (orignally for the Java virtual machine). It is completely hidden from the compiler and has no other dependencies than the interpreter dispatch routine itself.
How do you generate the specialized opcode implementations? Presumably that is done ahead of time, or you'd have to use a JIT, which is what you're avoiding. I'm guessing from your comments below about cross-module inlining that you generate a separate .c file with the specialized opcode bodies and then call through to them via a table of function pointers indexed by opcode, but I could be totally wrong. :)
Another benefit of using my technique is that a compiler could decide to inline all of the functions of the optimized derivatives (e.g., the float_add function call inside my FLOAT_ADD interpreter instruction). Unfortunately, however, gcc currently does not allow for cross-module inlining (AFAIR). (Preliminary tests with manually changing the default inlining size for ceval.c resulted in speedups of up to 1.3 on my machine, so I think inlinling of function bodies for the optimized derivatives would boost performance noticeably.)
There are a variety of solutions to getting cross-module inlining these days. Clang+LLVM support link-time optimization (LTO) via a plugin for gold. GCC has LTO and LIPO as well.
Such an approach would also be very useful for Cython. Think of a profiler that runs a program in CPython and tells you exactly what static type annotations to put where in your Python code to make it compile to a fast binary with Cython. Or, even better, it could just spit out a .pxd file that you drop next to your .py file and that provides the static type information for you.
This would be interesting. We have (obviously) have similar instrumentation in unladen swallow to gather type feedback. We talked with Craig Citro about finding a way to feed that back to Cython for exactly this reason, but we haven't really pursued it. Reid
How do you generate the specialized opcode implementations?
I have a small code generator written in Python that uses Mako templates to generate C files that can be included in the main interpreter. It is a data driven approach that uses type information gathered by gdb and check whether given types implement for instance a nb_add method.
Presumably that is done ahead of time, or you'd have to use a JIT, which is what you're avoiding.
Yes, and yes: I execute the code generator before compiling the Python interpreter, and I am interested in purely interpretative optimization techniques.
I'm guessing from your comments below about cross-module inlining that you generate a separate .c file with the specialized opcode bodies and then call through to them via a table of function pointers indexed by opcode, but I could be totally wrong. :)
No, dead on ;) Probably a small example from the top of my head illustrates what is going on: TARGET(FLOAT_ADD): w= POP(); v= TOP(); x= PyFloat_Type.tp_as_number->nb_add(v, w); SET_TOP(x); if (x != NULL) FAST_DISPATCH(); break; And I extend the standard indirect threaded code dispatch table to support the FLOAT_ADD operation.
There are a variety of solutions to getting cross-module inlining these days. Clang+LLVM support link-time optimization (LTO) via a plugin for gold. GCC has LTO and LIPO as well.
A PhD colleague from our institute pointed the gold stuff out to me yesterday, I am going to check out if any of these solutions would work. A deeper problem here is that the heuristics of the compilers are ill-suited to the needs of compiling an interpreter dispatch routine -- I will investigate this further in future research.
This would be interesting. We have (obviously) have similar instrumentation in unladen swallow to gather type feedback. We talked with Craig Citro about finding a way to feed that back to Cython for exactly this reason, but we haven't really pursued it.
Ok; I think it would actually be fairly easy to use the type information gathered at runtime by the quickening approach. Several auxiliary functions for dealing with these types could be generated by my code generator as well. It is probably worth looking into this, though my current top-priority is my PhD research, so I cannot promise to being able to allocate vast amounts of time for such endeavours. Best, --stefan
On Fri, Jul 23, 2010 at 11:26 AM, stefan brunthaler <stefan@brunthaler.net> wrote:
I'm guessing from your comments below about cross-module inlining that you generate a separate .c file with the specialized opcode bodies and then call through to them via a table of function pointers indexed by opcode, but I could be totally wrong. :)
No, dead on ;) Probably a small example from the top of my head illustrates what is going on:
TARGET(FLOAT_ADD): w= POP(); v= TOP(); x= PyFloat_Type.tp_as_number->nb_add(v, w); SET_TOP(x); if (x != NULL) FAST_DISPATCH(); break;
And I extend the standard indirect threaded code dispatch table to support the FLOAT_ADD operation.
I think I was wrong, but now I understand. The inlining you want is to get the nb_add body, not the opcode body. The example you've given brings up a correctness issue. It seems you'd want to add checks that verify that the operands w and v are both floats, and jump to BINARY_ADD if the guard fails. It would require reshuffling the stack operations to defer the pop until after the check, but it shouldn't be a problem.
This would be interesting. We have (obviously) have similar instrumentation in unladen swallow to gather type feedback. We talked with Craig Citro about finding a way to feed that back to Cython for exactly this reason, but we haven't really pursued it.
Ok; I think it would actually be fairly easy to use the type information gathered at runtime by the quickening approach. Several auxiliary functions for dealing with these types could be generated by my code generator as well. It is probably worth looking into this, though my current top-priority is my PhD research, so I cannot promise to being able to allocate vast amounts of time for such endeavours.
I think you also record (via gdb) exactly the information that we record. I now see three consumers of type feedback from the CPython interpreter: you or any quickening approach, Cython, and Unladen Swallow. It might be worth converging on a standard way to record this information and serialize it so that it can be analyzed offline. Our feedback recording mechanism currently uses LLVM data structures, but the point of quickening is to avoid that kind of dependency, so we'd need to rewrite it before it would really be useful to you. Reid
I think I was wrong, but now I understand. The inlining you want is to get the nb_add body, not the opcode body.
Exactly. This would increase performace by quite a bit -- I will start experimentation with that stuff a.s.a.p.
The example you've given brings up a correctness issue. It seems you'd want to add checks that verify that the operands w and v are both floats, and jump to BINARY_ADD if the guard fails. It would require reshuffling the stack operations to defer the pop until after the check, but it shouldn't be a problem.
That's usually the problem when you're doing something from the top of your head, especially when its already 9pm :) You're right, a simple way around this is either: TARGET(FLOAT_ADD): if (!(TOP()->ob_type == SECOND()->ob_type && TOP()->ob_type == &PyFloat_Type)) goto TARGET_BINARY_ADD_SKIP_DECODE; ...remains the same... Note: the skip_decode is necessary because otherwise it would advance the instruction pointer. Another, simple approach is to add another set of labels to denote inline cache misses, e.g.: TARGET(BINARY_ADD): w= POP(); v= TOP(); BINARY_ADD_CACHE_MISS: x= PyNumber_Add(v, w); ... TARGET(FLOAT_ADD): w= POP(); v= TOP(); if (v->ob_type != w->ob_type || v->ob_type != &PyFloat_Type) goto BINARY_ADD_CACHE_MISS; ... You could also call the PyNumber_Add function yourself instead of the goto, but I did not implement it this way...
I think you also record (via gdb) exactly the information that we record. I now see three consumers of type feedback from the CPython interpreter: you or any quickening approach, Cython, and Unladen Swallow. It might be worth converging on a standard way to record this information and serialize it so that it can be analyzed offline.
Indeed. Probably a bigger set of types of frequently used C extension modules would be useful as well. I am doing the simplest possible thing here: since the gdb pretty print already is pretty close to a Python data type definition, I just copied this and did a few search and replace commands to convert it to a valid Python data type.
Our feedback recording mechanism currently uses LLVM data structures, but the point of quickening is to avoid that kind of dependency, so we'd need to rewrite it before it would really be useful to you.
Ok. --stefan
2010/7/23 stefan brunthaler <stefan@brunthaler.net>
This sounds like wpython (a CPython derivative with a wider set of byte code commands) could benefit from it.
I am aware of the wpython project of Cesare di Mauro.
wpython has reached 1.1 final version. If you are interested, you can find it here: http://code.google.com/p/wpython2/ and you can download the new slides that cover the improvements over 1.0 alpha.
I change the instruction format from bytecode to wordcode, too (because it allows for more efficient instruction decoding).
Did you used wpython wordcode format, or a new one?
Contrary to his approach, however, I do not change the instruction encoding to pack in additional optimizations. (I hope to have put that correctly; I have seen his slides about a year ago.)
Yes, you're right. wpython approach is to encode as much information as it can to save space, decoding time, "specialize" some opcodes, etc.. Cesare
wpython has reached 1.1 final version. If you are interested, you can find it here: http://code.google.com/p/wpython2/ and you can download the new slides that cover the improvements over 1.0 alpha.
Thanks for the hint, I will definitely check your new slides.
Did you used wpython wordcode format, or a new one?
No, actually I was well into working on my stuff when you announced wpython last year. My latest instruction format uses a native machine word (64bit) that contains two 32bit halves with the opcode in the lower half and the operand in the upper half. While the opcode certainly does not exceed 10bits or so, I need more than a byte to support more operations (my latest interpreter has 395 instructions). Our instruction decoding is almost identical, though.
Yes, you're right. wpython approach is to encode as much information as it can to save space, decoding time, "specialize" some opcodes, etc..
Yes, I see that wpython eliminates common instruction sequences. From my understanding, it corresponds to using static superinstructions in combination with a changed instruction format. Aside of the optimizations in the operation implementation wpython allows to eliminate some instruction dispatches, which are notoriously inefficient. I think it is a very nice approach and some form of inline caching with quickening might well boost performance even more. Cheers, --stefan
2010/7/23 Stefan Behnel <stefan_ml@behnel.de>
stefan brunthaler, 23.07.2010 08:48:
If we take for instance the BINARY_ADD instruction, the interpreter
evaluates the actual operand types and chooses the matching operation implementation at runtime, i.e., operands that are unicode strings will be concatenated via unicode_concatenate, for float operands on the other hand, the interpreter would end up invoking float_add via binary_op1. Now, a very efficient way to achieve purely interpretative inline caching is to quicken the type-generic BINARY_ADD instruction to a type-dependent FLOAT_ADD instruction (this technique, i.e., inline caching via quickening, is the primary contribution of my ECOOP paper). Hence, I have a very simple code generator, that generates type-dependent interpreter instructions in a pre-compile step of the interpreter, and uses runtime type information to quicken/rewrite instructions. Aside of the operators, I have implemented this quickening technique for FOR_ITER, COMPARE_OP and CALL_FUNCTION instructions.
This sounds like wpython (a CPython derivative with a wider set of byte code commands) could benefit from it.
WPython 1.1 does it at compile time, if you enable the new "experimental integer opcodes" flag. Similar optimizations were introduced with new opcodes for specialized string interpolation and joins, which are common operations in Python. It also added a new opcode GET_GENERATOR which internally uses a faster function call, which is used also by (the modified) BUILD_CLASS for the same reason (cut some unnecessary checks and code). Cesare
participants (9)
-
Antoine Pitrou -
Cesare Di Mauro -
David Cournapeau -
Nick Coghlan -
Reid Kleckner -
Stefan Behnel -
stefan brunthaler -
stefan brunthaler -
Terry Reedy