Python 3 optimizations, continued, continued again...
Hi guys,

while there is at least some interest in incorporating my optimizations, the response has still been low. I figure the changes are probably too big for a single incorporation step. On a recent flight, I thought about cutting them down to make them more easily digestible. The basic idea is to drop the optimized interpreter dispatch loop and the advanced instruction format, and use the existing ones instead.

Currently (rev. ca8a0dfb2176), opcode.h uses 109 of the potentially available 255 instructions in the current instruction format. Hence, up to 149 instruction opcodes could be given to optimized instruction derivatives. Consequently, a possible change would require: a) changing opcode.h to add the new instruction opcodes, b) changing ceval.c to handle the new instruction opcodes in PyEval_EvalFrameEx, and c) changing abstract.c and object.c (possibly other files) to add the quickening/rewriting function calls.

If this is of interest, I could start evaluating which instruction opcodes should be allocated to which derivatives to get the biggest benefit. This is a lot easier to implement (because I can reuse the existing instruction implementations) and can easily be made conditionally compilable, similar to the computed-gotos option. Since the changes are minimal, it is also simpler for everybody else to understand and deal with. On the "downside", however, not all optimizations are possible and/or make sense within the given instruction limit (no data-object inlining and no reference-count elimination).

How does that sound?

Have a nice day,
--stefan
Hi, On Tue, Nov 8, 2011 at 10:36, Benjamin Peterson <benjamin@python.org> wrote:
2011/11/8 stefan brunthaler <s.brunthaler@uci.edu>:
How does that sound?
I think I can hear real patches and benchmarks most clearly.
I spent the better part of my -20% time on implementing the work as "suggested". Please find the benchmarks attached to this email; I just ran them on my system (i7-920, Linux 3.0.0-15, GCC 4.6.1). I branched off the regular 3.3a0 default tip, changeset 73977, shortly after your email. I do not have an official patch yet, but am going to create one if wanted. Changes to the existing interpreter are minimal; the biggest chunk is a new interpreter dispatch loop.

Merging dispatch loops eliminates some of my optimizations, but my inline caching technique enables inlining some functionality, which results in visible speedups. The results are normalized to the non-threaded-code version of the CPython interpreter (named "vanilla"), so that I can compare them to my preceding results. I anticipate *no* compatibility issues, and the interpreter requires less than 100 KiB of extra memory at run time. Since my interpreter uses 215 of a maximum of 255 instructions, there is room for adding further derivatives, e.g., for popular Python libraries, too.

Let me know what python-dev thinks of this and have a nice weekend,
--stefan

PS: AFAIR the version without partial stack frame caching also passes all regression tests modulo the ones that test against specific bytecodes.
2012/1/27 stefan brunthaler <s.brunthaler@uci.edu>:
Hi,
On Tue, Nov 8, 2011 at 10:36, Benjamin Peterson <benjamin@python.org> wrote:
2011/11/8 stefan brunthaler <s.brunthaler@uci.edu>:
How does that sound?
I think I can hear real patches and benchmarks most clearly.
I spent the better part of my -20% time on implementing the work as "suggested". Please find the benchmarks attached to this email; I just ran them on my system (i7-920, Linux 3.0.0-15, GCC 4.6.1). I branched off the regular 3.3a0 default tip, changeset 73977, shortly after your email. I do not have an official patch yet, but am going to create one if wanted. Changes to the existing interpreter are minimal; the biggest chunk is a new interpreter dispatch loop.
Merging dispatch loops eliminates some of my optimizations, but my inline caching technique enables inlining some functionality, which results in visible speedups. The results are normalized to the non-threaded-code version of the CPython interpreter (named "vanilla"), so that I can compare them to my preceding results. I anticipate *no* compatibility issues, and the interpreter requires less than 100 KiB of extra memory at run time. Since my interpreter uses 215 of a maximum of 255 instructions, there is room for adding further derivatives, e.g., for popular Python libraries, too.
Let me know what python-dev thinks of this and have a nice weekend,
Cool. It'd be nice to see a patch. -- Regards, Benjamin
stefan brunthaler wrote:
Hi,
On Tue, Nov 8, 2011 at 10:36, Benjamin Peterson <benjamin@python.org> wrote:
2011/11/8 stefan brunthaler <s.brunthaler@uci.edu>:
How does that sound?
I think I can hear real patches and benchmarks most clearly.
I spent the better part of my -20% time on implementing the work as "suggested". Please find the benchmarks attached to this email, I just
Could you try benchmarking with the "standard" benchmarks: http://hg.python.org/benchmarks/ and see what sort of performance gains you get?
ran them on my system (i7-920, Linux 3.0.0-15, GCC 4.6.1). I branched off the regular 3.3a0 default tip, changeset 73977, shortly after your email. I do not have an official patch yet, but am going to create one if wanted. Changes to the existing interpreter are minimal; the biggest chunk is a new interpreter dispatch loop.
How portable is the threaded interpreter? Do you have a public repository for the code, so we can take a look? Cheers, Mark.
Hello,
Could you try benchmarking with the "standard" benchmarks: http://hg.python.org/benchmarks/ and see what sort of performance gains you get?
Yeah, of course. I already did. Refer to the page listed below for details. I have not looked into the results yet, though.
How portable is the threaded interpreter?
Well, you can implement threaded code on any machine that supports indirect branch instructions. Fortunately, GCC supports the "labels-as-values" feature, which makes it available on any machine that GCC supports. My optimizations themselves are portable; I tested them on a PowerPC for my thesis, too. (AFAIR, LLVM supports this feature as well.)
Do you have a public repository for the code, so we can take a look?
I have created a patch (as Benjamin wanted) and put all of the resources (i.e., benchmark results and the patch itself) on my home page: http://www.ics.uci.edu/~sbruntha/pydev.html Regards, --stefan
Hello,
Well, you can implement threaded code on any machine that supports indirect branch instructions. Fortunately, GCC supports the "labels-as-values" feature, which makes it available on any machine that GCC supports. My optimizations themselves are portable; I tested them on a PowerPC for my thesis, too. (AFAIR, LLVM supports this feature as well.)
Well, you're aware that Python already uses threaded code where available? Or are you testing against Python 2? Regards Antoine.
stefan brunthaler, 30.01.2012 20:18:
Well, you're aware that Python already uses threaded code where available? Or are you testing against Python 2?
Yes, and I am building on that.
I assume "yes" here means "yes, I'm aware" and not "yes, I'm using Python 2", right? And you're building on top of the existing support for threaded code in order to improve it? Stefan
I assume "yes" here means "yes, I'm aware" and not "yes, I'm using Python 2", right? And you're building on top of the existing support for threaded code in order to improve it?
Your assumption is correct; I'm sorry for the sloppiness (I was heading out for lunch). None of the code is 2.x compatible; all of my work has always targeted Python 3.x. My work does not improve threaded code (as in the interpreter dispatch technique), but enables efficient, purely interpretative inline caching via quickening. (So, after execution of a BINARY_ADD, I rewrite that specific occurrence of the bytecode instruction to, say, a FLOAT_ADD instruction, and verify inside the FLOAT_ADD instruction that my assumption still holds.) Thanks, --stefan
Am 30.01.2012 20:06, schrieb stefan brunthaler:
Do you have a public repository for the code, so we can take a look?
I have created a patch (as Benjamin wanted) and put all of the resources (i.e., benchmark results and the patch itself) on my home page: http://www.ics.uci.edu/~sbruntha/pydev.html
If I read the patch correctly, most of it is auto-generated (and there are probably a few spurious changes that blow it up, such as the python-gdb.py file). But the tool that actually generates the code doesn't seem to be included? (Which means that in this form, the patch couldn't possibly be accepted.) Georg
If I read the patch correctly, most of it is auto-generated (and there are probably a few spurious changes that blow it up, such as the python-gdb.py file).
Hm, honestly, I don't know where the python-gdb.py file comes from; I thought it came with the switch from 3.1 to the tip version I was using. Anyway, I did not touch it, or at least have no recollection of doing so. Regarding the spurious changes: this might very well be the case. Regression testing works, and it would actually be fairly easy to detect crashes (e.g., by tracing all executed bytecode instructions and verifying that all of them are actually executed). I could easily do that if wanted/necessary.
But the tool that actually generates the code doesn't seem to be included? (Which means that in this form, the patch couldn't possibly be accepted.)
Well, the tool is not included because it does a lot more (e.g., generate the code for the elimination of reference-count operations). Unfortunately, my interpreter architecture that achieves the highest speedups is more complicated, and I got the feeling that this did not go over well with python-dev. So I had the idea of basically using just one (but a major) optimization technique and going with that. I don't see why you would need my code generator, though. Not that I'd mind, but I would need to strip it down, remove many parts of it, and make it more accessible to other people. However, if python-dev decides that it wants to include the optimizations and requires the code generator, I'll happily chip in the extra work and give you the corresponding code generator, too. Thanks, --stefan
Am 31.01.2012 16:46, schrieb stefan brunthaler:
If I read the patch correctly, most of it is auto-generated (and there are probably a few spurious changes that blow it up, such as the python-gdb.py file).
Hm, honestly, I don't know where the python-gdb.py file comes from; I thought it came with the switch from 3.1 to the tip version I was using. Anyway, I did not touch it, or at least have no recollection of doing so. Regarding the spurious changes: this might very well be the case. Regression testing works, and it would actually be fairly easy to detect crashes (e.g., by tracing all executed bytecode instructions and verifying that all of them are actually executed). I could easily do that if wanted/necessary.
There is also the issue of the two test modules removed from the test suite.
But the tool that actually generates the code doesn't seem to be included? (Which means that in this form, the patch couldn't possibly be accepted.)
Well, the tool is not included because it does a lot more (e.g., generate the code for the elimination of reference-count operations). Unfortunately, my interpreter architecture that achieves the highest speedups is more complicated, and I got the feeling that this did not go over well with python-dev. So I had the idea of basically using just one (but a major) optimization technique and going with that. I don't see why you would need my code generator, though. Not that I'd mind, but I would need to strip it down, remove many parts of it, and make it more accessible to other people. However, if python-dev decides that it wants to include the optimizations and requires the code generator, I'll happily chip in the extra work and give you the corresponding code generator, too.
Well, nobody wants to review generated code. Georg
There is also the issue of the two test modules removed from the test suite.
Oh, I'm sorry; it seems the patch contained too much of my development stuff. (I had removed the two modules because they were always failing due to the instruction opcodes being changed by quickening; the tests themselves pass, though.)
Well, nobody wants to review generated code.
I agree. The code generator basically uses templates that contain the information, plus a dump of the C structures of several types, which it traverses to see which of them implement which functions. There is really no magic there; the most "complex" part is getting the inline-cache miss checks for function calls right. But I tried to make the generated code look pretty, so that working with it is not too much of a hassle. The code generator itself is a little more complicated, so I am not sure it would help a lot... best, --stefan
stefan brunthaler, 31.01.2012 22:17:
Well, nobody wants to review generated code.
I agree. The code generator basically uses templates that contain the information, plus a dump of the C structures of several types, which it traverses to see which of them implement which functions. There is really no magic there; the most "complex" part is getting the inline-cache miss checks for function calls right. But I tried to make the generated code look pretty, so that working with it is not too much of a hassle. The code generator itself is a little more complicated, so I am not sure it would help a lot...
How many times did you regenerate this code until you got it right? And how do you know that you really got it so right that it was the last time ever that you needed your generator for it? What if the C structure of any of those "several types" ever changes? Stefan
How many times did you regenerate this code until you got it right?
Well, honestly, I changed the code generator to "pack" the new optimized instruction derivatives densely into the available opcodes, so that I can make optimal use of what's there. Thus I only generated the code twice for this patch.
And how do you know that you really got it so right that it was the last time ever that you needed your generator for it?
I am positive that I am going to need my code generator in the future, as I have several ideas for increasing performance even more. As I have mentioned before, my quickening-based inline caching technique is very simple, and if it were to crash, chances are that some of the inline-cache miss guards don't capture all scenarios, i.e., are non-exhaustive. The regression tests pass, as do the official benchmarks plus the Computer Language Benchmarks Game. In addition, this has been my line of research since 2009, so I have extensive experience with it, too.
What if the C structure of any of those "several types" ever changes?
Since I optimize interpreter instructions, any change that affects their implementation requires changing the optimized instructions, too. Having the code generator ready for such things would certainly be a good idea (probably also for generating the default interpreter dispatch loop), since you could also add your own "profile" for your application/domain to re-use the remaining 30+ instruction opcodes. The direct answer is that I would need to re-generate the driver file, which is basically a gdb dump plus an Emacs macro (please note that I have not needed to do that since working with ~3.0b1). I will add a list of the types I specialize on to the patch section of the "additional resources" page of my homepage (including a fixed patch addressing what Georg brought to my attention). --stefan
Wiadomość napisana przez stefan brunthaler w dniu 1 lut 2012, o godz. 16:55:
And how do you know that you really got it so right that it was the last time ever that you needed your generator for it?
I am positive that I am going to need my code generator in the future, as I have several ideas to increase performance even more.
Hello, Stefan. First, let me thank you for your interest in improving the interpreter. We appreciate and encourage efforts to make it perform better.

But let me put this straight: as an open-source project, we are hesitant to accept changes which depend on closed software. Even if your optimization techniques resulted in performance a hundred times better than what is currently achieved, we would still be wary of accepting them.

Please note that this is not because of a lack of trust or, for that matter, greed for your code. We need to make sure that under no circumstances is our codebase in danger because something important was left out along the way. Maintenance of generated code is yet another nuisance that had better be strongly justified.

-- Best regards, Łukasz Langa Senior Systems Architecture Engineer IT Infrastructure Department Grupa Allegro Sp. z o.o.
But let me put this straight: as an open-source project, we are hesitant to accept changes which depend on closed software. Even if your optimization techniques resulted in performance a hundred times better than what is currently achieved, we would still be wary of accepting them.
Please note that this is not because of lack of trust or better yet greed for your code. We need to make sure that under no circumstances our codebase is in danger because something important was left out along the way.
I am positive that the code generator does not depend on any closed-source components; I just use mako for storing the C code templates that I generate -- everything else I wrote myself. Of course, I'll give the code generator to python-dev, too, if necessary. However, I need to strip it down so that it does not do all the other stuff that you don't need. I just wanted to give you the implementation now, since Benjamin said that he wants to see real code and results first. If you want to integrate the inca optimization, I am going to start working on this ASAP.
Maintenance of generated code is yet another nuisance that had better be strongly justified.
I agree, but the nice thing is that the technique is very simple: only if you change a significant part of the interpreter implementation do you need to change the optimized derivatives, too. If one generates the default interpreter implementation as well, then one gets the optimizations almost for free. For maintenance reasons I chose a template-based system, too, since this gives you a direct correspondence between the actual code and what's generated, without interfering with the code generator at all. --stefan
Let's make one thing clear. The Python core developers need to be able to reproduce your results from scratch, and that means access to the templates, code generators, inputs, and everything else you used. (Of course for stuff you didn't write that's already open source, all we need is a pointer to the open source project and the exact version/configuration you used, plus any local mods you made.)

I understand that you're hesitant to just dump your current mess, and you want to clean it up before you show it to us. That's fine. But until you're ready to show it, we're not going to integrate any of your work into CPython, even though some of us (maybe Benjamin) may be interested in kicking its tires. And remember, it doesn't need to be perfect (in fact perfectionism is probably a bad idea here). But it does need to be open source. Every single bit of it. (And no GPL, please.)

--Guido

2012/2/1 stefan brunthaler <s.brunthaler@uci.edu>:
But let me put this straight: as an open-source project, we are hesitant to accept changes which depend on closed software. Even if your optimization techniques resulted in performance a hundred times better than what is currently achieved, we would still be wary of accepting them.
Please note that this is not because of lack of trust or better yet greed for your code. We need to make sure that under no circumstances our codebase is in danger because something important was left out along the way.
I am positive that the code generator does not depend on any closed-source components; I just use mako for storing the C code templates that I generate -- everything else I wrote myself. Of course, I'll give the code generator to python-dev, too, if necessary. However, I need to strip it down so that it does not do all the other stuff that you don't need. I just wanted to give you the implementation now, since Benjamin said that he wants to see real code and results first. If you want to integrate the inca optimization, I am going to start working on this ASAP.
Maintenance of generated code is yet another nuisance that had better be strongly justified.
I agree, but the nice thing is that the technique is very simple: only if you change a significant part of the interpreter implementation do you need to change the optimized derivatives, too. If one generates the default interpreter implementation as well, then one gets the optimizations almost for free. For maintenance reasons I chose a template-based system, too, since this gives you a direct correspondence between the actual code and what's generated, without interfering with the code generator at all.
--stefan
-- --Guido van Rossum (python.org/~guido)
On Wed, Feb 1, 2012 at 09:46, Guido van Rossum <guido@python.org> wrote:
Let's make one thing clear. The Python core developers need to be able to reproduce your results from scratch, and that means access to the templates, code generators, inputs, and everything else you used. (Of course for stuff you didn't write that's already open source, all we need is a pointer to the open source project and the exact version/configuration you used, plus any local mods you made.)
I understand that you're hesitant to just dump your current mess, and you want to clean it up before you show it to us. That's fine. But until you're ready to show it, we're not going to integrate any of your work into CPython, even though some of us (maybe Benjamin) may be interested in kicking its tires. And remember, it doesn't need to be perfect (in fact perfectionism is probably a bad idea here). But it does need to be open source. Every single bit of it. (And no GPL, please.)
I understand all of these issues. Currently, it's not really a mess, but it is much more complicated than it needs to be for supporting only the inca optimization. I don't know what the time frame for a possible integration is (my guess is that it'd be safe to disable it anyway, like the threaded-code support was handled). As for the license: I really don't care about that at all; the only nice-to-have would be a pointer to my home page and/or the corresponding research, but that's about all on my wish list. --stefan
On Wed, Feb 1, 2012 at 11:08 AM, stefan brunthaler <s.brunthaler@uci.edu> wrote:
On Wed, Feb 1, 2012 at 09:46, Guido van Rossum <guido@python.org> wrote:
Let's make one thing clear. The Python core developers need to be able to reproduce your results from scratch, and that means access to the templates, code generators, inputs, and everything else you used. (Of course for stuff you didn't write that's already open source, all we need is a pointer to the open source project and the exact version/configuration you used, plus any local mods you made.)
I understand that you're hesitant to just dump your current mess, and you want to clean it up before you show it to us. That's fine. But until you're ready to show it, we're not going to integrate any of your work into CPython, even though some of us (maybe Benjamin) may be interested in kicking its tires. And remember, it doesn't need to be perfect (in fact perfectionism is probably a bad idea here). But it does need to be open source. Every single bit of it. (And no GPL, please.)
I understand all of these issues. Currently, it's not really a mess, but it is much more complicated than it needs to be for supporting only the inca optimization. I don't know what the time frame for a possible integration is (my guess is that it'd be safe to disable it anyway, like the threaded-code support was handled).
It won't be integrated until you have published your mess.
As for the license: I really don't care about that at all; the only nice-to-have would be a pointer to my home page and/or the corresponding research, but that's about all on my wish list.
Please don't try to enforce that in the license. That usually backfires. Use Apache 2, which is what the PSF prefers. -- --Guido van Rossum (python.org/~guido)
On Wed, Feb 1, 2012 at 20:08, stefan brunthaler <s.brunthaler@uci.edu> wrote:
I understand all of these issues. Currently, it's not really a mess, but it is much more complicated than it needs to be for supporting only the inca optimization.
I really don't think that is a problem. The core contributors can deal well with complexity in my experience. :-) //Lennart
I really don't think that is a problem. The core contributors can deal well with complexity in my experience. :-)
No no, I wasn't trying to insinuate anything like that at all. No, I just figured that having the code generator being able to generate 4 optimizations where only one is supported is a bad idea for several reasons, such as maintainability, etc. Anyways, I've just completed the integration of the code generator and put the corresponding patch on my page (http://www.ics.uci.edu/~sbruntha/pydev.html) for downloading. The license thing is still missing, I'll do that tomorrow or sometime next week. Regards, --stefan
On Feb 1, 2012, at 12:46 PM, Guido van Rossum wrote:
I understand that you're hesitant to just dump your current mess, and you want to clean it up before you show it to us. That's fine. (...) And remember, it doesn't need to be perfect (in fact perfectionism is probably a bad idea here).
Just as a general point of advice to open source contributors, I'd suggest erring on the side of the latter rather than the former suggestion here: dump your current mess, along with the relevant caveats ("it's a mess, much of it is irrelevant") so that other developers can help you clean it up, rather than putting the entire burden of the cleanup on yourself. Experience has taught me that most people who hold back work because it needs cleanup eventually run out of steam and their work never gets integrated and maintained. -glyph
participants (11)
- Antoine Pitrou
- Benjamin Peterson
- Georg Brandl
- Glyph Lefkowitz
- Guido van Rossum
- Lennart Regebro
- Mark Shannon
- Stefan Behnel
- stefan brunthaler
- stefan brunthaler
- Łukasz Langa