Python 3 optimizations continued...
Hi,

pretty much a year ago I wrote about the optimizations I did for my PhD thesis that target the Python 3 series interpreters. While I got some replies, the discussion never really picked up and no final explicit conclusion was reached. AFAICT, my optimizations were not that interesting for inclusion with the distribution at that time because of the following two factors: a) Unladen Swallow was targeting Python 3, too. b) My prototype did not pass the regression tests.

As of November 2010 (IIRC), Google is no longer supporting work on Unladen Swallow, and the project is stalled. (If I am wrong and there is still activity and there are plans for the corresponding PEP, please let me know.) Which is why I recently spent some time fixing issues so that I can run the regression tests. There is still some work to be done, but by and large it should be possible to complete all regression tests in reasonable time. (With the actual infrastructure in place, enabling optimizations later on is not a problem at all, either.)

So, the two big issues aside, is there any interest in incorporating these optimizations in Python 3?

Have a nice day, --stefan

PS: It probably is unusual, but on a part of my home page I have created a link to indicate interest (makes both counting and voting easier, http://www.ics.uci.edu/~sbruntha/). There were also links indicating interest in funding the work; I have disabled these, so as not to upset anybody or give the impression of begging for money...
Perhaps there would be something to say given patches/overviews/specifics.
Currently I don't have patches, but for an overview and specifics, I can provide the following:

* My optimizations basically rely on quickening to incorporate run-time information.
* I use two separate instruction dispatch routines, and use profiling to switch from the regular Python 3 dispatch routine to an optimized one (the implementation is actually vice versa, but that is not important now).
* The optimized dispatch routine has a changed instruction format (word-sized instead of bytecodes) that allows for regular instruction decoding (without the HAS_ARG check) and inlining of some objects in the instruction format on 64-bit architectures.
* I use inline caching based on quickening (passes almost all regression tests [302 out of 307]), eliminate reference count operations using quickening (passes but has a memory leak), promote frequently accessed local variables to their dedicated instructions (passes), and cache LOAD_GLOBAL/LOAD_NAME objects in the instruction encoding when possible (I am working on this right now).

The changes I made can be summarized as:

* I changed some header files to accommodate additional information (Python.h, ceval.h, code.h, frameobject.h, opcode.h, tupleobject.h).
* I changed mostly abstract.c to incorporate runtime type feedback.
* All other changes target mostly ceval.c; all supplementary code is in a sub-directory named "opt" and all generated files in a sub-directory within that ("opt/gen").
* I have a code generator in place that takes care of generating all the functions; it uses the Mako template system for creating C code and does not necessarily need to be shipped with the interpreter (though one can play around and experiment with it).

So, all in all, the changes to the actual implementation are not that big, and most of the code is generated (using sloccount, opt has 1990 lines of C, and opt/gen has 8649 lines of C). That's a quick summary; if there are any further or more in-depth questions, let me know.

best, --stefan
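[To make the quickening idea above concrete, here is a small, self-contained toy sketch in C; it is purely illustrative, and none of the opcode names or types come from the actual patch. A generic instruction rewrites itself in place into a specialized derivative the first time it executes, so later passes over the same code dispatch straight to the fast handler.]

    /* Toy illustration of quickening; not code from the patch. */
    #include <stdio.h>

    enum { OP_ADD, OP_ADD_FAST, OP_PRINT, OP_HALT };

    typedef struct {
        unsigned int opcode;   /* lower half: which handler to run      */
        unsigned int oparg;    /* upper half: argument (toy: immediate) */
    } instr_t;                 /* one machine word per instruction      */

    static long run(instr_t *code)
    {
        long acc = 0;
        instr_t *ip = code;
        for (;;) {
            switch (ip->opcode) {
            case OP_ADD:
                /* Pretend we profiled the operands here, then quicken
                 * in place: the next execution of this position goes
                 * straight to the specialized handler. */
                ip->opcode = OP_ADD_FAST;
                /* fall through */
            case OP_ADD_FAST:
                acc += ip->oparg;
                ip++;
                break;
            case OP_PRINT:
                printf("%ld\n", acc);
                ip++;
                break;
            case OP_HALT:
                return acc;
            }
        }
    }

    int main(void)
    {
        instr_t code[] = { {OP_ADD, 3}, {OP_ADD, 4}, {OP_PRINT, 0}, {OP_HALT, 0} };
        run(code);   /* prints 7; code[] now holds the quickened opcodes */
        return 0;
    }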
On Mon, 29 Aug 2011 11:33:14 -0700
stefan brunthaler
* The optimized dispatch routine has a changed instruction format (word-sized instead of bytecodes) that allows for regular instruction decoding (without the HAS_ARG check) and inlining of some objects in the instruction format on 64-bit architectures.
Having a word-sized "bytecode" format would probably be acceptable in itself, so if you want to submit a patch for that, go ahead. Regards Antoine.
On Tue, Aug 30, 2011 at 7:14 AM, Antoine Pitrou
On Mon, 29 Aug 2011 11:33:14 -0700 stefan brunthaler
wrote: * The optimized dispatch routine has a changed instruction format (word-sized instead of bytecodes) that allows for regular instruction decoding (without the HAS_ARG check) and inlining of some objects in the instruction format on 64-bit architectures.
Having a word-sized "bytecode" format would probably be acceptable in itself, so if you want to submit a patch for that, go ahead.
Although any such patch should discuss how it compares with Cesare's work on wpython. Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space. Armin Rigo made the judgment years ago that CPython was a poor platform for serious optimisation when he stopped working on Psyco and started PyPy instead, and I think the contrasting fates of PyPy and Unladen Swallow have borne out that opinion. Significantly increasing the complexity of CPython for speed-ups that are dwarfed by those available through PyPy seems like a poor trade-off to me. At a bare minimum, I don't think any significant changes should be made under the "it will be faster" justification until the bulk of the real-world benchmark suite used for speed.pypy.org is available for Python 3. (Wasn't there a GSoC project about that?) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space. Armin Rigo made the judgment years ago that CPython was a poor platform for serious optimisation when he stopped working on Psyco and started PyPy instead, and I think the contrasting fates of PyPy and Unladen Swallow have borne out that opinion. Significantly increasing the complexity of CPython for speed-ups that are dwarfed by those available through PyPy seems like a poor trade-off to me.
I agree with the trade-off, but the nice thing is that CPython's interpreter remains simple and portable using my optimizations. All of these optimizations are purely interpretative and the complexity of CPython is not affected much. (For example, I have an inline-cached version of BINARY_ADD that is called INCA_FLOAT_ADD [INCA being my abbreviation for INline CAching]; you don't actually have to look at its source code, since it is generated by my code generator, and you can immediately tell from instruction traces what's going on.) So, the interpreter remains fully portable and any compatibility issues with C modules should not occur either.
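[For readers who want to picture what such an inline-cached derivative does, the following hedged sketch shows the essential shape of a float-specialized add, written against the public C API; it is not the generated code from the patch. The idea: guard the cached type assumption, and fall back to the generic protocol on a miss.]

    /* Sketch of an INCA-style float add; illustrative only. */
    #include <Python.h>

    static PyObject *
    inca_float_add_sketch(PyObject *left, PyObject *right)
    {
        /* Guard: the cached assumption is "both operands are exactly float". */
        if (PyFloat_CheckExact(left) && PyFloat_CheckExact(right)) {
            return PyFloat_FromDouble(PyFloat_AS_DOUBLE(left) +
                                      PyFloat_AS_DOUBLE(right));
        }
        /* Cache miss: fall back to the generic BINARY_ADD behaviour; a real
         * implementation would presumably also re-quicken the instruction. */
        return PyNumber_Add(left, right);
    }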
At a bare minimum, I don't think any significant changes should be made under the "it will be faster" justification until the bulk of the real-world benchmark suite used for speed.pypy.org is available for Python 3. (Wasn't there a GSoC project about that?)
Having more tests would surely be helpful. As already said, the most real-world stuff I can do is Martin's django patch (some of the other benchmarks, though, are from the shootout and I can [and did] run them, too: binarytrees, fannkuch, fasta, mandelbrot, nbody, and spectralnorm; I also have the AI benchmark from Unladen Swallow but no current figures). Best, --stefan
On Tue, 30 Aug 2011 10:00:28 +1000
Nick Coghlan
Having a word-sized "bytecode" format would probably be acceptable in itself, so if you want to submit a patch for that, go ahead.
Although any such patch should discuss how it compares with Cesare's work on wpython. Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space.
Changing the bytecode width wouldn't make the interpreter more complex.
Armin Rigo made the judgment years ago that CPython was a poor platform for serious optimisation when he stopped working on Psyco and started PyPy instead, and I think the contrasting fates of PyPy and Unladen Swallow have borne out that opinion.
Well, PyPy didn't show any significant achievements before they spent *much* more time on it than the Unladen Swallow guys did. Whether or not a good JIT is possible on top of CPython might remain a largely unanswered question.
Significantly increasing the complexity of CPython for speed-ups that are dwarfed by those available through PyPy seems like a poor trade-off to me.
Some years ago we were waiting for Unladen Swallow to improve itself and be ported to Python 3. Now it seems we are waiting for PyPy to be ported to Python 3. I'm not sure how "let's just wait" is a good trade-off if someone proposes interesting patches (which, of course, remains to be seen).
At a bare minimum, I don't think any significant changes should be made under the "it will be faster" justification until the bulk of the real-world benchmark suite used for speed.pypy.org is available for Python 3. (Wasn't there a GSoC project about that?)
I'm not sure what the bulk is, but have you already taken a look at http://hg.python.org/benchmarks/ ? Regards Antoine.
Although any such patch should discuss how it compares with Cesare's work on wpython. Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space.
Changing the bytecode width wouldn't make the interpreter more complex.
No, but I think Stefan is proposing to add a *second* byte code format, in addition to the one that remains there. That would certainly be an increase in complexity.
Some years ago we were waiting for Unladen Swallow to improve itself and be ported to Python 3. Now it seems we are waiting for PyPy to be ported to Python 3. I'm not sure how "let's just wait" is a good trade-off if someone proposes interesting patches (which, of course, remains to be seen).
I completely agree. Let's not put unmet preconditions to such projects. For example, I still plan to write a JIT for Python at some point. This may happen in two months, or in two years. I wouldn't try to stop anybody from contributing improvements that may become obsolete with the JIT. The only recent case where I *did* try to stop people is with PEP-393, where I do believe that some of the changes that had been made over the last year become redundant. Regards, Martin
Changing the bytecode width wouldn't make the interpreter more complex.
No, but I think Stefan is proposing to add a *second* byte code format, in addition to the one that remains there. That would certainly be an increase in complexity.
Yes, indeed I have a more straightforward instruction format to allow for more efficient decoding. Just going from bytecode size to word-code size without changing the instruction format is going to require 8 (or word-size) times more memory on a 64bit system. From an optimization perspective, the irregular instruction format was the biggest problem, because checking for HAS_ARG is always on the fast path and mostly unpredictable. Hence, I chose to extend the instruction format to have word-size and use the additional space to have the upper half be used for the argument and the lower half for the actual opcode. Encoding is more efficient, and *not* more complex. Using profiling to indicate what code is hot, I don't waste too much memory on encoding this regular instruction format.
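[A minimal sketch of the decoding difference being described; the HAVE_ARGUMENT threshold is CPython's, everything else here is an illustrative assumption. Classic bytecode needs a data-dependent HAS_ARG branch per instruction, while the word-sized format decodes with two masks/shifts and no branch.]

    #include <stdint.h>

    /* Classic bytecode decoding (simplified): 1-byte opcode, and only
     * opcodes >= HAVE_ARGUMENT (90) carry a 2-byte little-endian argument. */
    #define HAS_ARG(op) ((op) >= 90)

    static const unsigned char *
    decode_bytecode(const unsigned char *p, int *opcode, int *oparg)
    {
        *opcode = *p++;
        if (HAS_ARG(*opcode)) {            /* branch on the fast path */
            *oparg = p[0] | (p[1] << 8);
            p += 2;
        } else {
            *oparg = 0;
        }
        return p;
    }

    /* Word-sized format: the lower half holds the opcode and the upper
     * half the argument, so every instruction decodes the same way. */
    typedef uint64_t word_instr_t;

    static void
    decode_wordcode(word_instr_t w, uint32_t *opcode, uint32_t *oparg)
    {
        *opcode = (uint32_t)(w & 0xffffffffu);   /* lower half */
        *oparg  = (uint32_t)(w >> 32);           /* upper half */
    }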
For example, I still plan to write a JIT for Python at some point. This may happen in two months, or in two years. I wouldn't try to stop anybody from contributing improvements that may become obsolete with the JIT.
I would not necessarily argue that at least my optimizations would become obsolete; if you still think about writing a JIT, it might make sense to re-use what I've got and not start from scratch, e.g., building a simple JIT compiler that just inlines the operation implementations as templates to eliminate the interpretative overhead (in a similar vein to Piumarta and Riccardi's paper from 1998) might be a good start. Though I don't want to pre-influence your JIT design, I'm just thinking out loud... Regards, --stefan
Stefan, have you shared a pointer to your code yet? Is it open source? It sounds like people are definitely interested and it would make sense to let them experiment with your code and review it. -- --Guido van Rossum (python.org/~guido)
On Tue, Aug 30, 2011 at 09:42, Guido van Rossum
Stefan, have you shared a pointer to your code yet? Is it open source?
I have no shared code repository, but could create one (is there any pydev preferred provider?). I have all the copyrights on the code, and I would like to open-source it.
It sounds like people are definitely interested and it would make sense to let them experiment with your code and review it.
That sounds fine. I need to do some clean-up work (the code contains most of my comments to remind me of issues) and currently does not pass all regression tests. But if people want to take a look first to decide if they want it, then that's good enough for me. (I just wanted to know if there is substantial interest so that it eventually pays off to find and fix the remaining bugs) --stefan
On 8/30/2011 1:23 PM, stefan brunthaler wrote:
(I just wanted to know if there is substantial interest so that it eventually pays off to find and fix the remaining bugs)
It is the nature of our development process that there usually can be no guarantee of acceptance of future code. The rather early acceptance of Unladen Swallow was to me something of an anomaly. I also think it was something of a mistake insofar as it discouraged other efforts, like yours. I think the answer you have gotten is that there is a) substantial interest and b) a willingness to consider a major change such as switching from bytecode to something else. There also seem to be two main concerns: 1) that the increase in complexity be 'less' than the increase in speed, and 2) that the changes be presented in small enough chunks that they can be reviewed. Whether this is good enough for you to proceed is for you to decide. -- Terry Jan Reedy
On Wed, Aug 31, 2011 at 3:23 AM, stefan brunthaler
On Tue, Aug 30, 2011 at 09:42, Guido van Rossum
wrote: Stefan, have you shared a pointer to your code yet? Is it open source?
I have no shared code repository, but could create one (is there any pydev preferred provider?). I have all the copyrights on the code, and I would like to open-source it.
Currently, the easiest way to create shared repositories for CPython variants is to start with bitbucket's mirror of the main CPython repo: https://bitbucket.org/mirror/cpython/overview Use the website to create your own public CPython fork, then edit the configuration of your local copy of the CPython repo to point to your new bitbucket repo rather than the main one on hg.python.org. hg push/pull can then be used as normal to publish in-development material to the world. 'hg pull' from hg.python.org makes it fairly easy to track the trunk. One key thing is to avoid making any changes of your own on the official CPython branches (i.e. default, 3.2, 2.7). Instead, use a named branch for anything you're working on. This makes it much easier to generate standalone patches later on. My own public sandbox (https://bitbucket.org/ncoghlan/cpython_sandbox/overview) is set up that way, and you can see plenty of other examples on bitbucket. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Tue, 30 Aug 2011 08:27:13 -0700
stefan brunthaler
Changing the bytecode width wouldn't make the interpreter more complex.
No, but I think Stefan is proposing to add a *second* byte code format, in addition to the one that remains there. That would certainly be an increase in complexity.
Yes, indeed I have a more straightforward instruction format to allow for more efficient decoding. Just going from bytecode size to word-code size without changing the instruction format is going to require 8 (or word-size) times more memory on a 64bit system.
Do you really need it to match a machine word? Or is, say, a 16-bit format sufficient. Regards Antoine.
Do you really need it to match a machine word? Or is, say, a 16-bit format sufficient.
Hm, technically no, but practically it makes more sense, as (at least for x86 architectures) having opargs and opcodes in half-words can be efficiently expressed in assembly. On 64bit architectures, I could also inline data object references that fit into the 32bit upper half. It turns out that most constant objects fit nicely into this, and I have used this for a special cache region (again below 2^32) for global objects, too. So, technically it's not necessary, but practically it makes a lot of sense. (Most of these things work on 32bit systems, too. For architectures with a smaller size, we can adapt or disable the optimizations.) Cheers, --stefan
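[A sketch of the object-reference inlining mentioned here; the layout and helper names are assumptions, not the patch's actual ones. If an object happens to live below 2**32, its address fits into the 32-bit argument half of the word-sized instruction, so a LOAD_CONST-style derivative can materialize it without indexing a constants tuple.]

    #include <stdint.h>

    /* Does the object's address fit into the 32-bit upper half? */
    static int fits_in_upper_half(const void *obj)
    {
        return (uintptr_t)obj <= (uintptr_t)UINT32_MAX;
    }

    /* lower half: opcode, upper half: inlined object reference */
    static uint64_t encode_inlined_load(uint32_t opcode, const void *obj)
    {
        return (uint64_t)opcode | ((uint64_t)(uintptr_t)obj << 32);
    }

    static void *decode_inlined_object(uint64_t instr)
    {
        return (void *)(uintptr_t)(instr >> 32);
    }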
On Tue, Aug 30, 2011 at 10:50 AM, stefan brunthaler
Do you really need it to match a machine word? Or is, say, a 16-bit format sufficient.
Hm, technically no, but practically it makes more sense, as (at least for x86 architectures) having opargs and opcodes in half-words can be efficiently expressed in assembly. On 64bit architectures, I could also inline data object references that fit into the 32bit upper half. It turns out that most constant objects fit nicely into this, and I have used this for a special cache region (again below 2^32) for global objects, too. So, technically it's not necessary, but practically it makes a lot of sense. (Most of these things work on 32bit systems, too. For architectures with a smaller size, we can adapt or disable the optimizations.)
Do I sense that the bytecode format is no longer platform-independent? That will need a bit of discussion. I bet there are some things around that depend on that. -- --Guido van Rossum (python.org/~guido)
Do I sense that the bytecode format is no longer platform-independent? That will need a bit of discussion. I bet there are some things around that depend on that.
Hm, I haven't really thought about that in detail or for very long; I ran it on PowerPC 970 and Intel Atom & i7 without problems (the latter ones are a non-issue) and think that it can be portable. I just stuff argument and opcode into one word for regular instruction decoding like a RISC CPU, and I realize there might be little/big endian issues, but they surely can be conditionally compiled... --stefan
On Tue, Aug 30, 2011 at 11:23 AM, stefan brunthaler
Do I sense that the bytecode format is no longer platform-independent? That will need a bit of discussion. I bet there are some things around that depend on that.
Hm, I haven't really thought about that in detail or for very long; I ran it on PowerPC 970 and Intel Atom & i7 without problems (the latter ones are a non-issue) and think that it can be portable. I just stuff argument and opcode into one word for regular instruction decoding like a RISC CPU, and I realize there might be little/big endian issues, but they surely can be conditionally compiled...
Um, I'm sorry, but that reply sounds incredibly naive, like you're not really sure what the on-disk format for .pyc files is or why it would matter. You're not even answering the question, except indirectly -- it seems that you've never even thought about the possibility of generating a .pyc file on one platform and copying it to a computer using a different one. -- --Guido van Rossum (python.org/~guido)
Um, I'm sorry, but that reply sounds incredibly naive, like you're not really sure what the on-disk format for .pyc files is or why it would matter. You're not even answering the question, except indirectly -- it seems that you've never even thought about the possibility of generating a .pyc file on one platform and copying it to a computer using a different one.
Well, it may sound incredibly naive, but the truth is: I am never storing the optimized representation to disk, it's done purely at runtime when profiling tells me it makes sense to make the switch. Thus I circumvent many of the problems outlined by you. So I am positive that a full fledged change of the representation has many more intricacies to it, but my approach is only tangentially related... --stefan
On Tue, Aug 30, 2011 at 11:34 AM, stefan brunthaler
Um, I'm sorry, but that reply sounds incredibly naive, like you're not really sure what the on-disk format for .pyc files is or why it would matter. You're not even answering the question, except indirectly -- it seems that you've never even thought about the possibility of generating a .pyc file on one platform and copying it to a computer using a different one.
Well, it may sound incredibly naive, but the truth is: I am never storing the optimized representation to disk, it's done purely at runtime when profiling tells me it makes sense to make the switch. Thus I circumvent many of the problems outlined by you. So I am positive that a full fledged change of the representation has many more intricacies to it, but my approach is only tangentially related...
Ok, then there's something else you haven't told us. Are you saying that the original (old) bytecode is still used (and hence written to and read from .pyc files)? -- --Guido van Rossum (python.org/~guido)
Ok, then there's something else you haven't told us. Are you saying that the original (old) bytecode is still used (and hence written to and read from .pyc files)?
Short answer: yes. Long answer: I added an invocation counter to the code object and keep interpreting in the usual Python interpreter until this counter reaches a configurable threshold. When it reaches this threshold, I create the new instruction format and interpret with this optimized representation. All the macros look exactly the same in the source code, they are just redefined to use the different instruction format. I am at no point serializing this representation or the runtime information gathered by me, as any subsequent invocation might have different characteristics. I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.) Best, --stefan *: not personally targeted at Guido (who is naturally very much welcome to take a look, too) but addressed to python-dev in general.
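[A rough sketch of the trigger described above; all names here are hypothetical, the real patch hangs the counter off the code object. Bump an invocation counter on every call, and once it crosses a configurable threshold, build the optimized word-code representation and dispatch through it from then on.]

    #include <stdint.h>
    #include <stddef.h>

    #define OPT_THRESHOLD 1000          /* assumed configurable value */

    typedef struct code_object {
        long      invocation_count;     /* bumped on every invocation    */
        uint64_t *opt_code;             /* NULL until the switch happens */
        /* ... bytecode, consts, names, ... */
    } code_object;

    /* hypothetical: builds the word-sized encoding from the bytecode */
    uint64_t *create_word_encoding(code_object *co);

    static int should_use_optimized(code_object *co)
    {
        if (++co->invocation_count < OPT_THRESHOLD)
            return 0;                   /* keep interpreting regular bytecode */
        if (co->opt_code == NULL)
            co->opt_code = create_word_encoding(co);
        return co->opt_code != NULL;    /* 1: use the optimized dispatch loop */
    }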
2011/8/30 stefan brunthaler
I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.)
And what is wrong with a public one? -- Regards, Benjamin
On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson
2011/8/30 stefan brunthaler
: I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.)
And what is wrong with a public one?
Well, since it does not fully pass all regression tests and is just meant for people to take a first look to find out if it's interesting, I think I might take it offline after you had a look. It seems to me that that is easier to be done with a private repository, but in general, I don't have a problem with a public one... Regards, --stefan PS: If you want to, I can also just put a tarball on my home page and post a link here. It's not that I would like to have control/influence about who is allowed to look and who doesn't.
2011/8/30 stefan brunthaler
On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson
wrote: 2011/8/30 stefan brunthaler
: I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.)
And what is wrong with a public one?
Well, since it does not fully pass all regression tests and is just meant for people to take a first look to find out if it's interesting, I think I might take it offline after you had a look. It seems to me that that is easier to be done with a private repository, but in general, I don't have a problem with a public one...
Well, if your intention is for people to look at it, public seems to be the best solution. -- Regards, Benjamin
On Tue, Aug 30, 2011 at 1:54 PM, Benjamin Peterson
2011/8/30 stefan brunthaler
: On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson
wrote: 2011/8/30 stefan brunthaler
: I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.)
And what is wrong with a public one?
Well, since it does not fully pass all regression tests and is just meant for people to take a first look to find out if it's interesting, I think I might take it offline after you had a look. It seems to me that that is easier to be done with a private repository, but in general, I don't have a problem with a public one...
Well, if your intention is for people to look at it, public seems to be the best solution.
+1 The point of open source is more eyeballs and the ability for anyone else to pick up code and run in whatever direction they want (license permitting) with it. :)
stefan brunthaler, 30.08.2011 22:41:
Ok, then there's something else you haven't told us. Are you saying that the original (old) bytecode is still used (and hence written to and read from .pyc files)?
Short answer: yes. Long answer: I added an invocation counter to the code object and keep interpreting in the usual Python interpreter until this counter reaches a configurable threshold. When it reaches this threshold, I create the new instruction format and interpret with this optimized representation. All the macros look exactly the same in the source code, they are just redefined to use the different instruction format. I am at no point serializing this representation or the runtime information gathered by me, as any subsequent invocation might have different characteristics.
So, basically, you built a JIT compiler but don't want to call it that, right? Just because it compiles byte code to other byte code rather than to native CPU instructions does not mean it doesn't compile Just In Time. That actually sounds like a nice feature in general. It could even replace (or accompany?) the existing peephole optimiser as part of a more general optimisation architecture, in the sense that it could apply byte code optimisations at runtime rather than compile time, potentially based on better knowledge about what's actually going on.
I will remove my development commentaries and create a private repository at bitbucket
I agree with the others that it's best to open up your repository for everyone who is interested. I can see no reason why you would want to close it back down once it's there. Stefan
So, basically, you built a JIT compiler but don't want to call it that, right? Just because it compiles byte code to other byte code rather than to native CPU instructions does not mean it doesn't compile Just In Time.
For me, a definition of a JIT compiler or any dynamic compilation subsystem entails that native machine code is generated at run-time. Furthermore, I am not compiling from bytecode to bytecode, but rather changing the instruction encoding underneath and subsequently using quickening to optimize interpretation. But, OTOH, I am not aware of a canonical definition of JIT compilation, so it depends ;)
I agree with the others that it's best to open up your repository for everyone who is interested. I can see no reason why you would want to close it back down once it's there.
Well, my code has primarily been a vehicle for my research in that area and thus is not immediately suited to adoption (it does not adhere to Python C coding standards, contains lots of private comments about various facts, debugging hints, etc.). The explanation for this is easy: When I started out on my research it was far from clear that it would be successful and really that much faster. So, I would like to clean up the comments and some parts of the code and publish the code I have without any of the clean-up work for naming conventions, etc., so that you can all take a look and it is clear what it's all about. After that we can then have a factual discussion about whether it fits the bill for you, too, and if so, which changes (naming conventions, extensive documentation, etc.) are necessary *before* any adoption is reasonable for you, too. That seems to be a good way to start off and get results and feedback quickly, any ideas/complaints/comments/suggestions? Best regards, --stefan PS: I am using Nick's suggested plan to incorporate my changes directly to the most recent version, as mine is currently only running on Python 3.1.
On Wed, Aug 31, 2011 at 10:08 AM, stefan brunthaler
Well, my code has primarily been a vehicle for my research in that area and thus is not immediately suited to adoption [...].
But if you want to be taken seriously as a researcher, you should publish your code! Without publication of your *code* research in your area cannot be reproduced by others, so it is not science. Please stop being shy and open up what you have. The software engineering issues can be dealt with separately! -- --Guido van Rossum (python.org/~guido)
On 8/30/2011 4:41 PM, stefan brunthaler wrote:
Ok, then there's something else you haven't told us. Are you saying that the original (old) bytecode is still used (and hence written to and read from .pyc files)?
Short answer: yes. Long answer: I added an invocation counter to the code object and keep interpreting in the usual Python interpreter until this counter reaches a configurable threshold. When it reaches this threshold, I create the new instruction format and interpret with this optimized representation. All the macros look exactly the same in the source code, they are just redefined to use the different instruction format. I am at no point serializing this representation or the runtime information gathered by me, as any subsequent invocation might have different characteristics. When the switchover to the new instruction format happens, what happens to sys.settrace() tracing? Will it report the same sequence of line numbers? For a small but important class of program executions, this is more important than speed.
--Ned.
Best, --stefan
2011/9/1 Ned Batchelder
When the switchover to the new instruction format happens, what happens to sys.settrace() tracing? Will it report the same sequence of line numbers? For a small but important class of program executions, this is more important than speed.
--Ned
A simple solution: when tracing is enabled, the new instruction format will never be executed (and information tracking disabled as well). Regards, Cesare
Cesare Di Mauro wrote:
2011/9/1 Ned Batchelder
mailto:ned@nedbatchelder.com> When the switchover to the new instruction format happens, what happens to sys.settrace() tracing? Will it report the same sequence of line numbers? For a small but important class of program executions, this is more important than speed.
--Ned
A simple solution: when tracing is enabled, the new instruction format will never be executed (and information tracking disabled as well).
What happens if tracing is enabled *during* the execution of the new instruction format? Some sort of deoptimisation will be required in order to recover the correct VM state. Cheers, Mark.
Regards, Cesare
------------------------------------------------------------------------
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/mark%40hotpy.org
2011/9/1 Mark Shannon
Cesare Di Mauro wrote:
2011/9/1 Ned Batchelder
> When the switchover to the new instruction format happens, what happens to sys.settrace() tracing? Will it report the same sequence of line numbers? For a small but important class of program executions, this is more important than speed.
--Ned
A simple solution: when tracing is enabled, the new instruction format will never be executed (and information tracking disabled as well).
What happens if tracing is enabled *during* the execution of the new instruction format? Some sort of deoptimisation will be required in order to recover the correct VM state.
Cheers, Mark.
Sure. I don't think that the regular ceval.c loop will be "dropped" when executing the new instruction format, so we can "intercept" a change like this using the "why" variable, for example, or something similar that is normally used to break the regular loop execution. Anyway, we need to take a look at the code. Cheers, Cesare
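[One possible shape of such an interception, as a hedged sketch: check the thread state's trace/profile hooks (the c_tracefunc/c_profilefunc fields are the ones ceval.c itself consults) before and at safe points during execution of the optimized representation, and bail out to the regular bytecode loop when they are set, so sys.settrace sees the usual sequence of events. How and where to bail out safely is exactly the deoptimization question Mark raises; the sketch only shows the guard.]

    #include <Python.h>

    /* Is a trace or profile hook currently installed on this thread? */
    static int tracing_active(PyThreadState *tstate)
    {
        return tstate->c_tracefunc != NULL || tstate->c_profilefunc != NULL;
    }

    /* Conceptually, at safe points in the optimized dispatch loop:
     *
     *     if (tracing_active(tstate)) {
     *         map the current position back to the equivalent bytecode
     *         offset and continue in the regular loop, so the trace
     *         function observes the normal sequence of line events
     *     }
     */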
On Sep 1, 2011, at 5:23 AM, Cesare Di Mauro wrote:
A simple solution: when tracing is enabled, the new instruction format will never be executed (and information tracking disabled as well).
Correct me if I'm wrong: doesn't this mean that no profiler will accurately be able to measure the performance impact of the new instruction format, and therefore one may get incorrect data when one is trying to make a CPU optimization for real-world performance?
On Thu, Sep 1, 2011 at 10:15 AM, Glyph Lefkowitz
On Sep 1, 2011, at 5:23 AM, Cesare Di Mauro wrote:
A simple solution: when tracing is enabled, the new instruction format will never be executed (and information tracking disabled as well).
Correct me if I'm wrong: doesn't this mean that no profiler will accurately be able to measure the performance impact of the new instruction format, and therefore one may get incorrect data when one is trying to make a CPU optimization for real-world performance?
Well, profilers already skew results by adding call overhead. But tracing for debugging and profiling don't do exactly the same thing: debug tracing stops at every line, but profiling only executes hooks at the start and end of a function(*). So I think the function body could still be executed using the new format (assuming this is turned on/off per code object anyway). (*) And whenever a generator yields or is resumed. I consider that an annoying bug though, just as the debugger doesn't do the right thing with yield -- there's no way to continue until the yielding generator is resumed short of setting a manual breakpoint on the next line. -- --Guido van Rossum (python.org/~guido)
Hi, as promised, I created a publicly available preview of an implementation with my optimizations, which is available under the following location: https://bitbucket.org/py3_pio/preview/wiki/Home I followed Nick's advice and added an overview/introduction at the wiki page the link points to; I am positive that spending 10 minutes reading this will provide you with valuable information regarding what's happening. In addition, as Guido already mentioned, this is more or less a direct copy of my research branch without some of my private comments and *no* additional refactorings because of software-engineering issues (which I am very much aware of.) I hope this clarifies a *lot* and makes it easier to see what parts are involved and how all the pieces fit together. I hope you'll like it, have fun, --stefan
On Fri, Sep 2, 2011 at 2:37 PM, stefan brunthaler
I hope this clarifies a *lot* and makes it easier to see what parts are involved and how all the pieces fit together.
It does, thanks. There are likely to be some fun corner cases relating to trace functions and use of the "locals()" builtin, but now the code has been published hopefully those interested will be able to dig in and provide some more detailed feedback. (Not me, though - I've already dropped some things from my original personal to-do list for 3.3, so I'm not keen to start adding any more). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
stefan brunthaler, 02.09.2011 06:37:
as promised, I created a publicly available preview of an implementation with my optimizations, which is available under the following location: https://bitbucket.org/py3_pio/preview/wiki/Home
I followed Nick's advice and added an overview/introduction at the wiki page the link points to; I am positive that spending 10 minutes reading this will provide you with valuable information regarding what's happening.
It does, thanks. A couple of remarks:

1) The SFC optimisation is purely based on static code analysis, right? I assume it takes loops into account (and just multiplies scores for inner loops)? Is that what you mean with "nesting level"? Obviously, static analysis can sometimes be misleading, e.g. when there's a rare special case with lots of loops that needs to adapt input data in some way, but in general, I'd expect that this heuristic would tend to hit the important cases, especially for well structured code with short functions.

2) The RC elimination is tricky to get right and thus somewhat dangerous, but sounds worthwhile and should work particularly well on a stack based byte code interpreter like CPython.

3) Inline caching also sounds worthwhile, although I wonder how large the savings will be here. You'd save a couple of indirect jumps at the C-API level, sure, but apart from that, my guess is that it would highly depend on the type of instruction. Certain (repeated) calls to C implemented functions would likely benefit quite a bit, for example, which would be a nice optimisation by itself, e.g. for builtins. I would expect that the same applies to iterators, even a couple of percent faster iteration can make a great deal of a difference, and a substantial set of iterators are implemented in C, e.g. itertools, range, zip and friends. I'm not so sure about arithmetic operations. In Cython, we (currently?) do not optimistically replace these with more specific code (unless we know the types at compile time), because it complicates the generated C code and indirect jumps aren't all that slow that the benefit would be important. Savings are *much* higher when data can be unboxed, so much that the slight improvement for optimistic type guesses is totally dwarfed in Cython. I would expect that the return on investment is better when the types are actually known at runtime, as in your case.

4) Regarding inlined object references, I would expect that it's much more worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST. I guess that this would be best helped by watching the module dict and the builtin dict internally and invalidating the interpreter state after changes (e.g. by providing a change counter in those dicts and checking that in the instructions that access them), and otherwise keeping the objects cached. Simply watching the dedicated instructions that change that state isn't enough as Python allows code to change these dicts directly through their dict interface.

All in all, your list does sound like an interesting set of changes that are both understandable and worthwhile. Stefan
1) The SFC optimisation is purely based on static code analysis, right? I assume it takes loops into account (and just multiplies scores for inner loops)? Is that what you mean with "nesting level"? Obviously, static analysis can sometimes be misleading, e.g. when there's a rare special case with lots of loops that needs to adapt input data in some way, but in general, I'd expect that this heuristic would tend to hit the important cases, especially for well structured code with short functions.
Yes, currently I only use the heuristic to statically estimate the utility of assigning an optimized slot to a local variable. And, another yes, nested blocks (like for-statements) are what I have in mind when using "nesting level". I was told that the algorithm itself is very similar to linear scan register allocation, modulo the ability to spill values, of course. From my benchmarks and in-depth analysis of several programs, I found this to work very well. In fact, the only situation I found is (unfortunately) one of the top-most executed functions in US' bm_django.py: there is one loop that almost never gets executed, but this loop gives precedence to the local variables used inside it. Because of this, I already have an idea for a better approach: first, use the static heuristic to compute the stack slot score, then count back-branches (I would need this anyway, as _Py_CheckInterval is gone and OSR/hot-swapping is in general a good idea) and record their frequency. Next, just replace the current static weight of 100 with the dynamically recorded weight. Consequently, you should get better allocations. (Please note that I did some quantitative analysis of Python functions to determine that using 4 SFC slots covers a substantial number of functions [IIRC >95%] in the trivial scenario where there are at most 4 local variables.)
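[As a toy version of the scoring heuristic sketched above: every use of a local adds a weight that grows with the nesting depth of the enclosing block, and the highest-scoring locals get the dedicated instructions. The weight of 100 per nesting level and the limit of four slots follow the description in the mail; the code itself is purely illustrative.]

    #include <stdlib.h>

    #define NUM_SFC_SLOTS 4
    #define LOOP_WEIGHT   100   /* static guess; to be replaced by measured
                                   back-branch frequencies later on */

    typedef struct {
        int  index;    /* which local variable       */
        long score;    /* accumulated, weighted uses */
    } local_score;

    static void record_use(local_score *scores, int local_index, int nesting_level)
    {
        long w = 1;
        while (nesting_level-- > 0)
            w *= LOOP_WEIGHT;            /* uses inside inner loops dominate */
        scores[local_index].score += w;
    }

    static int cmp_desc(const void *a, const void *b)
    {
        const local_score *x = a, *y = b;
        return (y->score > x->score) - (y->score < x->score);
    }

    /* After scoring a function, sort descending; the first NUM_SFC_SLOTS
     * entries (if that many locals exist) get the dedicated instructions. */
    static void assign_slots(local_score *scores, int nlocals)
    {
        qsort(scores, (size_t)nlocals, sizeof(*scores), cmp_desc);
    }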
2) The RC elimination is tricky to get right and thus somewhat dangerous, but sounds worthwhile and should work particularly well on a stack based byte code interpreter like CPython.
Well, it was very tricky to get right when I implemented it first around Christmas 2009. The current implementation is reasonably simple to understand, however, it depends on the function refcount_effect to give me correct information at all times. I got the biggest performance improvement on one benchmark on the PowerPC and think that RISC architectures in general benefit more from this optimization (eliminate the load, add and store instructions) than x86 CISCs do (an INCREF is just an add on the memory location without data dependencies, so fairly cheap). In any case, however, you get the replication effect of improving CPU branch prediction by having these additional instruction derivatives. It would be interesting (research-wise, too) to be able to measure whether the reduction in memory operations makes Python programs use less energy, and if so, how much the difference is.
3) Inline caching also sounds worthwhile, although I wonder how large the savings will be here. You'd save a couple of indirect jumps at the C-API level, sure, but apart from that, my guess is that it would highly depend on the type of instruction. Certain (repeated) calls to C implemented functions would likely benefit quite a bit, for example, which would be a nice optimisation by itself, e.g. for builtins. I would expect that the same applies to iterators, even a couple of percent faster iteration can make a great deal of a difference, and a substantial set of iterators are implemented in C, e.g. itertools, range, zip and friends.
I'm not so sure about arithmetic operations. In Cython, we (currently?) do not optimistically replace these with more specific code (unless we know the types at compile time), because it complicates the generated C code and indirect jumps aren't all that slow that the benefit would be important. Savings are *much* higher when data can be unboxed, so much that the slight improvement for optimistic type guesses is totally dwarfed in Cython. I would expect that the return of investment is better when the types are actually known at runtime, as in your case.
Well, in my thesis I already hint at another improvement of the existing design that can work on unboxed data as well (while still being an interpreter.) I am eager to try this, but don't know how much time I can spend on this (because there are several other research projects I am actively involved in.) In my experience, this works very well and you cannot actually report good speedups without inline-caching arithmetic operations, simply because that's where all JITs shine and most benchmarks don't reflect real world scenarios but mathematics-inclined microbenchmarks. Also, if in the future compilers (gcc and clang) are able to inline the invoked functions, higher speedups will be possible.
4) Regarding inlined object references, I would expect that it's much more worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST. I guess that this would be best helped by watching the module dict and the builtin dict internally and invalidating the interpreter state after changes (e.g. by providing a change counter in those dicts and checking that in the instructions that access them), and otherwise keeping the objects cached. Simply watching the dedicated instructions that change that state isn't enough as Python allows code to change these dicts directly through their dict interface.
Ok, I thought about something along these lines, too, but in the end, decided to go with the current design, as it is easy and language neutral (for my research I primarily chose Python as a demonstration vehicle and none of these techniques is specific to Python.) LOAD_GLOBAL pays off handsomely, and I think that I could easily make it correct for all cases, if I knew the places that need to call "invalidate_cache". Most of the LOAD_CONST instructions can be replaced with the inlined version (INCA_LOAD_CONST), and while I did not run any benchmarks on this alone, simply because they are very frequently executed, even small optimizations pay off nicely. Another point is that you can slim down the activation record of PyEval_EvalFrameEx, because you don't need to use the "consts" field anymore (similarly, you could probably eliminate the "names" and "fastlocals" fields, if you find that most of the frequent and fast cases are covered by the optimized instructions.)
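[To make the change-counter idea from point 4 concrete, here is a toy sketch; the versioned dict and cache entry types are hypothetical, not CPython's actual layouts. The module and builtin dicts carry a counter bumped on every mutation, and a cached LOAD_GLOBAL result stays valid only while both recorded counters still match, which removes the need to find every place that would otherwise have to call invalidate_cache explicitly.]

    #include <stdint.h>
    #include <stddef.h>

    typedef struct versioned_dict {
        uint64_t version;    /* bumped by every insert/update/delete */
        /* ... the actual hash table ... */
    } versioned_dict;

    typedef struct global_cache_entry {
        uint64_t globals_version;
        uint64_t builtins_version;
        void    *cached_value;          /* object found by the last lookup */
    } global_cache_entry;

    static void *cached_load_global(global_cache_entry *e,
                                    versioned_dict *globals,
                                    versioned_dict *builtins)
    {
        if (e->cached_value != NULL &&
            e->globals_version == globals->version &&
            e->builtins_version == builtins->version)
            return e->cached_value;     /* fast path: no dict lookups */
        return NULL;                    /* miss: do the normal lookup, refill */
    }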
All in all, your list does sound like an interesting set of changes that are both understandable and worthwhile.
Thanks, I think so, too, which is why I wanted to integrate the optimizations with CPython in the first place. Thanks for the pointers to the dict stuff, I will take a look (IIRC, Antoine pointed me in the same direction last year, but I think the design was slightly different then), --stefan
stefan brunthaler, 02.09.2011 17:55:
4) Regarding inlined object references, I would expect that it's much more worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST. I guess that this would be best helped by watching the module dict and the builtin dict internally and invalidating the interpreter state after changes (e.g. by providing a change counter in those dicts and checking that in the instructions that access them), and otherwise keeping the objects cached. Simply watching the dedicated instructions that change that state isn't enough as Python allows code to change these dicts directly through their dict interface. [...] Thanks for the pointers to the dict stuff, I will take a look (IIRC, Antoine pointed me in the same direction last year, but I think the design was slightly different then),
Not unlikely, Antoine tends to know the internals pretty well. The Cython project has been (hand wavingly) thinking about this also: implement our own module type with its own __setattr__ (and dict proxy) in order to speed up access to the globals in the *very* likely case that they rarely or never change after module initialisation time and that most critical code accesses them read-only from within functions. If it turns out that this makes sense for CPython in general, it wouldn't be a bad idea to join forces at some point in order to make this readily usable for both sides. Stefan
as promised, I created a publicly available preview of an implementation with my optimizations, which is available under the following location: https://bitbucket.org/py3_pio/preview/wiki/Home
One very important thing that I forgot was to indicate that you have to use computed gotos (i.e., "configure --with-computed-gotos"), otherwise it won't work (though I think that most people can figure this out easily, knowing this a priori isn't too bad.) Regards, --stefan
Am 30.08.2011 20:34, schrieb stefan brunthaler:
Um, I'm sorry, but that reply sounds incredibly naive, like you're not really sure what the on-disk format for .pyc files is or why it would matter. You're not even answering the question, except indirectly -- it seems that you've never even thought about the possibility of generating a .pyc file on one platform and copying it to a computer using a different one.
Well, it may sound incredibly naive, but the truth is: I am never storing the optimized representation to disk, it's done purely at runtime when profiling tells me it makes sense to make the switch. Thus I circumvent many of the problems outlined by you. So I am positive that a full fledged change of the representation has many more intricacies to it, but my approach is only tangentially related...
You know, instead of all these half-explanations, giving us access to the code would shut us up much more effectively. Don't worry about not passing tests, this is what the official trunk does half of the time ;) Georg
2011/8/30 stefan brunthaler
Do I sense that the bytecode format is no longer platform-independent? That will need a bit of discussion. I bet there are some things around that depend on that.
Hm, I haven't really thought about that in detail or for very long; I ran it on PowerPC 970 and Intel Atom & i7 without problems (the latter ones are a non-issue) and think that it can be portable. I just stuff argument and opcode into one word for regular instruction decoding like a RISC CPU, and I realize there might be little/big endian issues, but they surely can be conditionally compiled...
--stefan
I think that you must deal with big endianness because some RISC processors can't handle data in little-endian format at all. In WPython I have written some macros which handle both endiannesses, but lacking big-endian machines I never had the opportunity to verify if something was wrong. Regards Cesare
I think that you must deal with big endianness because some RISC processors can't handle data in little-endian format at all.
In WPython I have written some macros which handle both endiannesses, but lacking big-endian machines I never had the opportunity to verify if something was wrong.
I am sorry for not getting back to this directly yesterday; we were just heading out for lunch, and I only figured it out then but immediately forgot it on our way back to the lab... So, as I have already said, I evaluated my optimizations on x86 (little-endian) and PowerPC 970 (big-endian) and I did not have to change any of my instruction decoding during interpretation. (The only nasty bug I still remember vividly was that while on gcc for x86 the data type char defaults to signed, whereas it defaults to unsigned on PowerPC's gcc.) When I have time and access to a PowerPC machine again (an ARM might be interesting, too), I will take a look at the generated assembly code to figure out why this is working. (I have some ideas why it might work without changing the code.) If I run into any problems, I'll gladly contact you :) BTW: AFAIR, we emailed last year regarding wpython and IIRC your optimizations could primarily be summarized as clever superinstructions. I have not implemented anything in that area at all (and have in fact not even touched the compiler and its peephole optimizer), but if parts my implementation gets in, I am sure that you could add some of your work on top of that, too. Cheers, --stefan
2011/8/31 stefan brunthaler
I think that you must deal with big endianness because some RISC processors can't handle data in little-endian format at all.
In WPython I have written some macros which handle both endiannesses, but lacking big-endian machines I never had the opportunity to verify if something was wrong.
I am sorry for not getting back to this directly yesterday; we were just heading out for lunch, and I only figured it out then but immediately forgot it on our way back to the lab...
So, as I have already said, I evaluated my optimizations on x86 (little-endian) and PowerPC 970 (big-endian) and I did not have to change any of my instruction decoding during interpretation. (The only nasty bug I still remember vividly was that while on gcc for x86 the data type char defaults to signed, whereas it defaults to unsigned on PowerPC's gcc.) When I have time and access to a PowerPC machine again (an ARM might be interesting, too), I will take a look at the generated assembly code to figure out why this is working. (I have some ideas why it might work without changing the code.)
If I run into any problems, I'll gladly contact you :)
BTW: AFAIR, we emailed last year regarding wpython and IIRC your optimizations could primarily be summarized as clever superinstructions. I have not implemented anything in that area at all (and have in fact not even touched the compiler and its peephole optimizer), but if parts my implementation gets in, I am sure that you could add some of your work on top of that, too.
Cheers, --stefan
You're right. I took a look at our old e-mails, and I found more details about your work. It's definitely not affected by processor endianness, so you don't need any check: it just works, because you'll produce the new opcodes in memory, and consume them in memory as well. Looking at your examples, I think that WPython wordcode usage can be useful only for the simplest ones. That's because superinstructions group together several actions that need to be split again into simpler ones by a tracing JIT/compiler like yours, if you want to keep it simple. You said that you added about 400 specialized instructions last year with the usual bytecodes, but wordcodes will require quite a few more (this can compromise performance on CPUs with small data caches). So I think that it'll be better to finish your work, with all tests passing, before thinking about adding something on top (that, for me, sounds like a machine code JIT O:-) Regards, Cesare
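[Cesare's point can be illustrated with a short sketch, not taken from either project: as long as the word is always built and taken apart with shifts and masks (rather than by aliasing it with a byte array or serializing it to disk), the same C code yields the same logical fields on little- and big-endian machines, which would explain why the PowerPC runs needed no changes.]

    #include <stdint.h>

    /* Build and decode the in-memory word purely arithmetically. */
    static uint64_t pack(uint32_t opcode, uint32_t oparg)
    {
        return (uint64_t)opcode | ((uint64_t)oparg << 32);
    }

    static void unpack(uint64_t word, uint32_t *opcode, uint32_t *oparg)
    {
        *opcode = (uint32_t)(word & 0xffffffffu);
        *oparg  = (uint32_t)(word >> 32);
        /* Endianness would only matter if the word were written to disk
         * or reinterpreted through a char pointer or union alias. */
    }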
On 8/30/2011 2:12 PM, Guido van Rossum wrote:
On Tue, Aug 30, 2011 at 10:50 AM, stefan brunthaler
wrote: Do you really need it to match a machine word? Or is, say, a 16-bit format sufficient.
Hm, technically no, but practically it makes more sense, as (at least for x86 architectures) having opargs and opcodes in half-words can be efficiently expressed in assembly. On 64bit architectures, I could also inline data object references that fit into the 32bit upper half. It turns out that most constant objects fit nicely into this, and I have used this for a special cache region (again below 2^32) for global objects, too. So, technically it's not necessary, but practically it makes a lot of sense. (Most of these things work on 32bit systems, too. For architectures with a smaller size, we can adapt or disable the optimizations.)
Do I sense that the bytecode format is no longer platform-independent? That will need a bit of discussion. I bet there are some things around that depend on that.
I find myself more comfortable with Cesare Di Mauro's idea of expanding to 16 bits as the code unit. His basic idea was using 2, 4, or 6 bytes instead of 1, 3, or 6. It actually tended to save space because many ops with small ints (which are very common) contract from 3 bytes to 2 bytes or from 9(?) (two instructions) to 6. I am sorry he was not able to follow up on the initial promising results. The dis output was probably easier to read than the current output. Perhaps he made a mistake in combining the above idea with a shift from stack to hybrid stack+register design. -- Terry Jan Reedy
2011/8/31 Terry Reedy
I find myself more comfortable with Cesare Di Mauro's idea of expanding to 16 bits as the code unit. His basic idea was using 2, 4, or 6 bytes instead of 1, 3, or 6.
It can be expanded to longer than 6 bytes opcodes, if needed. The format is designed to be flexible enough to accommodate such changes without pains.
It actually tended to save space because many ops with small ints (which are very common) contract from 3 bytes to 2 bytes or from 9(?) (two instructions) to 6.
It can pack up to 4 (old) opcodes into one wordcode (superinstruction). Wordcodes are designed to favor instruction "grouping".
I am sorry he was not able to follow up on the initial promising results.
In a few words: lack of interest. Why spend (so much) time on a project when you see that the community is oriented towards other directions (Unladen Swallow at first, PyPy in the last period, given the substantial drop of the former)? Also, Guido seems to dislike what he regards as "hacks", and never showed interest. In WPython 1.1 I "rolled back" the "hack" that I introduced in PyObject types (a couple of fields) in 1.0 alpha, to make the code more "polished" (but with a noticeable drop in performance). But again, I saw no interest in WPython, so I decided to put a stop to it, blocking my initial idea to go for Python 3.
The dis output was probably easier to read than the current output.
Perhaps he made a mistake in combining the above idea with a shift from stack to hybrid stack+register design.
-- Terry Jan Reedy
As I already said, wordcodes are designed to favor "grouping", so it was quite natural for the VM to become a "hybrid" one. Anyway, both space and performance benefited from this wordcode "property". ;)
Regards Cesare
2011/8/30 stefan brunthaler
Yes, indeed I have a more straightforward instruction format to allow for more efficient decoding. Just going from bytecode size to word-code size without changing the instruction format is going to require 8 (or word-size) times more memory on a 64bit system. From an optimization perspective, the irregular instruction format was the biggest problem, because checking for HAS_ARG is always on the fast path and mostly unpredictable. Hence, I chose to extend the instruction format to word size and use the additional space so that the upper half holds the argument and the lower half the actual opcode. The encoding is more regular, decoding is more efficient, and it is *not* more complex. Because I use profiling to indicate which code is hot, I don't waste too much memory on this regular instruction format.
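(To make the HAS_ARG point concrete, here is a toy comparison of the two decode styles, with a 32-bit unit standing in for the word-sized format; the threshold, opcode numbers and layout are invented and do not reflect the actual implementation.)

#include <stdint.h>
#include <stdio.h>

#define HAVE_ARGUMENT 90                      /* illustrative threshold */
#define HAS_ARG(op)   ((op) >= HAVE_ARGUMENT)

/* Classic-style decode (simplified): the HAS_ARG branch sits on the
   fast path of every single instruction fetch. */
static void decode_bytecode(const uint8_t *p, const uint8_t *end)
{
    while (p < end) {
        unsigned opcode = *p++;
        unsigned oparg = 0;
        if (HAS_ARG(opcode)) {                /* often hard to predict */
            oparg = (unsigned)p[0] | ((unsigned)p[1] << 8);
            p += 2;
        }
        printf("op=%u arg=%u\n", opcode, oparg);
    }
}

/* Regular decode: every instruction has the same layout (here 32 bits
   for brevity), so there is nothing to test before reading the arg. */
static void decode_wordcode(const uint32_t *p, const uint32_t *end)
{
    while (p < end) {
        uint32_t w = *p++;
        printf("op=%u arg=%u\n", (unsigned)(w & 0xFFFFu), (unsigned)(w >> 16));
    }
}

int main(void)
{
    uint8_t  bc[] = { 9, 100, 5, 0 };         /* one arg-less op, one with arg 5 */
    uint32_t wc[] = { 9, 100u | (5u << 16) }; /* the same two instructions */

    decode_bytecode(bc, bc + sizeof bc);
    decode_wordcode(wc, wc + 2);
    return 0;
}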
Regards, --stefan
That seems exactly the WPython approach, although I used the new "wordcode" in place of the old bytecode. Take a look at it. ;) Regards Cesare
2011/8/30 Antoine Pitrou
Changing the bytecode width wouldn't make the interpreter more complex.
It depends on the kind of changes. :) WPython introduced a very different "intermediate code" representation that required a big change to the peephole optimizer in the 1.0 alpha version. In 1.1 final I decided to move that code entirely into ast.c (mostly for constant folding) and compiler.c (for the usual peephole usage: looking for "patterns" to substitute with better ones), because I found it simpler and more convenient. In the end, leaving aside some new optimizations that I implemented along the way, the interpreter code is a bit more complex.
Some years ago we were waiting for Unladen Swallow to improve itself and be ported to Python 3. Now it seems we are waiting for PyPy to be ported to Python 3. I'm not sure how "let's just wait" is a good trade-off if someone proposes interesting patches (which, of course, remains to be seen).
Regards
Antoine.
It isn't, because the motivation to do something new with CPython vanishes, at least in some areas (virtual machine / ceval.c), even when you have ideas to experiment with. That's why in my last talk at EuroPython I decided to move to other areas (Python objects). Regards Cesare
On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro
It isn't, because the motivation to do something new with CPython vanishes, at least in some areas (virtual machine / ceval.c), even when you have ideas to experiment with. That's why in my last talk at EuroPython I decided to move to other areas (Python objects).
Cesare, I'm really sorry that you became so disillusioned that you abandoned wordcode. I agree that we were too optimistic about Unladen Swallow. Also that the existence of PyPy and its PR machine (:-) should not stop us from improving CPython. I'm wondering if, with your experience in creating WPython, you could review Stefan Brunthaler's code and approach (once he's put it up for review) and possibly the two of you could even work on a joint project? -- --Guido van Rossum (python.org/~guido)
2011/8/31 Guido van Rossum
On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro
wrote: It isn't, because the motivation to do something new with CPython vanishes, at least in some areas (virtual machine / ceval.c), even when you have ideas to experiment with. That's why in my last talk at EuroPython I decided to move to other areas (Python objects).
Cesare, I'm really sorry that you became so disillusioned that you abandoned wordcode. I agree that we were too optimistic about Unladen Swallow. Also that the existence of PyPy and its PR machine (:-) should not stop us from improving CPython.
I never stopped thinking about new optimizations. A lot can be done with CPython, even without resorting to something like a JIT et al.
I'm wondering if, with your experience in creating WPython, you could review Stefan Brunthaler's code and approach (once he's put it up for review) and possibly the two of you could even work on a joint project?
-- --Guido van Rossum (python.org/~guido)
Yes, I can. I'll wait for Stefan to update his source (to at least Python 3.2), as he intends to do, and for everything to be published, in order to review the code. I also agree with you that right now it doesn't need to look state-of-the-art. First make it work, then make it nicer. ;) Regards, Cesare
On Thu, Sep 1, 2011 at 3:28 AM, Guido van Rossum
On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro Cesare, I'm really sorry that you became so disillusioned that you abandoned wordcode. I agree that we were too optimistic about Unladen Swallow. Also that the existence of PyPy and its PR machine (:-) should not stop us from improving CPython.
Yep, and I'll try to do a better job of discouraging creeping complexity (without adequate payoffs) without the harmful side effect of discouraging experimentation with CPython performance improvements in general. It's massive "rewrite the world" changes that don't adequately account for all the ways CPython gets used, or for the fact that core devs need to be able to effectively *review* the changes, that are unlikely to ever get anywhere. More localised changes, or those that are relatively easy to explain, have a much better chance. So I'll switch my tone to just trying to make sure that portability and maintainability concerns are given due weight :) Cheers, Nick. P.S. I suspect a big part of my attitude stems from the fact that we're still trying to untangle some of the consequences of committing the PEP 3118 new buffer API implementation with inadequate review (it turns out the implementation didn't reflect the PEP, and the PEP had deficiencies of its own), and I was one of the ones advocating in favour of that patch. Once bitten, twice shy, etc. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan wrote:
Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space.
Me, too! I like that I can read the CPython source and understand what it's doing most of the time. Please don't screw that up by attempting to perform heroic optimisations. -- Greg
On Tue, Aug 30, 2011 at 08:57, Greg Ewing
Nick Coghlan wrote:
Personally, I *like* CPython fitting into the "simple-and-portable"
niche in the Python interpreter space.
Me, too! I like that I can read the CPython source and understand what it's doing most of the time. Please don't screw that up by attempting to perform heroic optimisations.
--
Following this argument to the extreme, the bytecode evaluation code of CPython can be simplified quite a bit. Lose 2x performance but gain a lot of readability. Does that sound like a good deal? I don't intend to sound sarcastic, just to show that IMHO this argument isn't a good one. I think that even cleverly optimized code can be properly written and *documented* to make the task of understanding it feasible. Personally, I'd love CPython to be a bit faster and see no reason to give up optimization opportunities for the sake of code readability. Eli
On Tue, Aug 30, 2011 at 4:22 PM, Eli Bendersky
On Tue, Aug 30, 2011 at 08:57, Greg Ewing
wrote: Following this argument to the extreme, the bytecode evaluation code of CPython can be simplified quite a bit. Lose 2x performance but gain a lot of readability. Does that sound like a good deal? I don't intend to sound sarcastic, just show that IMHO this argument isn't a good one. I think that even clever optimized code can be properly written and *documented* to make the task of understanding it feasible. Personally, I'd love CPython to be a bit faster and see no reason to give up optimization opportunities for the sake of code readability.
Yeah, it's definitely a trade-off - the point I was trying to make is that there *is* a trade-off being made between complexity and speed. I think the computed-gotos stuff struck a nice balance - the macro-fu involved means that you can still understand what the main eval loop is *doing*, even if you don't know exactly what's hidden behind the target macros. Ditto for the older opcode prediction feature and the peephole optimiser - separation of concerns means that you can understand the overall flow of events without needing to understand every little detail. This is where the request to extract individual orthogonal changes and submit separate patches comes from - it makes it clear that the independent changes *can* be separated cleanly, and aren't a giant ball of incomprehensible mud. It's the difference between complex (lots of moving parts, that can each be understood on their own and are then composed into a meaningful whole) and complicated (massive patches that don't work at all if any one component is delayed) Eugene Toder's AST optimiser work that I still hope to get into 3.3 will have to undergo a similar process - the current patch covers a bit too much ground and needs to be broken up into smaller steps before we can seriously consider pushing it into the core. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
2011/8/30 Nick Coghlan
Yeah, it's definitely a trade-off - the point I was trying to make is that there *is* a trade-off being made between complexity and speed.
I think the computed-gotos stuff struck a nice balance - the macro-fu involved means that you can still understand what the main eval loop is *doing*, even if you don't know exactly what's hidden behind the target macros. Ditto for the older opcode prediction feature and the peephole optimiser - separation of concerns means that you can understand the overall flow of events without needing to understand every little detail.
This is where the request to extract individual orthogonal changes and submit separate patches comes from - it makes it clear that the independent changes *can* be separated cleanly, and aren't a giant ball of incomprehensible mud. It's the difference between complex (lots of moving parts, that can each be understood on their own and are then composed into a meaningful whole) and complicated (massive patches that don't work at all if any one component is delayed)
Eugene Toder's AST optimiser work that I still hope to get into 3.3 will have to undergo a similar process - the current patch covers a bit too much ground and needs to be broken up into smaller steps before we can seriously consider pushing it into the core.
Regards, Nick.
Sometimes it cannot be done, because big changes produce big patches as well. I don't see a problem here if the code is well written (as "required" by the Python community :) and the developer is available to talk about his work and clear up any doubts. Regards Cesare
Nick Coghlan, 30.08.2011 02:00:
On Tue, Aug 30, 2011 at 7:14 AM, Antoine Pitrou wrote:
On Mon, 29 Aug 2011 11:33:14 -0700 stefan brunthaler wrote:
* The optimized dispatch routine has a changed instruction format (word-sized instead of bytecodes) that allows for regular instruction decoding (without the HAS_ARG-check) and inlinind of some objects in the instruction format on 64bit architectures.
Having a word-sized "bytecode" format would probably be acceptable in itself, so if you want to submit a patch for that, go ahead.
Although any such patch should discuss how it compares with Cesare's work on wpython.
Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space. Armin Rigo made the judgment years ago that CPython was a poor platform for serious optimisation when he stopped working on Psyco and started PyPy instead, and I think the contrasting fates of PyPy and Unladen Swallow have borne out that opinion. Significantly increasing the complexity of CPython for speed-ups that are dwarfed by those available through PyPy seems like a poor trade-off to me.
If Stefan can cut his changes down into smaller feature chunks, thus making their benefit reproducible and verifiable by others, it's well worth reconsidering, one patch at a time, whether even a visible increase in complexity is worth the improved performance. Even if PyPy's performance tops the improvements, it's worth remembering that that's also a very different kind of system than CPython, with different resource requirements and a different level of maturity, compatibility, portability, etc. There are many reasons to continue using CPython, not only in corners, and there are many people who would be happy about a faster CPython. Raising the bar has its virtues. That being said, I also second Nick's reference to wpython. If CPython grows its bytecode size anyway (which, as I understand, is one part of the proposed changes), it's worth looking at wpython first, given that it has been around and working for a while. The other proposed changes sound like at least some of them are independent of this one. Stefan
Nick Coghlan wrote:
On Tue, Aug 30, 2011 at 7:14 AM, Antoine Pitrou
wrote: On Mon, 29 Aug 2011 11:33:14 -0700 stefan brunthaler
wrote: * The optimized dispatch routine has a changed instruction format (word-sized instead of bytecodes) that allows for regular instruction decoding (without the HAS_ARG-check) and inlinind of some objects in the instruction format on 64bit architectures. Having a word-sized "bytecode" format would probably be acceptable in itself, so if you want to submit a patch for that, go ahead.
Although any such patch should discuss how it compares with Cesare's work on wpython.
Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space.
CPython has a large number of micro-optimisations scattered all over the code base. By removing these and adding large-scale optimisations, like Stefan's, the code base *might* actually get smaller overall (and thus simpler) *and* faster. Of course, CPython must remain portable. [snip]
At a bare minimum, I don't think any significant changes should be made under the "it will be faster" justification until the bulk of the real-world benchmark suite used for speed.pypy.org is available for Python 3. (Wasn't there a GSoC project about that?)
+1 Cheers, Mark.
So, the two big issues aside, is there any interest in incorporating these optimizations in Python 3?
The question really is whether this is an all-or-nothing deal. If you could identify smaller parts that can be applied independently, interest would be higher. Also, I'd be curious whether your techniques help or hinder a potential integration of a JIT generator. Regards, Martin
The question really is whether this is an all-or-nothing deal. If you could identify smaller parts that can be applied independently, interest would be higher.
Well, it's not an all-or-nothing deal. In my current architecture, I can selectively enable most of the optimizations as I see fit. The only pre-requisite (in my implementation) is that I have two dispatch loops with a changed instruction format. It is, however, not a technical necessity, just the way I implemented it. Basically, you can choose whatever you like best, and I could extract that part. I am just offering to add all the things that I have done :)
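(A rough sketch of the general idea of profiling-driven switching between two dispatch routines; the counter, threshold and structure below are invented for illustration and are not Stefan's code.)

#include <stdio.h>

#define HOT_THRESHOLD 3

struct code { int exec_counter; int quickened; };

static void eval_optimized(struct code *co)
{
    /* ...would dispatch over the word-sized, quickened instructions... */
    (void)co;
    printf("optimized dispatch\n");
}

static void eval_regular(struct code *co)
{
    if (!co->quickened && ++co->exec_counter > HOT_THRESHOLD)
        co->quickened = 1;        /* re-encode the bytecode into the wide format */

    if (co->quickened) {
        eval_optimized(co);       /* hand over to the optimized dispatch loop */
        return;
    }
    /* ...would dispatch over the ordinary bytecode... */
    printf("regular dispatch\n");
}

int main(void)
{
    struct code co = {0, 0};
    for (int i = 0; i < 5; i++)
        eval_regular(&co);        /* switches to the optimized loop once hot */
    return 0;
}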
Also, I'd be curious whether your techniques help or hinder a potential integration of a JIT generator.
This is something I have frequently discussed with several JIT people in the past. IMHO, having my optimizations in place also helps a JIT compiler, since it can re-use the information I gathered to generate more aggressively optimized native machine code right away (the inline caches can be generated with the type information right away, some functions could be inlined with the guard statements subsumed, etc.) Another benefit could be that the JIT compiler can spend more time on generating code, because the interpreter is already faster (so in some cases it would probably not make sense to include a non-optimizing, fast and simple JIT compiler). There are others on the list who probably can/want to comment on this, too. That aside, I think that while having a JIT is an important goal, I can very well imagine scenarios where the additional memory consumption (for the generated native machine code) of a JIT for each process (I assume that the native machine code caches are not shared) hinders scalability. I have in fact no data to back this up, but I think it would be an interesting trade-off: say, a 30% gain in performance without substantial additional memory requirements on existing hardware, compared to higher achievable speedups that require more machines. Regards, --stefan
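(A toy sketch of the quickening/inline-caching idea being discussed: a generic instruction records the operand types it observes and rewrites itself into a specialized one, guarded so it can fall back. Opcode names and structures are invented; this is not CPython's or Stefan's code, just the shape of the technique, and the recorded type is the kind of information a later JIT could reuse.)

#include <stdio.h>

enum opcode { ADD_GENERIC, ADD_FLOAT };
enum type   { T_FLOAT, T_OTHER };

struct instr { enum opcode op; };
struct value { enum type t; double f; };

static struct value add_generic(struct instr *ip, struct value a, struct value b)
{
    if (a.t == T_FLOAT && b.t == T_FLOAT) {
        ip->op = ADD_FLOAT;                  /* quicken: cache the type feedback */
        return (struct value){T_FLOAT, a.f + b.f};
    }
    /* fall back to the slow, fully generic path */
    return (struct value){T_OTHER, 0.0};
}

static struct value add_float(struct instr *ip, struct value a, struct value b)
{
    if (a.t != T_FLOAT || b.t != T_FLOAT) {  /* guard: de-specialize on mismatch */
        ip->op = ADD_GENERIC;
        return add_generic(ip, a, b);
    }
    return (struct value){T_FLOAT, a.f + b.f};
}

int main(void)
{
    struct instr ins = {ADD_GENERIC};
    struct value x = {T_FLOAT, 1.5}, y = {T_FLOAT, 2.5};

    struct value r1 = add_generic(&ins, x, y);   /* first run: specializes */
    struct value r2 = add_float(&ins, x, y);     /* later runs: fast path  */

    printf("%f %f (opcode now %d)\n", r1.f, r2.f, ins.op);
    return 0;
}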
On Mon, Aug 29, 2011 at 2:05 PM, stefan brunthaler
The question really is whether this is an all-or-nothing deal. If you could identify smaller parts that can be applied independently, interest would be higher.
Well, it's not an all-or-nothing deal. In my current architecture, I can selectively enable most of the optimizations as I see fit. The only pre-requisite (in my implementation) is that I have two dispatch loops with a changed instruction format. It is, however, not a technical necessity, just the way I implemented it. Basically, you can choose whatever you like best, and I could extract that part. I am just offering to add all the things that I have done :)
+1 from me on going forward with your performance improvements. The more you can break them down into individual smaller patch sets the better, as they can be reviewed and applied as needed: a prerequisites patch, a patch for the wide opcodes, etc. For benchmarks, given that this is Python 3, just get as many useful ones running as you can. Some in this thread seemed to give the impression that CPython performance is not something to care about. I disagree. I see CPython being the main implementation of Python used in most places for a long time. Improving its performance merely raises the bar to be met by other implementations if they want to compete. That is a good thing! -gps
Also, I'd be curious whether your techniques help or hinder a potential integration of a JIT generator.
This is something I have previously frequently discussed with several JIT people. IMHO, having my optimizations in-place also helps a JIT compiler, since it can re-use the information I gathered to generate more aggressively optimized native machine code right away (the inline caches can be generated with the type information right away, some functions could be inlined with the guard statements subsumed, etc.) Another benefit could be that the JIT compiler can spend longer time on generating code, because the interpreter is already faster (so in some cases it would probably not make sense to include a non-optimizing fast and simple JIT compiler). There are others on the list, who probably can/want to comment on this, too.
That aside, I think that while having a JIT is an important goal, I can very well imagine scenarios where the additional memory consumption (for the generated native machine code) of a JIT for each process (I assume that the native machine code caches are not shared) hinders scalability. I have in fact no data to back this up, but I think that would be an interesting trade off, say if I have 30% gain in performance without substantial additional memory requirements on my existing hardware, compared to higher achievable speedups that require more machines, though.
Regards, --stefan
On Tue, Aug 30, 2011 at 12:38 PM, Gregory P. Smith
Some in this thread seemed to give the impression that CPython performance is not something to care about. I disagree. I see CPython being the main implementation of Python used in most places for a long time. Improving its performance merely raises the bar to be met by other implementations if they want to compete. That is a good thing!
Not the impression I intended to give. I merely want to highlight that we need to be careful that incremental increases in complexity are justified with real, measured performance improvements. PyPy has set the bar on how to do that - people that seriously want to make CPython faster need to focus on getting speed.python.org sorted *first* (so we know where we're starting) and *then* work on trying to improve CPython's numbers relative to that starting point. The PSF has the hardware to run the site but, unless more has been going on in the background than I am aware of, is still lacking trusted volunteers to do the following:
1. Getting codespeed up and running on the PSF hardware
2. Hooking it into the CPython source control infrastructure
3. Getting a reasonable set of benchmarks running on 3.x (likely starting with the already ported set in Mercurial, but eventually we want the full suite that PyPy uses)
4. Once PyPy, Jython and IronPython offer 3.x compatible versions, start including them as well (alternatively, offer 2.x performance comparisons as well, although that's less interesting from a CPython point of view since it can't be used to guide future CPython optimisation efforts)
Anecdotal, non-reproducible performance figures are *not* the way to go about serious optimisation efforts. Using a dedicated machine is vulnerable to architecture-specific idiosyncrasies, but ad hoc testing on other systems can still be used as a sanity check. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Tue, 30 Aug 2011 13:29:59 +1000
Nick Coghlan
Anecdotal, non-reproducible performance figures are *not* the way to go about serious optimisation efforts.
What about anecdotal *and* reproducible performance figures? :) I may be half-joking, but we already have a set of py3k-compatible benchmarks and, besides, sometimes a timeit invocation gives a good idea of whether an approach is fruitful or not. While a permanent public reference with historical tracking of performance figures is even better, let's not freeze everything until it's ready. (for example, do we need to wait for speed.python.org before PEP 393 is accepted?) Regards Antoine.
On Tue, Aug 30, 2011 at 9:38 PM, Antoine Pitrou
On Tue, 30 Aug 2011 13:29:59 +1000 Nick Coghlan
wrote: Anecdotal, non-reproducible performance figures are *not* the way to go about serious optimisation efforts.
What about anecdotal *and* reproducible performance figures? :) I may be half-joking, but we already have a set of py3k-compatible benchmarks and, besides, sometimes a timeit invocation gives a good idea of whether an approach is fruitful or not. While a permanent public reference with historical tracking of performance figures is even better, let's not freeze everything until it's ready. (for example, do we need to wait for speed.python.org before PEP 393 is accepted?)
Yeah, I'd neglected the idea of just running perf.py for pre- and post-patch performance comparisons. You're right that that can generate sufficient info to make a well-informed decision. I'd still really like it if some of the people advocating that we care about CPython performance actually volunteered to spearhead the effort to get speed.python.org up and running, though. As far as I know, the hardware's spinning idly waiting to be given work to do :P Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Aug 30, 2011, at 9:05 AM, Nick Coghlan
On Tue, Aug 30, 2011 at 9:38 PM, Antoine Pitrou
wrote: On Tue, 30 Aug 2011 13:29:59 +1000 Nick Coghlan
wrote: Anecdotal, non-reproducible performance figures are *not* the way to go about serious optimisation efforts.
What about anecdotal *and* reproducible performance figures? :) I may be half-joking, but we already have a set of py3k-compatible benchmarks and, besides, sometimes a timeit invocation gives a good idea of whether an approach is fruitful or not. While a permanent public reference with historical tracking of performance figures is even better, let's not freeze everything until it's ready. (for example, do we need to wait for speed.python.org before PEP 393 is accepted?)
Yeah, I'd neglected the idea of just running perf.py for pre- and post-patch performance comparisons. You're right that that can generate sufficient info to make a well-informed decision.
I'd still really like it if some of the people advocating that we care about CPython performance actually volunteered to spearhead the effort to get speed.python.org up and running, though. As far as I know, the hardware's spinning idly waiting to be given work to do :P
Cheers, Nick.
Discussion of speed.python.org should happen on the mailing list for that project if possible.
On Wed, Aug 31, 2011 at 9:21 AM, Jesse Noller
Discussion of speed.python.org should happen on the mailing list for that project if possible.
Hah, that's how out of the loop I am on that front - I didn't even know there *was* a mailing list for it :) Subscribed! Cheers, Nick. P.S. For anyone else that is interested: http://mail.python.org/mailman/listinfo/speed -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Martin v. Löwis wrote:
So, the two big issues aside, is there any interest in incorporating these optimizations in Python 3?
The question really is whether this is an all-or-nothing deal. If you could identify smaller parts that can be applied independently, interest would be higher.
Also, I'd be curious whether your techniques help or hinder a potential integration of a JIT generator.
A JIT compiler is not a silver bullet; translation to machine code is just one of many optimisations performed by PyPy. A compiler merely removes interpretative overhead, at the cost of significantly increased code size, whereas Stefan's work attacks both interpreter overhead and some of the inefficiencies due to dynamic typing. If Unladen Swallow achieved anything, it was to demonstrate that a JIT alone does not work well. My (experimental) HotPy VM has similar base-line speed to CPython, yet is able to outperform Unladen Swallow using interpreter-only optimisations. (It goes even faster with the compiler turned on :) ) Cheers, Mark.
Regards, Martin
Does it speed up Python? :-) Could you provide numbers (benchmarks)?
Yes, it does ;) The maximum overall speedup I achieved was by a factor of 2.42 on my i7-920 for the spectralnorm benchmark of the Computer Language Benchmarks Game. Others from the same set are:
binarytrees: 1.9257 (1.9891)
fannkuch: 1.6509 (1.7264)
fasta: 1.5446 (1.7161)
mandelbrot: 2.0040 (2.1847)
nbody: 1.6165 (1.7602)
spectralnorm: 2.2538 (2.4176)
---
overall: 1.8213 (1.9382)
(The first number is the combination of all optimizations, the one in parentheses is with my last optimization [Interpreter Instruction Scheduling] enabled, too.) For a comparative real-world benchmark I tested Martin von Loewis' django port (there are not that many meaningful Python 3 real-world benchmarks) and got a speedup of 1.3 (without IIS). This is reasonably good; US got a speedup of 1.35 on this benchmark. I just checked that pypy-c-latest on 64 bit reports 1.5 (the pypy-c-jit-latest figures seem to be not working currently or *really* fast...), but I cannot tell directly how that relates to speedups (it just says "less is better" and I did not quickly find an explanation). Since I did this benchmark last year, I have spent more time investigating it and found that I could do better, but I would have to guess as to how much. (An interesting aside though: on this benchmark, the executable never grew beyond 5 MB of memory usage, exactly like the vanilla Python 3 interpreter.) hth, --stefan
For a comparative real world benchmark I tested Martin von Loewis' django port (there are not that many meaningful Python 3 real world benchmarks) and got a speedup of 1.3 (without IIS). This is reasonably well, US got a speedup of 1.35 on this benchmark. I just checked that pypy-c-latest on 64 bit reports 1.5 (the pypy-c-jit-latest figures seem to be not working currently or *really* fast...), but I cannot tell directly how that relates to speedups (it just says "less is better" and I did not quickly find an explanation). Since I did this benchmark last year, I have spent more time investigating this benchmark and found that I could do better, but I would have to guess as to how much (An interesting aside though: on this benchmark, the executable never grew on more than 5 megs of memory usage, exactly like the vanilla Python 3 interpreter.)
PyPy is ~12x faster on the django benchmark FYI
Maciej Fijalkowski, 02.09.2011 20:42:
For a comparative real world benchmark I tested Martin von Loewis' django port (there are not that many meaningful Python 3 real world benchmarks) and got a speedup of 1.3 (without IIS). This is reasonably well, US got a speedup of 1.35 on this benchmark. I just checked that pypy-c-latest on 64 bit reports 1.5 (the pypy-c-jit-latest figures seem to be not working currently or *really* fast...), but I cannot tell directly how that relates to speedups (it just says "less is better" and I did not quickly find an explanation).
PyPy is ~12x faster on the django benchmark FYI
FYI, there's a recent thread up on the pypy ML where someone is complaining about PyPy being substantially slower than CPython when running Django on top of SQLite. Also note that PyPy doesn't implement Py3 yet, so the benchmark results are not comparable anyway. As usual, benchmark results depend on what you do in your benchmarks. Stefan
On Fri, Sep 2, 2011 at 9:20 PM, Stefan Behnel
Maciej Fijalkowski, 02.09.2011 20:42:
For a comparative real world benchmark I tested Martin von Loewis' django port (there are not that many meaningful Python 3 real world benchmarks) and got a speedup of 1.3 (without IIS). This is reasonably well, US got a speedup of 1.35 on this benchmark. I just checked that pypy-c-latest on 64 bit reports 1.5 (the pypy-c-jit-latest figures seem to be not working currently or *really* fast...), but I cannot tell directly how that relates to speedups (it just says "less is better" and I did not quickly find an explanation).
PyPy is ~12x faster on the django benchmark FYI
FYI, there's a recent thread up on the pypy ML where someone is complaining about PyPy being substantially slower than CPython when running Django on top of SQLite. Also note that PyPy doesn't implement Py3 yet, so the benchmark results are not comparable anyway.
Yes, sqlite is slow. It's also much faster in trunk than in 1.6 and there is an open ticket about it :) The "django" benchmark is just templating, so it does not involve a database.
participants (20)
- "Martin v. Löwis"
- Antoine Pitrou
- Benjamin Peterson
- Cesare Di Mauro
- Eli Bendersky
- Georg Brandl
- Glyph Lefkowitz
- Greg Ewing
- Gregory P. Smith
- Guido van Rossum
- Jesse Noller
- Maciej Fijalkowski
- Mark Shannon
- Ned Batchelder
- Nick Coghlan
- Stefan Behnel
- stefan brunthaler
- stefan brunthaler
- Terry Reedy
- Victor Stinner