I tried adding a variety of new instructions to the PVM, initially with a code compression goal for the bytecodes, and later with a performance goal.

Definitions:

    USING_LOAD_FAST_N   accesses to locals with an index < 16 using a
                        one-byte instruction (no oparg)
    USING_LOAD_CONST_N  accesses to consts with an index < 16 using a
                        one-byte instruction (no oparg)
    USING_STORE_FAST_N  stores to locals with an index < 16 using a
                        one-byte instruction (no oparg)
    USING_SHORT_CMP     compare ops using a one-byte instruction (no oparg)

PyStone score for best of 10 runs:

    unmodified 2.3a2                                            22200
    using enum (compacting the opcode numeric space using
        an enum instead of #defines)                            22200
    USING_LOAD_FAST_N                                           22700
    USING_LOAD_CONST_N                                          22400
    USING_STORE_FAST_N                                          22400
    USING_LOAD_FAST_N, USING_LOAD_CONST_N                       22350
    USING_LOAD_FAST_N, USING_STORE_FAST_N                       22000
    USING_LOAD_FAST_N, USING_LOAD_CONST_N, USING_STORE_FAST_N   22200
    USING_SHORT_CMP                                             21500
    USING_LOAD_FAST_N, USING_LOAD_CONST_N, USING_STORE_FAST_N,
        USING_SHORT_CMP                                         22000

Conclusions:

While reducing the size of compiled bytecodes by about 1%, the proposed modifications at best increase performance by 2%, and at worst reduce performance by 3%. Enabling all of the proposed opcodes results in a 1% performance loss. In general, it would seem that adding opcodes in bulk, even if many opcodes switch to the same labels, results in a minor performance loss.

Running PyStone under Windows results in a fairly large variation in results.

A zip file containing the source files I modified can be found at http://www.bitfurnace.com/python/modified-source.zip. If someone would like to try this code on their systems, I would be grateful to know what kind of results they achieve. The various proposed opcodes are controlled by a set of #defines in the file opcode.h.

Next steps:

The results of my static analysis indicate that the indices used on LOAD_FAST, LOAD_CONST and STORE_FAST are almost always small. There may be some benefit to optimising these instructions to use single-byte opargs.

The results of my static and dynamic analysis indicate that the (COMPARE_OP, JUMP_IF_FALSE, POP_TOP) pattern is heavily used. I'm looking at what changes would need to be made to the compiler to remove the need for this sequence of instructions.
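For concreteness, here is a minimal, self-contained sketch of the encoding idea behind the *_N opcodes: the index is folded into the opcode number itself, so the dispatcher recovers it by subtraction instead of fetching an oparg byte. The TOY_* names, the run() loop and the little frame of locals below are invented for illustration; this is not the code in the zip file above.

    #include <stdio.h>

    /* Hypothetical toy opcode set.  TOY_LOAD_FAST_0..TOY_LOAD_FAST_15
       occupy a contiguous block so the index can be recovered by
       subtraction; TOY_LOAD_FAST keeps the general two-byte form. */
    enum {
        TOY_STOP = 0,
        TOY_LOAD_FAST,          /* followed by a one-byte index */
        TOY_PRINT_TOP,
        TOY_LOAD_FAST_0 = 16,   /* index encoded in the opcode, no oparg */
        TOY_LOAD_FAST_15 = TOY_LOAD_FAST_0 + 15
    };

    static void run(const unsigned char *code, const int *locals)
    {
        int stack[32], *sp = stack;
        for (;;) {
            unsigned char opcode = *code++;
            if (opcode >= TOY_LOAD_FAST_0 && opcode <= TOY_LOAD_FAST_15) {
                *sp++ = locals[opcode - TOY_LOAD_FAST_0];  /* no oparg fetch */
                continue;
            }
            switch (opcode) {
            case TOY_LOAD_FAST:
                *sp++ = locals[*code++];   /* two-byte form: opcode + index */
                break;
            case TOY_PRINT_TOP:
                printf("%d\n", *--sp);
                break;
            case TOY_STOP:
                return;
            }
        }
    }

    int main(void)
    {
        int locals[4] = { 10, 20, 30, 40 };
        unsigned char code[] = {
            TOY_LOAD_FAST_0 + 2, TOY_PRINT_TOP,   /* one byte loads local #2 */
            TOY_LOAD_FAST, 3, TOY_PRINT_TOP,      /* two bytes load local #3 */
            TOY_STOP
        };
        run(code, locals);
        return 0;
    }

This prints 30 and then 40; the one-byte form simply saves the oparg fetch and decode for the sixteen lowest indices.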
At 1:54 PM -0500 2/27/03, Damien Morton wrote:
In general, it would seem that adding opcodes in bulk, even if many opcodes switch to the same labels, results in a minor performance loss.
While I'm somewhat loathe to help you guys go faster (that whole pie thing) you might want to take a look at commonly used pairs of opcodes and reordering the code in the switch body so pairs are adjacent and more likely to be in cache. -- Dan --------------------------------------"it's like this"------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
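A rough sketch of the opcode-pair (digram) counting Dan is alluding to, i.e. the data you would need before reordering cases for cache adjacency. Nothing here is from ceval.c: the record() helper, the trace array and the opcode numbers in it are stand-ins for instrumentation wired into the real dispatch loop (Skip's DXP statistics, mentioned further down the thread, are this kind of data gathered from real runs).

    #include <stdio.h>

    #define NOPCODES 256

    /* pairfreq[a][b] counts how often opcode b executes right after opcode a. */
    static long pairfreq[NOPCODES][NOPCODES];

    /* In a real build this would be one extra line per dispatched opcode. */
    static void record(int *prev, int opcode)
    {
        if (*prev >= 0)
            pairfreq[*prev][opcode]++;
        *prev = opcode;
    }

    int main(void)
    {
        /* Stand-in for a dynamic opcode trace; the numbers are arbitrary. */
        static const unsigned char trace[] =
            { 4, 4, 7, 5, 4, 9, 3, 1, 4, 4, 7, 5 };
        int prev = -1;
        int a, b;
        size_t i;

        for (i = 0; i < sizeof trace; i++)
            record(&prev, trace[i]);

        /* The hot pairs are the ones worth making adjacent in the switch. */
        for (a = 0; a < NOPCODES; a++)
            for (b = 0; b < NOPCODES; b++)
                if (pairfreq[a][b] > 1)
                    printf("opcode %d -> opcode %d : %ld times\n",
                           a, b, pairfreq[a][b]);
        return 0;
    }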
"Dan Sugalski" <dan@sidhe.org>
At 1:54 PM -0500 2/27/03, Damien Morton wrote:
In general, it would seem that adding opcodes in bulk, even if many opcodes switch to the same labels, results in a minor performance loss.
While I'm somewhat loathe to help you guys go faster (that whole pie thing) you might want to take a look at commonly used pairs of opcodes and reordering the code in the switch body so pairs are adjacent and more likely to be in cache.
I caught the whole pie conversation, but I missed how you were going to test Parrot against Python. How exactly will this competition be measured?
At 5:14 PM -0500 2/27/03, Damien Morton wrote:
"Dan Sugalski" <dan@sidhe.org>
At 1:54 PM -0500 2/27/03, Damien Morton wrote:
In general, it would seem that adding opcodes in bulk, even if many opcodes switch to the same labels, results in a minor performance loss.
While I'm somewhat loathe to help you guys go faster (that whole pie thing) you might want to take a look at commonly used pairs of opcodes and reordering the code in the switch body so pairs are adjacent and more likely to be in cache.
I caught the whole pie conversation, but I missed how you were going to test Parrot against Python.
How exactly will this competition be measured?
We've still got to hash out the details, but in december someone'll generate a bytecode file for the program(s) we're going to be running. We both get to run converters/optimizers over them if we choose. Then at the 2004 OSCON we'll run the converted programs and whichever takes less time (we've not set whether it's wall or CPU time, though I'm tempted to go for CPU so neither gets penalized for the odd background task that might be running) wins. The official details, limits, and pie flavors will likely be set at this summer's OSCON. -- Dan --------------------------------------"it's like this"------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
"Dan Sugalski" <dan@sidhe.org> wrote in message news:a05200f00ba84455b2bc8@[63.120.19.221]...
We've still got to hash out the details, but in december someone'll generate a bytecode file for the program(s) we're going to be running. We both get to run converters/optimizers over them if we choose.
I find this a bit surprising. The problems of teaching to the test and programming to the benchmark are well-known. I presume the extreme of 'optimizing' the bytecode down to a single byte (that calls a custom compiled-to-machine-code function) will be disallowed. But I wonder how and where you plan to draw the line between that and out-of-the-box behavior.
Then at the 2004 OSCON we'll run the converted programs and
With enough 'conversion', the result could be seen as no longer being a compilation of standard Python, but of a variant with type declarations and pragmas added and certain dynamic features deleted. In any case, this seems to make the contest one more of special-case optimization and less of general-case running time, and therefore of less relevance to users who want *their* programs to run faster with an out-of-the-box interpreter.
whichever takes less time (we've not set whether it's wall or CPU time, though I'm tempted to go for CPU so neither gets penalized for the odd background task that might be running) wins.
[Terry Reedy]
... The problems of teaching to the test and programming to the benchmark are well-known. ... Your contest, my questions and comments.
Not to worry -- I made my living for 15 years writing optimizing compilers in the highly competitive (at the time -- it competed itself to death) supercomputer biz. I've forgotten more dirty tricks than even a Perl guy can pick up in a lifetime <wink>.

One of my favorites: one of the rules in a particular competitive benchmark was that editing the source code was not allowed. Everyone had to run the source file verbatim from the prospective customer. One of Cray's competitors (and, no, I'm not making this up) implemented a simple sed-like subsystem driven by text on the compiler *command line*, which effectively allowed them to edit the source code without physically changing the file. They used this to dump all sorts of compiler directives into the benchmark code, and even to replace a particular critical loop nest with a call to a highly optimized library subroutine. Their results were quite impressive.

But ours were even better: general global analysis deduced that the computed results weren't used for anything, so we optimized away all the expensive loops. Our results weren't all to the good, though: the program ran so fast then that the "operations per second" output at the end overflowed to a negative integer. Heh.

Believe it or not, customers were once so naive that they actually looked at benchmark results as if they measured something relevant to their business <wink>.

beauty-is-a-more-objective-measure-ly y'rs - tim
At 10:21 PM -0500 2/27/03, Tim Peters wrote:
[Terry Reedy]
... The problems of teaching to the test and programming to the benchmark are well-known. ... Your contest, my questions and comments.
Not to worry -- I made my living for 15 years writing optimizing compilers in the highly competitive (at the time -- it competed itself to death) supercomputer biz. I've forgotten more dirty tricks than even a Perl guy can pick up in a lifetime <wink>.
Yes, Tim, but the question isn't how much you've forgotten. It's how much you still *remember*... :-) I expect you to keep us honest, so I expect I'd better unconditionally buy you a beer so as to maintain your neutrality. ;) -- Dan --------------------------------------"it's like this"------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
Terry Reedy wrote:
In any case, this seems to make the contest one more of special-case optimization and less of general-case running time, and therefore of less relevance to users who want *their* programs to run faster with an out-of-the-box interpreter.
Indeed. I suspect, though, that if the test program exercises a sizable chunk of Python's functionality, we won't see a Parrot-based implementation that can run it. 2004 is a long time away, time enough for the PyPy effort to bear fruit. Guido, when it comes time to write the benchmark program, please get in touch with me; I'd like to help produce a realistic test program. --amk (www.amk.ca) Orchestra: Anagram of carthorse. -- Peter Greenaway, _Rosa: The Death of a Composer_
Why wait? In the last year of lurking on this list, I've seen requests for a good python benchmark no less than 4 times - the most recent being damien morton's attempt to prove/disprove his optimizations. Having an "approved" good benchmark/realistic test program would make it easy to validate optimizations, and head off the consistent 'pystone is not a realistic benchmark' arguments that come up each time.... Besides, it will take 6 months just to agree to a basic framework, and another 6 months to work around all the "competitive optimization tricks" timbot has up his sleeve... just-looking-at-it-realistically, - Dan On Friday, February 28, 2003, at 05:57 AM, A.M. Kuchling wrote:
Guido, when it comes time to write the benchmark program, please get in touch with me; I'd like to help produce a realistic test program.
[Dan Wolfe]
In the last year of lurking on this list, I've seen requests for a good python benchmark no less than 4 times - the most recent being damien morton's attempt to prove/disprove his optimizations.
Having an "approved" good benchmark/realistic test program would make it easy to validate optimizations, and head off the consistent 'pystone is not a realistic benchmark' arguments that come up each time....
pystone is a very good benchmark for one thing: testing the "general speed" of the interpreter. Perhaps because it *is* so atypical, it's hard to do something that gives pystone a significant speed boost. Rewriting the eval loop several years ago managed to do that, and ruthlessly cutting slop out of the dict implementation gave it an 8% boost more recently. I can't recall any other single thing that helped pystone as much as those.

Jim Fulton claims that pystone is a good predictor of Zope speed on a new box, and now that I know more about Zope than I used to, I believe that: while Zope may look like Python code, there are so many meta-tricks being played under the covers that it's plausible that the only thing that really matters is how fast you can get around the eval loop.

Anyway, several years ago I offered to collect and organize a set of "typical" benchmarks. Nobody responded, so that turned out to be a lot easier than I thought it would be <wink>.
Besides, it will take 6 months just to agree to a basic framework, and another 6 months to work around all the "competitive optimization tricks" timbot has up his sleeve...
You can't help it. If you know the code in advance, the implementation *will* get warped to favor it. The best you can hope for is that warping won't be done at the expense of other code. For example, if you decide to reorder the eval loop case statements, and use pystone as your measure of goodness, you'll end up with a different order than if you use test_descr.py as your measure. Is that cheating? I suppose it depends on who's doing it <wink>.
From: "Tim Peters" <tim.one@comcast.net>
[Dan Wolfe]
In the last year of lurking on this list, I've seen requests for a good python benchmark no less than 4 times - the most recent being damien morton's attempt to prove/disprove his optimizations.
Having an "approved" good benchmark/realistic test program would make it easy to validate optimizations, and head off the consistent 'pystone is not a realistic benchmark' arguments that come up each time....
pystone is a very good benchmark for one thing: testing the "general speed" of the interpreter. Perhaps because it *is* so atypical, it's hard to do something that gives pystone a significant speed boost
I've been working with Damien to make sure the improvements are not pystone specific. We've run against my highly optimized matrix code, against pybench, and against another one of my programs which heavily exercises a broad range of python tools. Overall, his improvements have helped across the board. I think his latest and greatest should be accepted unless there is a maintainability hit. However, the core concept and code seem clean enough to me. FWIW, Raymond Hettinger
At 8:57 AM -0500 2/28/03, A.M. Kuchling wrote:
Terry Reedy wrote:
In any case, this seems to make the contest one more of special-case optimization and less of general-case running time, and therefore of less relevance to users who want *their* programs to run faster with an out-of-the-box interpreter.
Indeed. I suspect, though, that if the test program exercises a sizable chunk of Python's functionality, we won't see a Parrot-based implementation that can run it. 2004 is a long time away, time enough for the PyPy effort to bear fruit.
The challenge is core interpreter speed, so if the challenge programs use that and Parrot can't cope, I lose and pie will be flung. If the PyPy stuff gives a good turbo boost to the python core, great, that'll make it more challenging. -- Dan --------------------------------------"it's like this"------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
At 9:59 PM -0500 2/27/03, Terry Reedy wrote:
"Dan Sugalski" <dan@sidhe.org> wrote in message news:a05200f00ba84455b2bc8@[63.120.19.221]...
We've still got to hash out the details, but in december someone'll generate a bytecode file for the program(s) we're going to be running. We both get to run converters/optimizers over them if we choose.
I find this a bit surprising. The problems of teaching to the test and programming to the benchmark are well-known. I presume the extreme of 'optimizing' the bytecode down to a single byte (that calls a custom compiled-to-machine-code function) will be disallowed. But I wonder how and where you plan to draw the line between that and out-of-the-box behavior.
I find your trust in my personal integrity refreshing. Putting that aside, Guido isn't particularly interested in looking like a fool in public, and we've already agreed that the dirty tricks that are often used won't be, since that's not the point. Make no mistake, I fully plan on winning, as resoundingly as I can manage. I'm not going to cheat, though, and I don't want there to be any doubt on either side as to the results. There'll be at least full disclosure on my part as to what the parrot bytecode I execute looks like. While I can't speak for Guido (and we've not talked about this bit) I expect he'll do the same.
Then at the 2004 OSCON we'll run the converted programs and
With enough 'conversion', the result could be seen as no longer being a compilation of standard Python, but of a variant with type declarations and pragmas added and certain dynamic features deleted. In any case, this seems to make the contest one more of special-case optimization and less of general-case running time, and therefore of less relevance to users who want *their* programs to run faster with an out-of-the-box interpreter.
What it means is that I don't have to build a python bytecode loader into the stock parrot, or weld the Python parser into it. It also means that if Guido rolls out some sort of Miracle Technology or major bytecode revision between the time the challenge code is generated and the time we run it, he can run a transition program on it. Works both ways. I for one don't plan on doing much past writing a simple stack model to register model conversion program, though I might give a run with the code in pure stack mode just for chuckles. Don't overestimate how much effort I'm going to put into this, please.
whichever takes less time (we've not set whether it's wall or CPU time, though I'm tempted to go for CPU so neither gets penalized for the odd background task that might be running) wins.
From my user viewpoint, I am more interested in total time from source input to finish.
We'll measure both, I'm sure. The bigger question is whether it's fair to be penalized by other things running in the background, though I expect we'll do multiple runs and average or something to account for that. Still, it'd suck to lose because a cron job fired up and snagged 20% of the CPU and saturated the disk channel during the test run, so wall time may be collected but not count. We'll see.
The official details, limits, and pie flavors will likely be set at this summer's OSCON.
Will you allow runtime compilation (as with psyco)? How about pre-contest dynamic compilation? (I believe Armin plans to add a save/restore feature). What if psyco-like dynamic compilation is a 'stock' feature of one system and not the other?
If it's in the stock python interpreter or parrot interpreter, it's fair game. (We've agreed that latest CVS checkout versions are acceptable) If by the time the contest rolls around you have some form of JIT technology, great. If Parrot's got JIT versions of the python ops, good for us. If one of us does and the other doesn't, well... the results may be somewhat lopsided. Or not, as JITs certainly aren't a panacea. -- Dan --------------------------------------"it's like this"------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
Damien Morton wrote:
I tried adding a variety of new instructions to the PVM, initially with a code compression goal for the bytecodes, and later with a performance goal. ... Conclusions:
While reducing the size of compiled bytecodes by about 1%, the proposed modifications at best increase performance by 2%, and at worst reduce performance by 3%.
Enabling all of the proposed opcodes results in a 1% performance loss.
In general, it would seem that adding opcodes in bulk, even if many opcodes switch to the same labels, results in a minor performance loss.
The general problem with the ceval switch statement is that it is too big. Adding new opcodes will only make it bigger, so I doubt that much can be gained in general by trying to come up with new do-everything-in-one-opcode cases.

Back in the 1.5.2 days I played a lot with the ceval loop and the best results I got came from:

a) moving the LOAD_FAST opcode out of the switch

b) splitting the switch statement in two: the first one for more commonly used opcodes, the second one for less often used opcodes

c) ordering cases in the switch statements by usage frequency (using average opcode usage frequencies obtained by instrumenting the interpreter)

The last point is probably compiler dependent. GCC has the tendency to use the same layout for the assembler code as you use in the C source code, so placing often used code close to the top results in better locality (at least on my machines).

-- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Feb 27 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
Python UK 2003, Oxford: 33 days left EuroPython 2003, Charleroi, Belgium: 117 days left
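To make point (c) concrete: once per-opcode counts exist, the case ordering can be produced mechanically. The sketch below is not the tool MAL or Damien actually used -- the counts are invented and only a handful of real opcode names appear -- it only shows the sort-and-emit step a layout generator would perform.

    #include <stdio.h>
    #include <stdlib.h>

    #define NOPCODES 8   /* kept tiny for the sketch */

    /* Real opcode names, made-up frequencies standing in for measured counts. */
    static const char *names[NOPCODES] = {
        "STOP_CODE", "POP_TOP", "BINARY_ADD", "LOAD_CONST",
        "LOAD_FAST", "STORE_FAST", "COMPARE_OP", "JUMP_IF_FALSE"
    };
    static long freq[NOPCODES] = { 1, 900, 400, 1500, 2300, 800, 600, 650 };

    static int order[NOPCODES];

    static int by_freq_desc(const void *a, const void *b)
    {
        long fa = freq[*(const int *)a], fb = freq[*(const int *)b];
        return (fa < fb) - (fa > fb);   /* hottest first */
    }

    int main(void)
    {
        int i;
        for (i = 0; i < NOPCODES; i++)
            order[i] = i;
        qsort(order, NOPCODES, sizeof order[0], by_freq_desc);

        /* Emit the case labels hottest-first; a generator could paste the
           real case bodies into ceval.c in this order. */
        for (i = 0; i < NOPCODES; i++)
            printf("case %s:  /* executed %ld times */\n",
                   names[order[i]], freq[order[i]]);
        return 0;
    }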
"M.-A. Lemburg" <mal@lemburg.com> wrote
Damien Morton wrote:
Conclusions:
While reducing the size of compiled bytecodes by about 1%, the proposed modifications at best increase performance by 2%, and at worst reduce performance by 3%.
Enabling all of the proposed opcodes results in a 1% performance loss.
In general, it would seem that adding opcodes in bulk, even if many opcodes switch to the same labels, results in a minor performance loss.
The general problem with the ceval switch statement is that it is too big. Adding new opcodes will only make it bigger, so I doubt that much can be gained in general by trying to come up with new do-everything-in-one-opcode cases.
Each of the LOAD_FAST_N, LOAD_CONST_N, etc. opcodes I added contributed only 1 line of code to the inner loop:

    LOAD_FAST_0:
    ...
    LOAD_FAST_15:
        oparg = opcode - LOAD_FAST_0
    LOAD_FAST:
        <body of load_fast>
        break

It is the growth of the switch jumptable that I suspect caused the slowdown.

I'm told that, under MSVC, the switch is implemented as two tables - the first a table of bytes, and the second a table of addresses. If that's the case, adding non-code-bearing opcodes that direct to the same label should only increase the switch jumptables by 1 byte for each opcode added. If not, then the switch jumptables would increase in size by 4 bytes for each opcode added. Either way, there doesn't seem to be any advantage in this approach.
From: M.-A. Lemburg [mailto:mal@lemburg.com]
Back in the 1.5.2 days I played a lot with the ceval loop and the best results I got came from:
a) moving the LOAD_FAST opcode out of the switch
Are you suggesting a test for LOAD_FAST before the switch, e.g.

    if (opcode == LOAD_FAST) {
        // load fast
    }
    else switch (opcode) {
        // body
    }
b) splitting the switch statement in two: the first one for more commonly used opcodes, the second one for less often used opcodes
Are you suggesting something like:

    switch (opcode) {
        case COMMON_OP:
        case COMMON_OP:
        ...
        default:
            switch (opcode) {
                case UNCOMMON_OP:
                case UNCOMMON_OP:
            }
    }

Or something like this:

    switch (opcode) {
        case COMMON_OP:
        case COMMON_OP:
        ...
        default:
            handle_uncommon_ops();
    }
c) ordering cases in the switch statements by usage frequency (using average opcode usage frequencies obtained by instrumenting the interpreter)
I might try a little simulated annealing to generate layouts with high-frequency opcodes near the front and co-occurring opcodes near each other.
The last point is probably compiler dependent. GCC has the tendency to use the same layout for the assembler code as you use in the C source code, so placing often used code close to the top results in better locality (at least on my machines).
damien morton wrote:
From: M.-A. Lemburg [mailto:mal@lemburg.com]
Back in the 1.5.2 days I played a lot with the ceval loop and the best results I got came from:
a) moving the LOAD_FAST opcode out of the switch
Are you suggesting a test for LOAD_FAST before the switch,
e.g.

    if (opcode == LOAD_FAST) {
        // load fast
    }
    else switch (opcode) {
        // body
    }
Yes.
b) splitting the switch statement in two: the first one for more commonly used opcodes, the second one for less often used opcodes
Are you suggesting something like:
switch (opcode) {
    case COMMON_OP:
    case COMMON_OP:
    ...
    default:
        switch (opcode) {
            case UNCOMMON_OP:
            case UNCOMMON_OP:
        }
}
Yes, except that in the default case the code is placed after the first switch.

    switch (opcode) {
        case COMMON_OP:
            goto nextInstruction;
        case COMMON_OP:
            ...
            goto nextInstruction;
        default:
            ;
    }

    switch (opcode) {
        case UNCOMMON_OP:
        case UNCOMMON_OP:
    }
    goto nextInstruction;
Or something like this:
switch (opcode) {
    case COMMON_OP:
    case COMMON_OP:
    ...
    default:
        handle_uncommon_ops();
}
That's hard to do because the loop has so many variables to pass into handle_uncommon_ops(). I haven't tried it though (or at least I don't remember trying it :-), so perhaps this is even better.
c) ordering cases in the switch statements by usage frequency (using average opcode usage frequencies obtained by instrumenting the interpreter)
I might try a little simulated annealing to generate layouts with high-frequency opcodes near the front and co-occurring opcodes near each other.
I did that by hand, sort of :-) The problem is that the scoring phase takes rather long, so you better start with a good guess.
The last point is probably compiler dependent. GCC has the tendency to use the same layout for the assembler code as you use in the C source code, so placing often used code close to the top results in better locality (at least on my machines).
-- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Feb 28 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
Python UK 2003, Oxford: 32 days left EuroPython 2003, Charleroi, Belgium: 116 days left
On Fri, Feb 28, 2003 at 09:48:32AM +0100, M.-A. Lemburg wrote:
damien morton wrote:
From: M.-A. Lemburg [mailto:mal@lemburg.com]
Back in the 1.5.2 days I played a lot with the ceval loop and the best results I got came from:
b) splitting the switch statement in two: the first one for more commonly used opcodes, the second one for less often used opcodes
Are you suggesting something like:
switch (opcode) {
    case COMMON_OP:
    case COMMON_OP:
    ...
    default:
        handle_uncommon_ops();
}
That's hard to do because the loop has so many variables to pass into handle_uncommon_ops(). I haven't tried it though (or at least I don't remember trying it :-), so perhaps this is even better.
The patch I mentioned before did that, but for all opcodes. You could apply only part of the patch. That may yield better results. But for all opcodes it was worse. http://python.org/sf/693638 Neal
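One way around the "too many variables" objection to handle_uncommon_ops() is to bundle the loop's working state into a struct and hand the helper a pointer. The toy below only shows the shape of that refactoring -- nothing in it comes from ceval.c or from Neal's patch, and whether the extra indirection actually pays off is exactly what the patch results above call into question.

    #include <stdio.h>

    enum { OP_STOP, OP_PUSH1, OP_ADD, OP_PRINT, OP_RARE_NEGATE };

    /* The interpreter's working variables, bundled so rarely taken opcodes
       can be handled out of line without passing a dozen arguments. */
    struct eval_state {
        const unsigned char *ip;   /* instruction pointer */
        int *sp;                   /* value stack pointer */
    };

    /* Out-of-line handler for uncommon opcodes; keeps the hot switch small. */
    static void handle_uncommon_ops(int opcode, struct eval_state *st)
    {
        switch (opcode) {
        case OP_RARE_NEGATE:
            st->sp[-1] = -st->sp[-1];
            break;
        }
    }

    static void run(const unsigned char *code)
    {
        int stack[32];
        struct eval_state st;
        st.ip = code;
        st.sp = stack;
        for (;;) {
            int opcode = *st.ip++;
            switch (opcode) {
            case OP_PUSH1:  *st.sp++ = 1;                      break;
            case OP_ADD:    st.sp--; st.sp[-1] += st.sp[0];    break;
            case OP_PRINT:  printf("%d\n", st.sp[-1]);         break;
            case OP_STOP:   return;
            default:        handle_uncommon_ops(opcode, &st);  break;
            }
        }
    }

    int main(void)
    {
        unsigned char code[] = { OP_PUSH1, OP_PUSH1, OP_ADD,
                                 OP_RARE_NEGATE, OP_PRINT, OP_STOP };
        run(code);   /* prints -2 */
        return 0;
    }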
c) ordering cases in the switch statements by usage frequency (using average opcode usage frequencies obtained by instrumenting the interpreter)
I might try a little simulated annealing to generate layouts with high-frequency opcodes near the front and co-occurring opcodes near each other.
I did that by hand, sort of :-) The problem is that the scoring phase takes rather long, so you better start with a good guess.
I'm wondering what a good scoring scheme would look like.

I tried a scoring scheme in which layouts were scored thusly:

    for (i = 0; i < MAXOP; i++)
        for (j = 0; j < MAXOP; j++)
            score += pairfreq[layout[i]][layout[j]] * (i < j ? j-i : i-j)

This works fine, but I'm thinking that a simpler scoring scheme which looks only at the frequencies of adjacent ops might be sufficient, and would certainly be faster.

    for (i = 1; i < MAXOP; i++)
        score += pairfreq[layout[i-1]][layout[i]]

The idea is that while caches favour locality of reference, because a cache line is finite in size and relatively small (16 or 64 bytes), there aren't any long-range effects. In other words, caches favour adjacency of reference rather than locality of reference.
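For what it's worth, here is a self-contained sketch of that annealing loop over the adjacency-only score. The pair frequencies are random placeholders rather than DXP data and the cooling schedule is arbitrary; the point is just how cheap the scoring and the swap/undo step are (link with -lm).

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define MAXOP 8   /* tiny here; the real problem has ~150 opcodes */

    /* Placeholder statistics: pairfreq[a][b] = how often b follows a. */
    static long pairfreq[MAXOP][MAXOP];

    /* Adjacency-only score from the message above: reward layouts that put
       frequently co-occurring opcodes next to each other (higher is better). */
    static long score(const int *layout)
    {
        long s = 0;
        int i;
        for (i = 1; i < MAXOP; i++)
            s += pairfreq[layout[i-1]][layout[i]];
        return s;
    }

    int main(void)
    {
        int layout[MAXOP], i, a, b, tmp;
        long cur, trial;
        double T;

        srand(12345);
        for (a = 0; a < MAXOP; a++)
            for (b = 0; b < MAXOP; b++)
                pairfreq[a][b] = rand() % 100;
        for (i = 0; i < MAXOP; i++)
            layout[i] = i;              /* start from the identity layout */
        cur = score(layout);

        /* Plain simulated annealing: swap two random positions, keep the
           swap if it helps, or with probability exp(delta/T) if it hurts. */
        for (T = 100.0; T > 0.01; T *= 0.999) {
            a = rand() % MAXOP;
            b = rand() % MAXOP;
            tmp = layout[a]; layout[a] = layout[b]; layout[b] = tmp;
            trial = score(layout);
            if (trial >= cur ||
                (double)rand() / RAND_MAX < exp((trial - cur) / T)) {
                cur = trial;            /* accept */
            }
            else {
                tmp = layout[a];        /* undo the swap */
                layout[a] = layout[b];
                layout[b] = tmp;
            }
        }

        printf("score %ld, layout:", cur);
        for (i = 0; i < MAXOP; i++)
            printf(" %d", layout[i]);
        printf("\n");
        return 0;
    }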
I just realised that scoring layouts based on adjacency is the traveling salesman problem, where the distance between two opcodes is freq[op1][op2]+freq[op2][op1], and the goal is to maximise the total distance traveled. Solving for 150 or so opcodes is well within reach.

"Damien Morton" <newsgroups1@bitfurnace.com> wrote in message news:b3o4ti$nsl$1@main.gmane.org...
c) ordering cases in the switch statements by usage frequency (using average opcode usage frequencies obtained by instrumenting the interpreter)
I might try a little simulated annealing to generate layouts with high-frequency opcodes near the front and co-occurring opcodes near each other.
I did that by hand, sort of :-) The problem is that the scoring phase takes rather long, so you better start with a good guess.
I'm wondering what a good scoring scheme would look like.
I tried a scoring scheme in which layouts were scored thusly:
for (i = 0; i < MAXOP; i++)
    for (j = 0; j < MAXOP; j++)
        score += pairfreq[layout[i]][layout[j]] * (i < j ? j-i : i-j)
This works fine, but I'm thinking that a simpler scoring scheme which looks only at the frequencies of adjacent ops might be sufficient, and would certainly be faster.
for (i = 1; i < MAXOP; i++)
    score += pairfreq[layout[i-1]][layout[i]]
The idea is that while caches favour locality of reference, because a cache line is finite in size and relatively small (16 or 64 bytes), there aren't any long-range effects. In other words, caches favour adjacency of reference rather than locality of reference.
I optimised the layout of the python opcodes using a simulated annealing process that scored adjacent opcodes according to their frequency of co-occurrence. This raised my PyStone benchmark from 22100 to 22700, for a 3% gain.

I've been using Skip's DXP server to gather statistics, but there isn't much data there. I should be able to achieve better results if more people contributed stats to his server, more information about which can be found here: http://manatee.mojam.com/~skip/python/

The process of laying out the opcodes and switch cases has largely been automated, and generating new layouts is relatively painless and quick. Do please contribute stats for 2.3a2 to Skip's DXP server.

I also implemented a LOAD_FASTER opcode, with the argument encoded into the opcode. This raised my PyStone benchmark from 22700 to 23150, for a total 5% gain.

The main switch loop looks like this now:

    if (opcode >= LOAD_FASTER) {
        load_fast(opcode - LOAD_FASTER);
        ...
        goto fast_next_opcode;
    }

    switch(opcode) {
        case LOAD_ATTR:
            oparg = NEXTARG();
            w = GETITEM(names, oparg);
            ...
            break;
        ...
    }

Each opcode case now loads its own argument as necessary. The test for HAVE_ARGUMENT is now implemented using an array of bytes. The test now happens very infrequently, so any performance loss is negligible.

    const char HASARG[] = {
        0, /* STOP_CODE */
        1, /* LOAD_ATTR */
        1, /* CALL_FUNCTION */
        1, /* STORE_FAST */
        0, /* BINARY_ADD */
        0, /* SLICE+0 */
        0, /* SLICE+1 */
        0, /* SLICE+2 */
        ...
    }
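A guess at the consuming side of that HASARG table: the remaining places that used to compare the opcode against HAVE_ARGUMENT (the rarely taken tracing path, for instance) become a one-byte table lookup. This is only a reading of the description above, not code from the patch.

    /* before: the classic range test */
    if (opcode >= HAVE_ARGUMENT)
        oparg = NEXTARG();

    /* after (sketch): a table lookup, which still works once the opcode
       numbering is no longer partitioned around HAVE_ARGUMENT */
    if (HASARG[opcode])
        oparg = NEXTARG();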
Are you suggesting a test for LOAD_FAST before the switch,
e.g.

    if (opcode == LOAD_FAST) {
        // load fast
    }
    else switch (opcode) {
        // body
    }
Yes.
Hmm, I might even be able to do something like this:

    if (opcode >= LOAD_FAST_0) {
        oparg = opcode - LOAD_FAST_0;
        ...
    }
    else switch (opcode) {
    }
Yes, except that in the default case the code is placed after the first switch.
switch (opcode) {
    case COMMON_OP:
        goto nextInstruction;
    case COMMON_OP:
        ...
        goto nextInstruction;
    default:
        ;
}
switch (opcode) {
    case UNCOMMON_OP:
    case UNCOMMON_OP:
}
goto nextInstruction;
Why would you do it as two consecutive switches, rather than two nested switches? The current ceval code has nested switches for when CASE_TOO_BIG is defined.

Hmmm. I would guess that the handling of default within a switch is probably highly optimised by the compiler, using if/then range tests. If we structured the opcodes so that uncommon ops were a two-byte opcode, we could do something like this:

    switch (opcode) {
        default:
            load_fast(opcode - LOAD_FAST_0);
            goto nextInstruction;
        case COMMON_OP1:
            ...
            goto nextInstruction;
        case COMMON_OP2:
            ...
            goto nextInstruction;
        case UNCOMMON_OP:
            break;
    }

    opcode = NEXTBYTE();
    switch(opcode) {
        case UNCOMMON_OP1:
            ...
            break;
        case UNCOMMON_OP2:
            ...
            break;
    }

This would free up the opcode numeric space for fast/frequent ops with immediate args encoded into them.
>> > Are you suggesting a test for LOAD_FAST before the switch,
>> >
>> > e.g.
>> > if (opcode == LOAD_FAST) {
>> >     // load fast
>> > }
>> > else switch (opcode) {
>> >     // body
>> > }
>>
>> Yes.

Damien> Hmm, I might even be able to do something like this:

Damien> if (opcode >= LOAD_FAST_0) {
Damien>     oparg = opcode - LOAD_FAST_0;
Damien>     ...
Damien> }
Damien> else switch (opcode) {
Damien> }

I think you want "&& opcode <= LOAD_FAST_15" in there somewhere, or something to cap the range of the test. Now you've increased the cost of the check, maybe making it no longer worthwhile.

Skip
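Spelled out, Skip's capped version of the range test would look something like this (a sketch built from the two snippets above; whether the second compare eats the win is exactly the question he raises):

    if (opcode >= LOAD_FAST_0 && opcode <= LOAD_FAST_15) {
        oparg = opcode - LOAD_FAST_0;
        /* ... inline LOAD_FAST body ... */
    }
    else switch (opcode) {
        /* ... everything else ... */
    }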
participants (11):
- A.M. Kuchling
- damien morton
- Damien Morton
- Dan Sugalski
- Dan Wolfe
- M.-A. Lemburg
- Neal Norwitz
- Raymond Hettinger
- Skip Montanaro
- Terry Reedy
- Tim Peters