
I've done a static analysis of the bytecodes from compiling the python standard library:

Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32

Some stats about JUMP_IF_FALSE opcodes

Of the 2768 JUMP_IF_FALSE opcodes encountered, 2429 have a POP_TOP on both branches. I'd like to propose that JUMP_IF_FALSE consume the top-of-stack.

Some stats about constants

50% of constant accesses are to a fixed set of 5 constants
http://www.bitfurnace.com/python/stats-consts.txt

rank, freq, cum%, const
1, 1277, 18.7, None
2, 929, 32.3, 1
3, 741, 43.1, 0
4, 254, 46.8, ''
5, 228, 50.1, 2

I'd like to propose the following opcodes be added:

    LOAD_CONST(NONE)
    LOAD_CONST(1)
    LOAD_CONST(0)
    LOAD_CONST(EMPTY_STR)

Some stats about the number of constants and locals used in functions

97% of functions use 16 or fewer constants
83% of functions use 8 or fewer constants
http://www.bitfurnace.com/python/stats-func-consts.txt

98% of functions use 16 or fewer locals
85% of functions use 8 or fewer locals
http://www.bitfurnace.com/python/stats-func-locals.txt

I'd like to propose the following opcodes be added (I suggest n=15):

    LOAD_CONST_[0..n]
    LOAD_FAST_[0..n]
    STORE_FAST_[0..n]
    DELETE_FAST_[0..n]

Some stats about instruction traces

Please see the following links for detailed stats:

http://www.bitfurnace.com/python/stats-traces-original.txt
http://www.bitfurnace.com/python/stats-traces-modified.txt

The second file contains stats on instruction traces incorporating the proposed changes. The score column, in both files, is computed by multiplying the frequency by the length of the trace.

I'd like to propose the following opcodes, which should reduce the number of bytecode instructions used by 20%:

    RETURN_FAST  == LOAD_FAST, RETURN_VALUE
    RETURN_CONST == LOAD_CONST, RETURN_VALUE
    LOAD_FAST+1  == LOAD_FAST, LOAD_FAST
    STORE_FAST+1 == STORE_FAST, STORE_FAST
    POP_TOP+1    == POP_TOP, POP_TOP
    POP_TOP+2    == POP_TOP, POP_TOP, POP_TOP
    BRANCH_IF    == COMPARE_OP, JUMP_IF_FALSE, POP_TOP

Notes: LOAD_FAST+1 and STORE_FAST+1 could be implemented as a 1-byte instruction code followed by two nibbles encoding the local index numbers. See above for a discussion of local variable index numbers. BRANCH_IF could be implemented as a set of opcodes, one for each of the possible compare_ops.
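The same kind of static survey is easy to reproduce today with the dis module. Below is a sketch run against a tiny inline source string rather than the whole stdlib; note that today's opcode set differs from 2.2's (the consuming jump proposed above exists as POP_JUMP_IF_FALSE, and CPython 3.12 even grew a RETURN_CONST opcode along these lines):

```python
import dis
from collections import Counter

def opcode_histogram(code):
    """Count opcode names in a code object, recursing into nested code objects."""
    counts = Counter(ins.opname for ins in dis.get_instructions(code))
    for const in code.co_consts:
        if hasattr(const, "co_code"):   # nested function/class body
            counts += opcode_histogram(const)
    return counts

src = """
def f(a, b):
    if a < b:
        return None
    return a
"""
hist = opcode_histogram(compile(src, "<demo>", "exec"))
```

Running this over every module in the stdlib (via compileall or py_compile) and summing the Counters reproduces the kind of frequency table linked above.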

Damien> I've done a static analysis of the bytecodes from compiling the
Damien> python standard library: ...

Nice input, but I think you should deal with dynamic opcode frequencies. See the #define's in the ceval.c source for how to enable that, and the recent thread in c.l.py. In short, it doesn't really matter how many LOAD_CONST instructions appear in the bytecode stream, only how many of them are executed. While LOAD_CONST is a relatively frequently executed opcode, if I recall correctly, LOAD_FAST is much more frequently executed, and LOAD_CONST is already pretty fast, so adding special opcodes probably won't buy you much and runs the risk of disturbing what locality of reference exists.

The JUMP_IF_FALSE issue was discussed recently (here I think). The problem is that chained operations like

    if a < b < c: ...

require the value be retained. You could get rid of the separate POP_TOP instruction but it would require some work in the compiler.

as-always-patches-are-welcome-ly, y'rs,

Skip
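For reference, dynamic (executed, not merely emitted) opcode counts can also be gathered from pure Python today, using the per-opcode tracing that appeared in Python 3.7 — a rough sketch, far slower than the ceval.c #define counters mentioned above:

```python
import dis
import sys
from collections import Counter

executed = Counter()

def tracer(frame, event, arg):
    frame.f_trace_opcodes = True          # request 'opcode' events for this frame
    if event == "opcode":
        op = frame.f_code.co_code[frame.f_lasti]
        executed[dis.opname[op]] += 1
    return tracer

def demo():
    total = 0
    for i in range(100):
        total += i
    return total

sys.settrace(tracer)
result = demo()
sys.settrace(None)
```

After the run, `executed.most_common()` gives exactly the "how many were executed" view: loop-body opcodes such as FOR_ITER dominate regardless of how rarely they appear in the static stream.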

From: Skip Montanaro [mailto:skip@pobox.com]
Damien> I've done a static analysis of the bytecodes from compiling the
Damien> python standard library:
...
Nice input, but I think you should deal with dynamic opcode frequencies. See the #define's in the ceval.c source for how to enable that and the recent thread in c.l.py. In short, it doesn't really matter how many LOAD_CONST instructions appear in the bytecode stream, only how many of them are executed. While LOAD_CONST is a relatively frequently executed opcode, if I recall correctly, LOAD_FAST is much more frequently executed, and LOAD_CONST is already pretty fast, so adding special opcodes probably won't buy you much and runs the risk of disturbing what locality of reference exists.
I will indeed take a look at dynamic opcode frequencies. This is my first stab at it, though.

I had originally taken a look at this from a code compression perspective. The proposals I made should result in a non-trivial reduction in the size of the compiled bytecodes. This, in turn, should result in _some_ speedup.

As you say, LOAD_FAST is a very frequently occurring instruction, both statically and dynamically. Reducing it from a 3-byte instruction to a 1-byte instruction in 97% of (static) cases should be an overall good.

Most of the opcodes I proposed could be added without disturbing locality of reference, e.g.

    switch (op = *p++) {
    ...
    case LOAD_FAST:
        index = (*p++) | ((*p++) << 8);
        goto LOAD_FAST_MAIN;
    case LOAD_FAST_0:
    case LOAD_FAST_1:
    ...
    case LOAD_FAST_15:
        index = op - LOAD_FAST_0;
    LOAD_FAST_MAIN:
        ...
        break;
    }
The JUMP_IF_FALSE issue was discussed recently (here I think). The problem is that chained operations like
if a < b < c: ...
require the value be retained. You could get rid of the separate POP_TOP instruction but it would require some work in the compiler.
That kind of chained operation is by far the minority (20% of static cases). They can be handled by a DUP instruction before the JUMP opcode, or a separate non-consuming JUMP opcode can be created.
as-always-patches-are-welcome-ly, y'rs,
Skip
not-quite-ready-to-push-my-hands-in-up-to-the-elbows-ly, yours Damien
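For what it's worth, the chained case is easy to see in a disassembly: `a < b < c` compiles to two COMPARE_OPs with the middle operand duplicated, precisely so the jump cannot simply consume the tested value. Opcode names vary across versions, so this sketch only counts them:

```python
import dis

def chained(a, b, c):
    return a < b < c

def simple(a, b):
    return a < b

chained_ops = [ins.opname for ins in dis.get_instructions(chained)]
simple_ops = [ins.opname for ins in dis.get_instructions(simple)]
```

The extra instructions between the two comparisons are the duplicate/jump bookkeeping that a DUP-before-JUMP scheme would make explicit.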

As you say, LOAD_FAST is a very frequently occurring instruction, both statically and dynamically. Reducing it from a 3-byte instruction to a 1-byte instruction in 97% of (static) cases should be an overall good.
Most of the opcodes I proposed could be added without disturbing locality of reference.
e.g.
switch (op = *p++) {
...
case LOAD_FAST:
    index = (*p++) | ((*p++) << 8);
    goto LOAD_FAST_MAIN;
case LOAD_FAST_0:
case LOAD_FAST_1:
...
case LOAD_FAST_15:
    index = op - LOAD_FAST_0;
LOAD_FAST_MAIN:
    ...
    break;
}
Good idea. Can you benchmark this? --Guido van Rossum (home page: http://www.python.org/~guido/)

I implemented LOAD_FAST_n, STORE_FAST_n, LOAD_CONST_n for n < 16.

Getting a small 2% improvement in speed, going from about 21800 PyStones to 22300 PyStones; it's very hard to get consistent readings on the PyStones - anyone got any tips on how to get more consistent results under windows?

Getting a small 3% reduction in .pyc filesizes:

    os.path 24,929 unmodified
    os.path 24,149 with modifications

I sort of cheated on the switch statement to avoid the use of a goto.

    opcode = NEXTOP();
    if (HAS_ARG(opcode))
        oparg = NEXTARG();
    ...
    switch (opcode) {
    ...
    case LOAD_FAST_14:
    case LOAD_FAST_15:
        oparg = opcode - LOAD_FAST_0;
        /* fall through */
    case LOAD_FAST:
        x = GETLOCAL(oparg);
        if (x != NULL) {
            Py_INCREF(x);
            ...

I also altered the opcode.h file to use an enum for the opcodes instead of all those #defines. Much easier to re-arrange things that way. I have a feeling that most of the speedup (such as it is) comes from that re-arrangement, which packs the opcodes into a contiguous numeric space. I suspect that sorting the opcodes by frequency of access might also have some positive effect. Also, organising the opcodes and the switch statement so that frequently co-occurring opcodes are adjacent to each other might also have some positive effect.
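The space saving is easy to model outside the interpreter. A toy encoder/decoder (the opcode numbers here are made up, not CPython's) shows the 3-byte to 1-byte reduction for small local indices:

```python
# Hypothetical opcode numbers, not CPython's real ones.
LOAD_FAST = 124            # generic form: opcode + 2-byte little-endian index
LOAD_FAST_0 = 200          # 200..215: specialized 1-byte forms for index 0..15

def encode_load_fast(index):
    if index < 16:
        return bytes([LOAD_FAST_0 + index])               # 1 byte
    return bytes([LOAD_FAST, index & 0xFF, index >> 8])   # 3 bytes

def decode(code):
    """Recover the index operands from a stream of LOAD_FAST-style instructions."""
    i, indices = 0, []
    while i < len(code):
        op = code[i]
        if LOAD_FAST_0 <= op < LOAD_FAST_0 + 16:
            indices.append(op - LOAD_FAST_0)
            i += 1
        elif op == LOAD_FAST:
            indices.append(code[i + 1] | (code[i + 2] << 8))
            i += 3
        else:
            raise ValueError("unknown opcode %d" % op)
    return indices

stream = b"".join(encode_load_fast(n) for n in [0, 3, 15, 16, 300])
```

With 97% of static LOAD_FASTs hitting the 1-byte form, this kind of shrink is roughly where the reported ~3% .pyc reduction comes from.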
-----Original Message-----
From: guido@python.org [mailto:guido@python.org]
Sent: Tuesday, 25 February 2003 20:25
To: damien morton
Cc: python-dev@python.org
Subject: Re: [Python-Dev] Bytecode analysis
As you say, LOAD_FAST is a very frequently occurring instruction, both statically and dynamically. Reducing it from a 3-byte instruction to a 1-byte instruction in 97% of (static) cases should be an overall good.
Most of the opcodes I proposed could be added without disturbing locality of reference.
e.g.
switch (op = *p++) {
...
case LOAD_FAST:
    index = (*p++) | ((*p++) << 8);
    goto LOAD_FAST_MAIN;
case LOAD_FAST_0:
case LOAD_FAST_1:
...
case LOAD_FAST_15:
    index = op - LOAD_FAST_0;
LOAD_FAST_MAIN:
    ...
    break;
}
Good idea. Can you benchmark this?
--Guido van Rossum (home page: http://www.python.org/~guido/)

I implemented LOAD_FAST_n, STORE_FAST_n, LOAD_CONST_n for n < 16
Getting a small 2% improvement in speed, going from about 21800 PyStones to 22300 PyStones; it's very hard to get consistent readings on the PyStones - anyone got any tips on how to get more consistent results under windows?
Upgrade to a real operating system. :-)
Getting a small 3% reduction in .pyc filesizes:

    os.path 24,929 unmodified
    os.path 24,149 with modifications
I sort of cheated on the switch statement to avoid the use of a goto.
opcode = NEXTOP();
if (HAS_ARG(opcode))
    oparg = NEXTARG();
...
switch (opcode) {
...
case LOAD_FAST_14:
case LOAD_FAST_15:
    oparg = opcode - LOAD_FAST_0;
    /* fall through */
case LOAD_FAST:
    x = GETLOCAL(oparg);
    if (x != NULL) {
        Py_INCREF(x);
        ...
This is ok.
I also altered the opcode.h file to use an enum for the opcodes instead of all those #defines. Much easier to re-arrange things that way. I have a feeling that most of the speedup (such as it is) comes from that re-arrangement, which packs the opcodes into a contiguous numeric space. I suspect that sorting the opcodes by frequency of access might also have some positive effect. Also, organising the opcodes and the switch statement so that frequently co-occurring opcodes are adjacent to each other might also have some positive effect.
Try that without the LOAD_XXX_n idea, so you can measure the effect of each idea separately.

One problem with renumbering the opcodes is that you have to update the "dis" module, which exports the opcodes as Python constants. Maybe you should add some code there to regenerate the values from the .h file, as is done in keyword.py and symbol.py.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido> One problem with renumbering the opcodes is that you have to
Guido> update the "dis" module, which exports the opcodes as Python
Guido> constants. Maybe you should add some code there to regenerate
Guido> the values from the .h file, as is done in keyword.py and
Guido> symbol.py.

I have long had a version of dis.py which imports opcode info from a separate opcodes.py. Perhaps it's time to make that change. opcodes.py could then be mostly generated automatically from Include/opcode.h. (The reason for separating this out is to make opcode info available to other modules, like peephole optimizers.)

Should I submit a patch?

Skip
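Generating the Python-side table from the header is a few lines of regex work. A sketch against an inline sample of the header (the real Include/opcode.h has more forms, such as the HAVE_ARGUMENT marker, which a full generator would need to handle):

```python
import re

# Inline stand-in for Include/opcode.h; the real file would be read from disk.
sample_header = """
#define POP_TOP          1
#define ROT_TWO          2
#define UNARY_POSITIVE  10
#define LOAD_FAST      124   /* Local variable number */
"""

def parse_opcodes(header_text):
    """Build a {name: number} map from #define lines."""
    pattern = re.compile(r"#define\s+(\w+)\s+(\d+)")
    return {name: int(num) for name, num in pattern.findall(header_text)}

opmap = parse_opcodes(sample_header)
```

The generated mapping is exactly what dis needs, and keeping it in one generated module means a renumbering only requires re-running the generator.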

I have long had a version of dis.py which imports opcode info from a separate opcodes.py. Perhaps it's time to make that change. opcodes.py could then be mostly generated automatically from Include/opcode.h.
(The reason for separating this out is to make opcode info available to other modules, like peephole optimizers.)
Should I submit a patch?
Yes please!

--Guido van Rossum (home page: http://www.python.org/~guido/)

-----Original Message-----
From: python-dev-admin@python.org [mailto:python-dev-admin@python.org] On Behalf Of Guido van Rossum
Sent: Wednesday, February 26, 2003 10:52 AM
To: skip@pobox.com
Cc: python-dev@python.org
Subject: Re: [Python-Dev] Bytecode analysis
I have long had a version of dis.py which imports opcode info from a separate opcodes.py. Perhaps it's time to make that change. opcodes.py could then be mostly generated automatically from Include/opcode.h.
(The reason for separating this out is to make opcode info available to other modules, like peephole optimizers.)
Should I submit a patch?
Yes please!
While the subject is up, I previously submitted a patch [683074] to allow you to specify an output stream. Right now dis just uses print, which made it a little difficult to redirect the output. Is there any interest in this feature? If so, I'll update the patch against Skip's. If not, or if there are objections, the patch can be closed out. -Grant

While the subject is up, I previously submitted a patch [683074] to allow you to specify an output stream. Right now dis just uses print, which made it a little difficult to redirect the output. Is there any interest in this feature? If so, I'll update the patch against Skip's. If not, or if there are objections, the patch can be closed out.
-1

You can redirect everything's output to a file by assigning that file to sys.stdout. I really don't think we need to add output file options to every class in the standard library. dis is primarily an interactive and debugging tool. And you can always subclass it.

--Guido van Rossum (home page: http://www.python.org/~guido/)
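For the record, the sys.stdout route needs no patch at all. A sketch using contextlib.redirect_stdout, the tidy modern spelling of the same idea (and later Pythons did eventually give dis.dis a file= argument):

```python
import dis
import io
from contextlib import redirect_stdout

def f(x):
    return x + 1

buf = io.StringIO()
with redirect_stdout(buf):      # dis prints to sys.stdout by default
    dis.dis(f)
listing = buf.getvalue()
```

`listing` now holds the full disassembly as a string, ready to be written anywhere.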

Damien Morton wrote:
Ive done a static analysis of the bytecodes from compiling the python standard library:
Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
Some stats about JUMP_IF_FALSE opcodes
Of the 2768 JUMP_IF_FALSE opcodes encountered, 2429 have a POP_TOP on both branches.
Id like to propose that JUMP_IF_FALSE consume the top-of-stack.
I'm all against changing existing opcodes for a minor speed-up. Real speed-up is available through a specializing compiler which turns things into real code.

If you really want to change the engine, I would consider emitting an extra opcode and trying how the change performs against a couple of applications. I doubt the effect, since the little POP_TOP is pretty fast.

Where you really can save some time is to shortcut some of the very short opcodes so they do not jump back to the ticker-counting code, but into a shorter circle. This avoids quite some register moves and gives some local optimization possibilities.

I would anyway suggest not changing the semantics of existing opcodes for just a little win. ...
I'd like to propose the following opcodes be added:

    LOAD_CONST(NONE)
    LOAD_CONST(1)
    LOAD_CONST(0)
    LOAD_CONST(EMPTY_STR)
I'd be careful here, too. The interpreter loop is quite large already, and there is a good chance of losing locality of reference by adding a little bit of code. I have had that happen several times: you don't think you changed much, but where have these 10 percent gone now?

Not trying to demoralize you completely, but there are limits to what can be gained by optimizing the interpreter loop. There was once the p2c project, which gave an overall improvement of 25-40 percent by totally removing the interpreter loop. Since I knew that, I stopped wasting brain cycles on optimizing it, because the possible win is not really promising. There are much better ways to speed Python up. One is to completely rewrite Python in Python and apply a specializing compiler to everything. This is what we are trying on pypy-dev.

cheers - chris

--
Christian Tismer :^) <mailto:tismer@tismer.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/

Christian Tismer wrote:
Where you really can save some time is to shortcut some of the very short opcodes to not jump back to the ticker counting code, but into a shorter circle.
2.2 -> 2.3 includes this optimization for some opcodes.
Not trying to demoralize you completely, but there are limits about what can be gathered by optimizing the interpreter loop. There was once the p2c project, which gave an overall improvement of 25-40 percent, by totally removing the interpreter loop.
Yes, but p2c was probably not nice to the icache. I doubt 25-40% is an upper bound. Memory bandwidth really sucks now (relatively speaking). I think reference counting is now starting to look like a smart design (in terms of performance). :-) Neil

Neil Schemenauer wrote:
Christian Tismer wrote:
Where you really can save some time is to shortcut some of the very short opcodes to not jump back to the ticker counting code, but into a shorter circle.
2.2 -> 2.3 includes this optimization for some opcodes.
Oh! *blush* I should read more of this :-)
Not trying to demoralize you completely, but there are limits about what can be gathered by optimizing the interpreter loop. There was once the p2c project, which gave an overall improvement of 25-40 percent, by totally removing the interpreter loop.
Yes, but p2c was probably not nice to the icache. I doubt 25-40% is an upper bound. Memory bandwidth really sucks now (relatively speaking). I think reference counting is now starting to look like a smart design (in terms of performance). :-)
Yes, it created lots of code, and for sure, code repetition can be helpful if the relevant pieces are kept tightly together. But I still doubt bigger improvements, since most of the time is spent in the C library code. You need to optimize this away, too.

ciao - chris

Some stats about JUMP_IF_FALSE opcodes
Of the 2768 JUMP_IF_FALSE opcodes encountered, 2429 have a POP_TOP on both branches.
Id like to propose that JUMP_IF_FALSE consume the top-of-stack.
I'm all against changing existing opcodes for a minor speed-up. Real speed-up is available by a specializing compiler which turns things into real code.
If you are referring to Psyco, I expect that it is several years from maturity. Currently it uses too much memory to be realistic.
If you really want to change the engine, I would consider to emit an extra opcode and try how the change performs against a couple of applications. I doubt the effect, since the little POP_TOP is pretty fast.
OTOH, compared to the work that POP_TOP does, the work of decoding the opcodes is significant.
Where you really can save some time is to shortcut some of the very short opcodes to not jump back to the ticker counting code, but into a shorter circle. This avoids quite some register moves and gives some local optimization possibilities.
Some of that is already done (these say 'continue' instead of 'break'). But I'm sure more can be done.
I would anyway suggest not to change the semantics of existing opcodes for just little win.
Why not? The opcodes are an internal detail of the PVM, and they change a bit with almost every Python version.
...
Id like to propose the following opcodes be added LOAD_CONST(NONE) LOAD_CONST(1) LOAD_CONST(0) LOAD_CONST(EMPTY_STR)
I'd be careful here, too. The interpreter loop is quite large, already, and there is a good chance to lose locality of reference by adding a little bit of code. I had that several times. You don't think you changed much, but where are these 10 percent gone now?
Agreed. This adds more cases to the switch and doesn't reduce the number of opcodes to be decoded (it only reduces the number of bytes per opcode, a very meagre gain indeed).
Not trying to demoralize you completely, but there are limits about what can be gathered by optimizing the interpreter loop. There was once the p2c project, which gave an overall improvement of 25-40 percent, by totally removing the interpreter loop.
Yes, that's an upper bound for what you can gain by fiddling opcodes.

There are other gains possible though. The PVM isn't just the switch in ceval.c: it is also all the object implementations. While most are pretty lean, there's still fluff, e.g. in the lookup of builtins (SF patch 597907).

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote: [Chris (not the bot)]
I'm all against changing existing opcodes for a minor speed-up. Real speed-up is available by a specializing compiler which turns things into real code.
[Guido]
If you are referring Psyco, I expect that it is several years before maturity. Currently it uses too much memory to be realistic.
You are probably right about maturity, but I don't think it will be years. In a month, he will be able to work full-time on this, and I just spent a week with him on the PyPy sprint and saw a genius fly... amazing! ...
OTOH, compared to the work that POP_TOP does, the work of decoding the opcodes is significant.
Good point.
Where you really can save some time is to shortcut some of the very short opcodes to not jump back to the ticker counting code, but into a shorter circle. This avoids quite some register moves and gives some local optimization possibilities.
Some of that is already done (these say 'continue' instead of 'break'). But I'm sure more can be done.
Hey, this is amazing! This is kind of the stuff that I did to the interpreter a couple of years ago, when you really disliked anybody hacking on the core to save some cycles. Now I see this evolve on its own, with some pleasure. Every patch that I can dispose of is a good patch. :-)
I would anyway suggest not to change the semantics of existing opcodes for just little win.
Why not? The opcodes are an internal detail of the PVM, and they change a bit with almost every Python version.
Maybe I've become a bit conservative, trying to keep more compatibility between versions than necessary. There have been opcode changes from time to time, but semantics were never changed, and new opcodes were always added to the end of the table, so I thought you wanted to avoid unnecessary changes to dis.dis and friends, and maybe stay able to run older .pycs with newer interpreters, but I really don't insist. I just found a minor speedup not worth the change at all. (Hey, did you expect such words from me? :-)

[locality of reference and possible code bloat]
Agreed. This adds more cases to the switch and doesn't reduce the number of opcodes to be decoded (it only reduces the number of bytes per opcode, a very meagre gain indeed).
Well, if we are going to refactor opcodes, it might be worth considering changes that both don't increase the number of opcodes to be executed (they shorten the code objects) and don't increase the number of opcodes to be distinguished. Modifying the JUMP_IF_XXX opcodes is towards this goal. There was this other proposal of CALL_METHOD (or so), which adds an opcode but shortens interpretation and code size.

Maybe it makes sense to get rid of some more seldom-used opcodes like BUILD_CLASS, which could be replaced by a regular call to a special function? EXEC_STMT is a candidate as well; it is used not so often and could be a special function call. All the PRINT_XXX opcodes do some time-consuming stuff, so we could replace them by special function calls, since the overhead doesn't count here. They do not need to sit in the big switch.

These were just some random ideas. There are time-critical, often-used opcodes, together with seldom-used ones which take quite some time anyway. While compressing patterns of the frequently used ones into fast combined opcodes (CALL_METHOD, JUMP_IF_XXX with pop), the space of the others can be reclaimed.
Not trying to demoralize you completely, but there are limits about what can be gathered by optimizing the interpreter loop. There was once the p2c project, which gave an overall improvement of 25-40 percent, by totally removing the interpreter loop.
Yes, that's an upper bound for what you can gain by fiddling upcodes.
Which made opcode optimization rather pointless for me, and I stopped hacking on this.
There are other gains possible though. The PVM isn't just the switch in ceval.c: it is also all the object implementations. While most are pretty lean, there's still fluff, e.g. in the lookup of builtins (SF patch 597907).
And specialized string-only namespaces, cached method lookups, melting down local variables into primitive C types, ... there are tons of things which I could add, but I don't want to try this in C. Python is much better for prototyping. :-)

cheers - chris

We should probably stop while we're chasing Fredrik away again :-), but there could be some merit in Christian's idea of replacing some of the more rarely-used opcodes by function calls, or by some other way of encoding these that uses fewer cases in the switch. But there aren't very many of these, so I'm not sure how much it will help. (And we should be careful not to change the semantics.) --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
We should probably stop while we're chasing Fredrik away again :-),
not by this discussion which he probably likes :-)
but there could be some merit in Christian's idea of replacing some of the more rarely-used opcodes by function calls, or by some other way of encoding these that uses fewer cases in the switch. But there aren't very many of these, so I'm not sure how much it will help.
Well, I haven't done a complete analysis yet, but there are quite some other candidates for refactoring. I just stumbled over IMPORT_XXX, which is normally only used during some initialization code. (Yeah, some people use it from inside a function, but I don't think it would break semantics to cache that after the first call...)

UNPACK_SEQUENCE could stop going through every alternative: just do the most common tuple stuff and otherwise transfer to a function, which makes the code shorter.

Then, there are 13 new opcodes for the INPLACE_XXX things. Are they used often enough to deserve so much space? The code patterns are very similar, apart from add and subtract, which have inlined integer optimization. The latter special case *might* be considered for folding into the BINARY_ADD special case, which looks almost exactly the same; the code only differs by the

    x = PyNumber_Add(v, w);

vs.

    x = PyNumber_InPlaceAdd(v, w);

fallbacks.

Then, for example, see BINARY_XOR vs. INPLACE_XOR. These have exactly the same pattern, just a different function call inside. It can make sense to fold them back into one, i.e. one case for two opcodes but indexing the function, or even folding a whole bunch of these which look so very similar, and having a plain table of functions, directly indexed by the opcode, which is probably still sitting in a register.

There are other patterns (randomly picked, since I didn't intend to really do an analysis now), like here:

    case INPLACE_TRUE_DIVIDE:
    case INPLACE_FLOOR_DIVIDE:
    case INPLACE_MODULO:

The code has the same pattern all the time: popping stuff, calling a different function, decrefing, pushing the result and handling errors. It might (might, but that needs investigation) pay off to fold such similar patterns into one routine, which picks the only difference (the internal function to be called) out of a table, instead of repeating all the surrounding code explicitly. (although I generally agree that explicit is better than...:)

So there are lots of foldings to try, if somebody cares. This is hard, anyway, since it is platform-dependent whether it creates an advantage or the opposite, I fear.
(And we should be careful not to change the semantics.)
Absolutely d'acor - ly -- chris -- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/ 14109 Berlin : PGP key -> http://wwwkeys.pgp.net/ work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776 PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04 whom do you want to sponsor today? http://www.stackless.com/

Christian Tismer wrote:

[much about bytecode, folding, ...]

Well, to stop myself blathering about bytecode and how to make it different, while I'm expected to do other things or at least sleep, I can't resist throwing this rigorous idea into py-dev, which struck me at 4:30 in the morning.

What would happen if we re-structured the ceval loop this way: we look at every similar pattern and group together all the opcodes which have a similar pattern. By "pattern" I mean the surrounding calls of simple things like stack operations before and after, and error handling. So what I would like to try is to merge all opcodes which have exactly the same pattern into one case, which then does the common preparation stuff and postprocessing including error handling, but internally dispatches on the function to be called. No idea if it may be an improvement, but just let me throw it in.

Example:

    case UNARY_POSITIVE:
        v = POP();
        x = PyNumber_Positive(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;

    case UNARY_NEGATIVE:
        v = POP();
        x = PyNumber_Negative(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;

    ...

    case UNARY_CONVERT:
        v = POP();
        x = PyObject_Repr(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;

    case UNARY_INVERT:
        v = POP();
        x = PyNumber_Invert(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;

is converted into

    case UNARY_POSITIVE:
    case UNARY_NEGATIVE:
    case UNARY_CONVERT:
    case UNARY_INVERT:
        v = POP();
        switch (opcode) {
        case UNARY_POSITIVE:
            x = PyNumber_Positive(v);
            break;
        case UNARY_NEGATIVE:
            x = PyNumber_Negative(v);
            break;
        case UNARY_CONVERT:
            x = PyObject_Repr(v);
            break;
        case UNARY_INVERT:
            x = PyNumber_Invert(v);
        }
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;

Maybe it also makes sense to use indexing into a static array, instead of the case construct. Note that there can be one single such table for all opcodes and all cases, since opcodes are still disjoint. It depends where this table is stored and whether it can get into the cache.

While I don't know if this really makes the interpreter more efficient, at least it makes it shorter to read and maybe easier to maintain.

Ok ok, I will sleep now :-)

cheers - chris
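The folded dispatch reads naturally as a table of functions. Here is the same idea sketched in Python (opcode numbers are made up; operator and repr stand in for the PyNumber_* calls):

```python
import operator

# Hypothetical opcode numbers for the four unary operations.
UNARY_POSITIVE, UNARY_NEGATIVE, UNARY_CONVERT, UNARY_INVERT = 10, 11, 12, 15

# One shared table replaces four near-identical switch arms.
UNARY_TABLE = {
    UNARY_POSITIVE: operator.pos,
    UNARY_NEGATIVE: operator.neg,
    UNARY_CONVERT:  repr,           # UNARY_CONVERT was backquote-repr
    UNARY_INVERT:   operator.inv,
}

def run_unary(opcode, stack):
    """Common pop/call/push pattern with a single indexed dispatch."""
    v = stack.pop()
    stack.append(UNARY_TABLE[opcode](v))

stack = [5]
run_unary(UNARY_NEGATIVE, stack)
```

The pop/call/push and error-handling pattern is written once; only the table lookup varies per opcode, which is exactly the "static array instead of the case construct" variant.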

There is a piece of low hanging fruit in "while True" constructs. If the code generator can recognize it, then three op-codes can be shaved off of every pass (LOAD_GLOBAL, JUMP_IF_FALSE, and POP_TOP).

---------------------------------------------------------
def f():
    while True:
        print "hello"

dis.dis(f)

  2           0 SETUP_LOOP             17 (to 20)
        >>    3 LOAD_GLOBAL             0 (True)
              6 JUMP_IF_FALSE           9 (to 18)
              9 POP_TOP

  3          10 LOAD_CONST              1 ('hello')
             13 PRINT_ITEM
             14 PRINT_NEWLINE
             15 JUMP_ABSOLUTE           3
--------------------------------------------------------

It would be great if lines 3, 6, and 9 were not generated, or if line 15 was JUMP_ABSOLUTE 10.

Raymond Hettinger
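This fruit did eventually get picked: True became a keyword constant in Python 3, and the compiler now emits no test at all for `while True`. A quick check (asserting only the absence of the global load and the conditional jump, since the remaining loop opcodes vary by version):

```python
import dis

def f():
    while True:
        pass

opnames = [ins.opname for ins in dis.get_instructions(f)]
```

On any modern Python, `opnames` contains neither a LOAD_GLOBAL for True nor a conditional jump: the loop body simply jumps back on itself.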

Raymond> There is a piece of low hanging fruit in "while True"
Raymond> constructs. If the code generator can recognize it, ...

Yes, that would improve things, but it gets back to the issue that True isn't, strictly speaking, a constant, and has to be evaluated each pass around the loop. None is moving slowly toward becoming a true constant. Perhaps the same should be true of True and False.

Skip

None is moving slowly toward becoming a true constant. Perhaps the same should be true of True and False.
Problem with this is that there's currently lots of code out there that was recently modified to read

    try:
        True
    except NameError:
        True = 1
        False = 0

That would become a syntax error if True/False were true constants. :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)

>> None is moving slowly toward becoming a true constant. Perhaps the
>> same should be true of True and False.

Guido> Problem with this is that there's currently lots of code out
Guido> there that was recently modified to read

Guido> try:
Guido>     True
Guido> except NameError:
Guido>     True = 1
Guido>     False = 0

Guido> That would become a syntax error if True/False were true
Guido> constants. :-(

Hmmm... Those code blocks would only be executed on older versions of the interpreter. Is there some way to use that knowledge to finesse the problem? It would be a real hack, but if the compiler recognized precisely the above construct (or at least assignment in an except block guarded by NameError), it could shut up about the assignment.

Skip

>> None is moving slowly toward becoming a true constant. Perhaps the
>> same should be true of True and False.

Guido> Problem with this is that there's currently lots of code out
Guido> there that was recently modified to read

Guido>     try:
Guido>         True
Guido>     except NameError:
Guido>         True = 1
Guido>         False = 0

Guido> That would become a syntax error if True/False were true
Guido> constants. :-(

Hmmm... Those code blocks would only be executed on older versions of the interpreter. Is there some way to use that knowledge to finesse the problem? It would be a real hack, but if the compiler recognized precisely the above construct (or at least assignment in an except block guarded by NameError), it could shut up about the assignment.
You would have to do a fairly significant analysis of all code in the module to know that there are no assignments to globals True or False anywhere, *and* you would have to add a heuristic to cope with the very code fragment above, *and* you would have to have a firm belief that no other code would do

    import foo
    foo.True = "True"

I've often mentioned the first item above (per-module code analysis) as doable, but it's not done.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Skip Montanaro writes:
Hmmm... Those code blocks would only be executed on older versions of the interpreter. Is there some way to use that knowledge to finesse the problem? It would be a real hack, but if the compiler recognized precisely the above construct (or at least assignment in an except block guarded by NameError), it could shut up about the assignment.
Another possibility would be that in a module containing *no* assignments to True and False, references could be converted to special byte codes. Not quite the same, but allows for optimizations at any rate. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> PythonLabs at Zope Corporation
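Fred's precondition (a module containing *no* assignments to True and False) is mechanically checkable. A sketch of such a scan on a modern interpreter, walking a module's code objects recursively with the dis module (the helper name is made up for illustration):

```python
import dis

def rebinds_global(name, code):
    """Return True if this code object, or any nested one, stores to or
    deletes the given name as a global. (A real scan would also need to
    treat module-level STORE_NAME as a global store.)"""
    for ins in dis.get_instructions(code):
        if ins.opname in ("STORE_GLOBAL", "DELETE_GLOBAL") and ins.argval == name:
            return True
    # Nested functions and classes live in co_consts as code objects.
    for const in code.co_consts:
        if hasattr(const, "co_code") and rebinds_global(name, const):
            return True
    return False

src = """
x = Truish
def f():
    global Truish
    Truish = 1
"""
mod = compile(src, "<mod>", "exec")
print(rebinds_global("Truish", mod))   # → True: the nested function rebinds it
print(rebinds_global("Falsish", mod))  # → False: nobody touches this one
```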

Guido> Problem with this is that there's currently lots of code out
Guido> there that was recently modified to read

Guido>     try:
Guido>         True
Guido>     except NameError:
Guido>         True = 1
Guido>         False = 0

Guido> That would become a syntax error if True/False were true
Guido> constants. :-(
Hmmm... Those code blocks would only be executed on older versions of the interpreter. Is there some way to use that knowledge to finesse the problem? It would be a real hack, but if the compiler recognized precisely the above construct (or at least assignment in an except block guarded by NameError), it could shut up about the assignment.
Would it be simpler to institute a special rule that True = 1 is silently ignored, but True = anything else generates an error message? Or am I overlooking something important?

If you want to handle assigning to True an expression whose value is known only at run time, translate it into

    if expression != 1:
        raise Exception, "Attempt to assign invalid value to True"

which would give a run-time error without costing cycles anywhere else.

Paul Hughett

Would it be simpler to institute a special rule that True = 1 is silently ignored, but True = anything else generates an error message? Or am I overlooking something important?
Some people prefer True = (1 > 0) because this accesses the special integer (with value 1) that is used by the interpreter for boolean results.
If you want to handle assigning to True an expression whose value is known only at run time, translate it into
    if expression != 1:
        raise Exception, "Attempt to assign invalid value to True"
which would give a run-time error without costing cycles anywhere else.
I suggest you try coming up with a patch for that to see how much work it is. I expect it to be tough. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote: ...
If you want to handle assigning to True an expression whose value is known only at run time, translate it into
    if expression != 1:
        raise Exception, "Attempt to assign invalid value to True"
hehe. raise Exception, "Attempt to re-define truth :-)"
which would give a run-time error without costing cycles anywhere else.
I suggest you try coming up with a patch for that to see how much work it is. I expect it to be tough.
Yes. ciao - chris -- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/ 14109 Berlin : PGP key -> http://wwwkeys.pgp.net/ work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776 PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04 whom do you want to sponsor today? http://www.stackless.com/

Christian Tismer:
    if expression != 1:
        raise Exception, "Attempt to assign invalid value to True"
hehe. raise Exception, "Attempt to re-define truth :-)"
To be known as NineteenEightyFourError. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

Paul> Would it be simpler to institute a special rule that True = 1 is
Paul> silently ignored, but True = anything else generates an error
Paul> message? Or am I overlooking something important?

For code that wants to cleanly cross the boundary between Python with no boolean type and Python with a boolean type, you sometimes see

    True = 1==1
    False = 1==0

You get True and False if they didn't exist before, with the added benefit that if they do exist, they get found in globals() or locals() instead of in __builtins__ and have the right type.

Skip
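The `1==1` idiom relied on comparisons returning the canonical truth value, and that property survives in modern Python, where bool is a proper type and comparison results are the True/False singletons:

```python
# Comparison results are the canonical bool singletons, so the objects
# the old idiom produced are identical to today's keywords.
print((1 == 1) is True)   # → True
print((1 == 0) is False)  # → True
```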

"Skip Montanaro" <skip@pobox.com> wrote in message news:15964.59840.342254.176499@montanaro.dyndns.org...
For code that wants to cleanly cross the boundary between Python with no boolean type and Python with a boolean type, you sometimes see
    True = 1==1
    False = 1==0
You get True and False if it didn't exist before and have the added benefit that if it does exist, it gets found in globals() or locals() instead of in __builtins__ and has the right type.
This suggests a more radical optimization that might do more speedup than all the byte-code fiddling currently proposed: automatically localize, at definition time, builtins used within a function. This would be like giving each function a customized implicit set of default args: True=True, None=None, len=len, ...etc. If a program alters __builtins__, def statement placement would determine which version gets used.

A non-code-breaking alternative: add a localize() builtin that does the fixup after the fact, something like staticmethod() and classmethod(). If made recursively applicable to classes and modules as well as functions, one could end the setup phase of a long-running program with localize(__import__(__name__)) to localize everything.

A speedup of fraction x at a cost of time t would pay off if the program runs longer than t/x. (Example: a .05 reduction with a 1-minute fixup pays off after 1/.05 = 20 minutes.)

I don't really know, but I imagine that psyco, for instance, would benefit by having builtins localized and detectably constant (by the lack of any redefinition within the function). As it is now, neither the compiler nor psyco can assume that True, for instance, is not rebound between function invocations. Turning builtins into default locals isolates a function much more than it now is.

Well, just some crazy thoughts...

Terry J. Reedy
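Terry's "customized implicit set of default args" can be written by hand today; it is the classic default-argument localization trick, binding the builtin once at def time instead of looking it up on every call:

```python
# `len=len` snapshots the builtin at definition time; inside the loop
# it is then a fast local lookup rather than a global+builtin lookup.
def total_length(seqs, len=len):
    total = 0
    for s in seqs:
        total += len(s)
    return total

print(total_length([[1, 2], [3], [4, 5, 6]]))  # → 6
```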

On Wed, Feb 26, 2003 at 01:06:44PM -0500, Terry Reedy wrote:
"Skip Montanaro" <skip@pobox.com> wrote in message news:15964.59840.342254.176499@montanaro.dyndns.org...
For code that wants to cleanly cross the boundary between Python with no boolean type and Python with a boolean type, you sometimes see
    True = 1==1
    False = 1==0
You get True and False if it didn't exist before and have the added benefit that if it does exist, it gets found in globals() or locals() instead of in __builtins__ and has the right type.
This suggests a more radical optimization that might do more speedup than all the byte-code fiddling currently proposed: automatically localize, at definition time, builtins used within a function. This would be like giving each function a customized implicit set of default args: True=True, None=None, len=len, ...etc. If a program alters __builtins__, def statement placement would determine which version gets used.
Hi,

I have a piece of code that does (almost) exactly that, except at runtime. Its purpose is to optimize access to globals used as constants by replacing the LOAD_GLOBAL opcode with LOAD_CONST. It does that by creating a new code object for the function you provide, using a list of symbols to consider constant. I called it bind, because it works like the PostScript bind operator. The code for the main function is below; you can find the full module at ftp://ftp.logilab.org/pub/tmp/bind.py

It can provide a 10% speedup, although most of the time it is around 3% to 5%.

regards,
Ludovic

    from dis import HAVE_ARGUMENT
    from new import code as make_code, function as make_function
    import inspect

    LOAD_GLOBAL = 116
    LOAD_CONST = 100
    EXTENDED_ARG = 143
    STORE_GLOBAL = 97

    def bind_code(co, globals):
        """Take a code object and a dictionary and return a new code
        object where the LOAD_GLOBAL opcodes are replaced by LOAD_CONST
        whenever the global's name appears in the dictionary"""
        consts = list(co.co_consts)
        assigned = {}
        code = co.co_code
        new_code = ""
        n = len(code)
        i = 0
        while i < n:
            c = code[i]
            op = ord(c)
            i += 1
            if op >= HAVE_ARGUMENT:
                oparg = ord(code[i]) + ord(code[i+1])*256
                i += 2
            else:
                oparg = None
            if op == LOAD_GLOBAL:
                name = co.co_names[oparg]
                if globals.has_key(name):
                    k = assigned.get(name, None)
                    if k is None:
                        k = len(consts)
                        assigned[name] = len(consts)
                        consts.append(globals[name])
                    op = LOAD_CONST
                    oparg = k
            new_code += chr(op)
            if oparg is not None:
                new_code += chr(oparg & 255)
                new_code += chr((oparg >> 8) & 255)
        return make_code(co.co_argcount, co.co_nlocals, co.co_stacksize,
                         co.co_flags, new_code, tuple(consts), co.co_names,
                         co.co_varnames, co.co_filename, co.co_name,
                         co.co_firstlineno, co.co_lnotab)

    def bind(f, globals):
        """Return a new function whose code object has been bound by
        bind_code()"""
        newcode = bind_code(f.func_code, globals)
        defaults = f.func_defaults or ()
        return make_function(newcode, f.func_globals, f.func_name, defaults)

-- Ludovic Aubry LOGILAB, Paris (France).
http://www.logilab.com http://www.logilab.fr http://www.logilab.org

Ludovic Aubry wrote:
On Wed, Feb 26, 2003 at 01:06:44PM -0500, Terry Reedy wrote:
This suggests a more radical optimization that might do more speedup than all the byte-code fiddling currently proposed: automatically localize, at definition time, builtins used within a function. This would be like giving each function a customized implicit set of default args: True=True, None=None, len=len, ...etc. If a program alters __builtins__, def statement placement would determine which version gets used.
I have a piece of code that does (almost) exactly that, except at runtime. It's purpose is to optimize access to globals used as constants by replacing the LOAD_GLOBAL opcode by a LOAD_CONST. It does that by creating a new code object for the function you provide and using a list of symbols to consider constant.
[code]
I suppose this kind of code could be put to some good use if we ever get some of the recently discussed function modifiers into Python:

    def myFunction(a,b) [bindglobals]:
        for i in range(len(a)):
            b[i] = math.sin(a[i]) * math.cos(2)

Hmm, now we'd only need a way to describe "this function has no side-effects and behaves like a mathematical function (same inputs map to same outputs)"... then we could also optimize cos(2) into the constants area :-)

-- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Feb 26 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
Python UK 2003, Oxford: 34 days left EuroPython 2003, Charleroi, Belgium: 118 days left

On Wed, Feb 26, 2003 at 09:55:50PM +0100, M.-A. Lemburg wrote:
Ludovic Aubry wrote:
I have a piece of code that does (almost) exactly that, except at runtime. It's purpose is to optimize access to globals used as constants by replacing the LOAD_GLOBAL opcode by a LOAD_CONST. It does that by creating a new code object for the function you provide and using a list of symbols to consider constant.
[code]
I suppose this kind of code could be put to some good use if we ever get some of the recently discussed function modifiers into Python:
    def myFunction(a,b) [bindglobals]:
        for i in range(len(a)):
            b[i] = math.sin(a[i]) * math.cos(2)
Hmm, now we'd only need a way to describe "this function has no side-effects and behaves like a mathematical function (same inputs map to same outputs)"... then we could also optimize cos(2) into the constants area :-)
With a [bindglobals] modifier we need a way to specify which objects we want to bind. We can pass a list or dict as with my example or we can have the container of the definition tell us if a symbol should be a constant. Actually my first implementation of this code was using a __consts__ variable (like Jeff's __solid__) to tell the bind function what symbol it could replace. I patched PyFunction_New to create a new code object at the time the function object was created, thus binding or not was decided by looking for the __consts__ variable in the globals. There was no need for a [bindglobals] but people not knowing I patched the interpreter were a little bit confused ;) -- Ludovic Aubry LOGILAB, Paris (France). http://www.logilab.com http://www.logilab.fr http://www.logilab.org

For my native-code translation system (which died an early death), I intended to introduce a new attribute which would indicate what (module) attributes could be considered constant. I think that for whatever reason I decided to call this __solid__. So for instance, you might write

    import math
    assert 'pi' in math.__solid__

    __solid__ = ['sys', 'math', 'MYCONST', 'MyClass', 'myfun']

    MYCONST = 22./7

    class MyClass:
        __solid__ = ['mymethod']
        def mymethod(self):
            return MYCONST - math.pi
        mymethod = staticmethod(jit(mymethod))

    def myfun():
        print MyClass.mymethod()
    myfun = jit(myfun)

    myfun()

This allows a whole range of optimizations. LOAD_GLOBAL optimizes to LOAD_CONST. LOAD_CONST + LOAD_ATTR optimizes to LOAD_CONST. Then more constant folding becomes possible. In myfun, the sequence

    LOAD_GLOBAL MyClass
    LOAD_ATTR mymethod
    CALL_FUNCTION

can be reduced to

    LOAD_CONST MyClass.mymethod
    CALL_FUNCTION

which could even allow MyClass.mymethod to be inlined. In MyClass.mymethod, the arithmetic reduces from LOAD_GLOBAL, LOAD_GLOBAL, LOAD_ATTR, BINARY_SUBTRACT to LOAD_CONST, LOAD_CONST, BINARY_SUBTRACT with compile-time-known types, which can be turned into a single constant load. In the most extreme case, the code (myfun + mymethod) could reduce to the sequence LOAD_CONST, PRINT_ITEM, PRINT_NEWLINE, RETURN_NONE, avoiding all namespace lookups and function calls.

Jeff
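CPython never grew __solid__, but the narrower constant folding Jeff describes did arrive: the compiler folds constant expressions itself, and (matching the RETURN_CONST proposal earlier in this thread) CPython 3.12 even added a RETURN_CONST opcode. The folding is easy to observe:

```python
# The compiler's constant folding turns 2 * 3 into the literal 6 at
# compile time; only the folded value appears in co_consts, the
# operands are gone.
consts = (lambda: 2 * 3).__code__.co_consts
print(6 in consts)                      # → True
print(2 in consts or 3 in consts)       # → False
```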

there's currently lots of code out there that was recently modified to read
    try:
        True
    except NameError:
        True = 1
        False = 0
Maybe assignment to them could be allowed provided the value being assigned is == to their official value? (The actual assignment needn't be done, just suppression of any exception.)

Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand

Maybe it also makes sense to use indexing into a static array, instead of the case construct.
That makes sense to me. I would be stunned if adding an inner case statement helped (the cost of a second unpredictable branch would have to be less than the cost of the cache lines used by a bit of redundant code).

My thought on the big case statement was to have it do more disaggregation. For instance, each op code already knows whether it needs an oparg, so there is no need for a separate test for it on every pass with HAS_ARG(opcode).
While I don't know if this really makes the interpreter more efficient
Time it and see. Raymond Hettinger

Raymond Hettinger wrote:
Maybe it also makes sense to use indexing into a static array, instead of the case construct.
That makes sense to me. I would be stunned if adding an inner case statement helped (the cost of a second unpredicable branch would have to be less than the cost of cache line utilization of a bit of redundant code).
It depends. The unpredictable branch isn't better than the unpredictable call to a function from an indexed table. The case just might have the advantage of living in the same code segment, at least as I know from the M$ compiler. The indirect jump table may live elsewhere in the data...
My thought on the big case statement was to have it do more disaggregration. For instance, each op code already knows whether it needs an oparg, so there is no need for a separate test for it on every pass with HAS_ARG(opcode).
This is true. I used this in my optimization from two years ago: I moved the oparg preparation into the opcode cases, not doing any test, but just fetching the argument. I also turned this into macros which advanced the instruction pointer only once. Incredible but true: most of the win I gathered was by typecasting the oparg access into a reference to a short int, instead of or-ing two bytes together. But this is due to the dumbness of those compilers...
While I don't know if this really makes the interpreter more efficient
Time it and see.
No, I will not. I just wanted to give some hints as to where I see optimizations which might be promising. But I have no time to try this myself. Also, as already stated, I don't see enough to be gained to start an effort. What can I win: 25 to 40 percent? What I'm after is an order of magnitude.

cheers - chris

This is true. I used this in my optimization from two years ago, and moved the oparg preparation into the opcode cases, not doing any test, but just fetching the argument.
What happened to the optimization? Is it not in the current code?
I also turned this into macros which added to the insn pointer only once. Incredible but true: Most of the win I gathered was by typecasting the oparg access differently into a reference to a short int, instead of oring two bytes.
I skipped over that one because I thought that it would fail on a big-endian computer. Raymond Hettinger

Raymond Hettinger wrote:
This is true. I used this in my optimization from two years ago, and moved the oparg preparation into the opcode cases, not doing any test, but just fetching the argument.
What happened to the optimization. It is not in the current code?
No. At that time, ceval speedups were not popular, so I used the optimization just to make Stackless appear faster than CPython, although CPython with it would have been *even* faster.
I also turned this into macros which added to the insn pointer only once. Incredible but true: Most of the win I gathered was by typecasting the oparg access differently into a reference to a short int, instead of oring two bytes.
I skipped over that one because I thought that it would fail on a big-endian computer.
Sure it would fail. But on little-endian machines like x86 it produced much faster code, so I only defined it for certain platforms.

ciao - chris

Christian Tismer wrote:

So what I would like to try is to merge all opcodes which have exactly the same pattern into one case, which then does the common preparation stuff and postprocessing including error handling, but internally dispatches on the function to be called.
I tried this on the 2.3a2 source, modifying it to implement your idea. This was implemented on top of the LOAD_FAST_n, etc, changes.

No luck - it makes the code slower by nearly 5%, decreasing a base 22300 PyStones to 21100 PyStones. That is, the idea of using function pointers. I imagine that a case statement would be even slower.

I didn't implement the table of function pointers as one monolithic table, as the unary and binary functions have different signatures.

    static binaryfunc binop[] = {
        PyNumber_Multiply, PyNumber_TrueDivide, PyNumber_TrueDivide,
        PyNumber_FloorDivide, PyNumber_Remainder, PyNumber_Lshift,
        PyNumber_Rshift, PyNumber_And, PyNumber_Xor, PyNumber_Or,
        PyNumber_InPlaceMultiply, PyNumber_InPlaceTrueDivide,
        PyNumber_InPlaceTrueDivide, PyNumber_InPlaceFloorDivide,
        PyNumber_InPlaceRemainder, PyNumber_InPlaceLshift,
        PyNumber_InPlaceRshift, PyNumber_InPlaceAnd,
        PyNumber_InPlaceXor, PyNumber_InPlaceOr
    };

    switch (opcode) {
    ...
    case BINARY_MULTIPLY:
    case BINARY_DIVIDE:
    case BINARY_TRUE_DIVIDE:
    case BINARY_FLOOR_DIVIDE:
    case BINARY_MODULO:
    case BINARY_LSHIFT:
    case BINARY_RSHIFT:
    case BINARY_AND:
    case BINARY_XOR:
    case BINARY_OR:
    case INPLACE_MULTIPLY:
    case INPLACE_DIVIDE:
    case INPLACE_TRUE_DIVIDE:
    case INPLACE_FLOOR_DIVIDE:
    case INPLACE_MODULO:
    case INPLACE_LSHIFT:
    case INPLACE_RSHIFT:
    case INPLACE_AND:
    case INPLACE_XOR:
    case INPLACE_OR:
        w = POP();
        v = TOP();
        x = binop[opcode - BINARY_MULTIPLY](v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        SET_TOP(x);
        if (x != NULL) continue;
        break;
    ...

"Christian Tismer" <tismer@tismer.com> wrote in message news:3E5C3915.6070600@tismer.com...
Christian Tismer wrote: [much about bytecode, folding, ...]
Well, to stop myself blathering about bytecode and how to make it different, which I'm expected to do other things or at least sleep, I can't resist to throw this rigorous idea into py-dev, which struck me at 4:30 in the morning:
What would happen if we re-structured the ceval loop this way: we look at every similar pattern and group together all the opcodes which have a similar pattern. By "pattern" I mean the surrounding calls of simple things like stack operations before and after, and error handling.
So what I would like to try is to merge all opcodes which have exactly the same pattern into one case, which then does the common preparation stuff and postprocessing including error handling, but internally dispatches on the function to be called.
No idea if it may be an improvement, but just let me throw it in. Example:
    case UNARY_POSITIVE:
        v = POP();
        x = PyNumber_Positive(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;

    case UNARY_NEGATIVE:
        v = POP();
        x = PyNumber_Negative(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;
    ...
    case UNARY_CONVERT:
        v = POP();
        x = PyObject_Repr(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;

    case UNARY_INVERT:
        v = POP();
        x = PyNumber_Invert(v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;
is converted into
    case UNARY_POSITIVE:
    case UNARY_NEGATIVE:
    case UNARY_CONVERT:
    case UNARY_INVERT:
        v = POP();
        switch (opcode) {
        case UNARY_POSITIVE:
            x = PyNumber_Positive(v);
            break;
        case UNARY_NEGATIVE:
            x = PyNumber_Negative(v);
            break;
        case UNARY_CONVERT:
            x = PyObject_Repr(v);
            break;
        case UNARY_INVERT:
            x = PyNumber_Invert(v);
        }
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;
Maybe it also makes sense to use indexing into a static array, instead of the case construct. Note that there can be one single such table for all opcodes and all cases, since opcodes are still disjoint. It depends where this table is stored and if this can get in the cache.
While I don't know if this really makes the interpreter more efficient, at least it makes it shorter to read and maybe easier to maintain.
Ok ok, I will sleep now :-)
cheers - chris
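The grouped-dispatch idea is easy to model outside of C. A toy stack machine (opcode numbers invented for illustration) that handles a whole family of binary opcodes in one arm by indexing into a table, just as the merged case above dispatches on opcode:

```python
import operator

# One handler for all binary opcodes: shared stack manipulation,
# with a table lookup for the operation itself.
LOAD_CONST = 0    # invented opcode number
BINARY_BASE = 10  # invented opcode number of the first binary op
BINARY_OPS = [operator.mul, operator.truediv, operator.floordiv,
              operator.mod, operator.add, operator.sub]

def run(bytecode):
    stack = []
    for op, arg in bytecode:
        if op == LOAD_CONST:
            stack.append(arg)
        elif BINARY_BASE <= op < BINARY_BASE + len(BINARY_OPS):
            w = stack.pop()
            v = stack.pop()
            stack.append(BINARY_OPS[op - BINARY_BASE](v, w))
    return stack

# LOAD_CONST 6; LOAD_CONST 7; BINARY_MULTIPLY
print(run([(LOAD_CONST, 6), (LOAD_CONST, 7), (BINARY_BASE, None)]))  # → [42]
```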

Damien Morton wrote:
Christian Tismer wrote: So what I would like to try is to merge all opcodes which have exactly the same pattern into one case, which then does the common preparation stuff and postprocessing including error handling, but internally dispatches on the function to be called.
I tried this on the 2.3a2 source, modifying it to implement your idea.
This was implemented on top of the LOAD_FAST_n, etc, changes.
No luck - it makes the code slower by nearly 5%, decreasing a base 22300 PyStones to 21100 PyStones. That is, the idea of using function pointers. I imagine that a case statement would be even slower.
Too bad. Hmm. Would you mind to send me the source code? I'd like to see what code is generated.
I didnt implement the table of function pointers as one monolithic table, as the unary and binary functions have different signatures.
They could live in the same table, but probably this doesn't change the memory layout at all. A case statement might or might not be slower; a branch might become a little more predictable than through a function table. And then, five percent is not bad. Did you measure how much code you saved in the binary?

thanks - chris

On Wed, Feb 26, 2003 at 04:48:37AM +0100, Christian Tismer wrote:
Maybe it also makes sense to use indexing into a static array, instead of the case construct. Note that there can be one single such table for all opcodes and all cases, since opcodes are still disjoint. It depends where this table is stored and if this can get in the cache.
While I don't know if this really makes the interpreter more efficient, at least it makes it shorter to read and maybe easier to maintain.
Been there, done that: http://python.org/sf/693638

I already rejected the patch. :-) Making my own jump table, rather than using a switch, was about 15% slower. Read the patch for more info. While I'm sure the patch could be improved, I don't think it would have made enough of a difference to warrant a change.

Neal

Neal Norwitz wrote:
On Wed, Feb 26, 2003 at 04:48:37AM +0100, Christian Tismer wrote:
Maybe it also makes sense to use indexing into a static array, instead of the case construct. Note that there can be one single such table for all opcodes and all cases, since opcodes are still disjoint. It depends where this table is stored and if this can get in the cache.
While I don't know if this really makes the interpreter more efficient, at least it makes it shorter to read and maybe easier to maintain.
Been there, done that: http://python.org/sf/693638
I already rejected the patch. :-) Making my own jump table, rather than using a switch was about 15% slower.
Oh, that was not what I meant. I also did this two years ago and tossed it. Function calls are too expensive. What I meant was to fold opcodes by common patterns. Unfortunately this is slower, too. Anyway, I didn't want to get too deep into this. Stopping wasting time now :-)

cheers - chris

Chris Tismer wrote:
Oh, that was not what I meant. I also did this two years ago and tossed it. Function calls are too expensive. What I mean was to fold opcodes by common patterns. Unfortunately this is slower, too.
Anyway, I didn't want to get too deep into this. Stopping wasting time now :-)
Chris already knows this, but it's worth repeating for people who don't. A function call isn't always too expensive; it depends on how much work the opcode is doing, and on lots of other hard-to-predict effects of the generated code and its interaction with the memory system. The various function call opcodes regularly call out to separate functions. I recall benchmarking various options, and often moving big chunks of code out of the mainloop and into functions improved performance slightly. Except when it didn't <0.3 wink>.

If you are benchmarking various opcode effects, I'd recommend trying to revive the simple cycle counter instrumentation I did for Python 2.2. The idea is to use the Pentium cycle counter to measure the number of cycles spent on each trip through the mainloop. A rough conclusion from the previous measurements was that trivial opcodes like POP_TOP can execute in less than 100 cycles, including opcode dispatch. An opcode that involves calling out to a C function never executes in less than 100 cycles, and often takes 100s of cycles. There's a patch floating around sourceforge somewhere.

Jeremy
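The Pentium cycle-counter patch is long gone, but the shape of Jeremy's measurement translates to timeit: compare a loop of near-trivial work against one that calls out to a C function per iteration, and the dispatch-versus-call cost difference shows up directly. A rough sketch (absolute numbers are machine-dependent):

```python
import timeit

# A pass-only loop approximates pure dispatch overhead; adding a call
# to a C builtin (len) per iteration approximates "calling out to C".
dispatch_only = timeit.timeit("for i in r: pass",
                              setup="r = range(1000)", number=2000)
with_c_call = timeit.timeit("for i in r: len(s)",
                            setup="r = range(1000); s = 'x'", number=2000)

print(dispatch_only, with_c_call)
```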

Excuse me for not following this thread closely. I believe many of these ideas have been suggested before, but most haven't been benchmarked rigorously. If you really want fame and fortune, try designing a more representative benchmark. --Guido van Rossum (home page: http://www.python.org/~guido/)

Neil Schemenauer wrote:
Guido van Rossum wrote:
Some of that is already done (these say 'continue' instead of 'break'). But I'm sure more can be done.
I think Christian was referring to the ones that say "goto fast_next_opcode".
(Nodding fast and repeatedly, with a smile)

-- Christian Tismer
participants (18)

- Christian Tismer
- Damien Morton
- damien morton
- Damien Morton
- Fred L. Drake, Jr.
- Greg Ewing
- Guido van Rossum
- Jeff Epler
- Jeremy Hylton
- logistix
- Ludovic Aubry
- M.-A. Lemburg
- Neal Norwitz
- Neil Schemenauer
- Paul Hughett
- Raymond Hettinger
- Skip Montanaro
- Terry Reedy