further optimising the micro-optimisations for cache locality (fwd)
FYI. It would be interesting to collect some feedback on these ideas for popular combos. Who knows... Forwarded message:
From: morton@dennisinter.com (Damien Morton)
To: Vladimir.Marangozov@inrialpes.fr
Subject: further optimising the micro-optimisations for cache locality
Date: Fri, 28 Jul 2000 14:34:29 -0400
Folded version:
case BINARY_ADD:
    w = POP();
    v = POP();
    x = PyNumber_Add(v, w);
    goto end_binop;

case BINARY_MULTIPLY:
    w = POP();
    v = POP();
    x = PyNumber_Multiply(v, w);
    goto end_binop;

...

end_binop:
    Py_DECREF(v);
    Py_DECREF(w);
    PUSH(x);
    if (x != NULL)
        continue;
    break;
Further folded version:
case BINARY_ADD:
    f = PyNumber_Add;
    goto do_binop;

case BINARY_MULTIPLY:
    f = PyNumber_Multiply;
    goto do_binop;

...

do_binop:
    w = POP();
    v = POP();
    x = f(v, w);
    Py_DECREF(v);
    Py_DECREF(w);
    PUSH(x);
    if (x != NULL)
        continue;
    break;
Further further folded version: (I haven't looked at the main-loop code, but can guess its structure)
I assume 'op' is the opcode we are switching on.
// the function to call for this op
PyObject *(*op_func[])() = { ..., PyNumber_Add, PyNumber_Multiply, ... };

// the kind of op this func describes
unsigned char op_type[] = { ..., DO_BINOP1, DO_BINOP2, DO_UNOP, ... };

// These two tables will be cached because of the frequency with which they
// are accessed. I've used a table of pointers and a table of bytes to
// reduce the memory required. Because the tables are small, locality
// within and between the tables isn't important. It might be an idea to
// force the tables into contiguous memory somehow. I could also have used
// a table of structs, but alignment would increase memory usage.
unsigned char op;
switch (op_type[op]) {
case DO_BINOP1:
    w = POP();
    v = POP();
    x = op_func[op](v, w);
    Py_DECREF(v);
    Py_DECREF(w);
    PUSH(x);
    if (x != NULL)
        continue;
    break;
case DO_BINOP2:
    w = POP();
    v = POP();
    x = op_func[op](v, w, Py_None);
    Py_DECREF(v);
    Py_DECREF(w);
    PUSH(x);
    if (x != NULL)
        continue;
    break;
case DO_UNOP:
    v = POP();
    x = op_func[op](v);
    Py_DECREF(v);
    PUSH(x);
    if (x != NULL)
        continue;
    break;
}
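The table-driven dispatch above can be sketched as a self-contained toy interpreter. This is illustrative only: the opcode names, the `op_fn` typedef, and the integer "objects" are stand-ins I've made up in place of the real PyObject machinery, not CPython's actual code.

```c
/* Toy stand-ins for PyNumber_Add / PyNumber_Multiply -- hypothetical. */
typedef long (*op_fn)(long, long);

static long toy_add(long v, long w) { return v + w; }
static long toy_mul(long v, long w) { return v * w; }

enum { OP_ADD, OP_MUL, OP_HALT };
enum { DO_BINOP, DO_HALT };

/* Parallel tables, as in the text: a small pointer table plus a small
   byte table, so the hot dispatch data stays cache-friendly. */
static const op_fn        op_func[] = { toy_add, toy_mul, 0 };
static const unsigned char op_type[] = { DO_BINOP, DO_BINOP, DO_HALT };

/* Run a bytecode program against an operand stack; sp indexes the
   first free slot, matching the POP()/PUSH() idiom in the text. */
static long run(const unsigned char *code, long *stack, int sp)
{
    for (;;) {
        unsigned char op = *code++;
        switch (op_type[op]) {
        case DO_BINOP: {
            long w = stack[--sp];
            long v = stack[--sp];
            stack[sp++] = op_func[op](v, w);
            break;
        }
        case DO_HALT:
            return stack[sp - 1];
        }
    }
}
```

With the stack preloaded with {2, 3, 4}, the program {OP_MUL, OP_ADD, OP_HALT} multiplies first and then adds, computing 2 + 3 * 4.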
===================== original message ================================= Guido van Rossum wrote:
[me, on micro-optimizing the main loop]
Fair enough (for one particular CPU at least).
Since we're at it, it's worth mentioning another conclusion we came across at the time: the cache effects in the main loop are significant. It is more important to keep the main loop small, so that those effects are minimized.
An experiment I did at the time which gave some delta-speedup: I folded most of the UNOP & BINOP opcodes, since they differ only by the function they call and otherwise duplicate the opcode body. Like this:
Original:

case BINARY_POWER:
    w = POP();
    v = POP();
    x = PyNumber_Power(v, w, Py_None);
    Py_DECREF(v);
    Py_DECREF(w);
    PUSH(x);
    if (x != NULL)
        continue;
    break;

case BINARY_MULTIPLY:
    w = POP();
    v = POP();
    x = PyNumber_Multiply(v, w);
    Py_DECREF(v);
    Py_DECREF(w);
    PUSH(x);
    if (x != NULL)
        continue;
    break;

...
Folded version:
case BINARY_POWER:
    w = POP();
    v = POP();
    x = PyNumber_Power(v, w, Py_None);
    goto end_binop;

case BINARY_MULTIPLY:
    w = POP();
    v = POP();
    x = PyNumber_Multiply(v, w);
    goto end_binop;

...

end_binop:
    Py_DECREF(v);
    Py_DECREF(w);
    PUSH(x);
    if (x != NULL)
        continue;
    break;
This reduced the code size of ceval.c, which resulted in fewer cache misses and more speed, despite the additional jumps. It might reduce page faults too, although that is unlikely.
Which makes me think that, if we want to do something about cache effects, it is probably not a bad idea to just "reorder" the bytecodes in the big switch by decreasing frequency (we have some stats about this -- I believe Skip and MAL have discussed the opcodes' frequency and the charts lie somewhere in the archives). I remember Marc-Andre had done something in this direction and reported some perf improvements too. Since reordering the opcodes doesn't really hurt, if I'm about to do something with the main loop, it'll be only this.
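Gathering the frequency stats that would drive such a reordering is mechanical. A minimal sketch: walk a code object's bytecode string and tally opcodes, assuming the era's `HAVE_ARGUMENT` convention from opcode.h (opcodes at or above it carry a two-byte argument, which must be skipped so argument bytes aren't miscounted as opcodes).

```c
/* HAVE_ARGUMENT is taken from opcode.h of this era; treat the exact
   value as an assumption of this sketch. */
#define HAVE_ARGUMENT 90

/* Tally how often each opcode appears in a bytecode string, skipping
   the 2-byte oparg that follows argument-taking opcodes. */
static void count_opcodes(const unsigned char *code, int len,
                          unsigned long freq[256])
{
    int i = 0;
    while (i < len) {
        unsigned char op = code[i++];
        freq[op]++;
        if (op >= HAVE_ARGUMENT)
            i += 2;   /* skip the 2-byte argument */
    }
}
```

A table like this, accumulated over typical programs, is all that's needed to decide which cases to move to the front of the switch.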
Micro-optimization isn't worth the effort. As to the 2-byte arg limit, I think I'd be happy if the compiler raises an exception when this limit is exceeded.
That would be a good first step indeed!
I'll try to have a look, if someone else is not more reactive than me.
-- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
On Fri, Jul 28, 2000 at 09:24:36PM +0200, Vladimir Marangozov wrote:
It would be interesting to collect some feedback on these ideas for popular combos. Who knows...
I like these ideas, though I think anything beyond 'further folded' requires a separate switch for the non-common operators and for those that do a tad more than call a function with a certain number of arguments and push the result on the stack. That means renumbering the ops into fast-ops and slow-ops, as well as argument-ops and non-argument-ops. (I hope all non-argument ops fall in the 'fast' category, or it might get tricky ;-P) I'm also wondering whether they really speed things up: the added indirection might force the compiler to generate *less* efficient code. Then again, it removes some of the burden from the compiler too, so whether this has a positive effect probably depends very heavily on the compiler.
// the function to call for this op
PyObject *(*op_func[])() = { ..., PyNumber_Add, PyNumber_Multiply, ... };

// the kind of op this func describes
unsigned char op_type[] = { ..., DO_BINOP1, DO_BINOP2, DO_UNOP, ... };

// These two tables will be cached because of the frequency with which they
// are accessed. I've used a table of pointers and a table of bytes to
// reduce the memory required. Because the tables are small, locality
// within and between the tables isn't important. It might be an idea to
// force the tables into contiguous memory somehow. I could also have used
// a table of structs, but alignment would increase memory usage.
I'm not so sure about any of these comments, given that we jump to a function right after accessing these tables. I suggest heavy testing, and I can offer only two architectures myself (linux-i386 and solaris-sparc-gcc).

-- Thomas Wouters <thomas@xs4all.net>
Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Thomas Wouters wrote:
On Fri, Jul 28, 2000 at 09:24:36PM +0200, Vladimir Marangozov wrote:
It would be interesting to collect some feedback on these ideas for popular combos. Who knows...
I like these ideas, though I think anything beyond 'further folded' requires a separate switch for the non-common operators and those that do a tad more than call a function with a certain number of arguments and push the result on the stack.
I agree. Note that all we can get through this game is some dubious percents that we shouldn't ignore, but whose persistence over the long term is quite questionable. There's no way that, to pick a guy at random, Tim won't confirm this! And just like Guido, I'm not ready to trade code cleanliness for dubious speed.

However, I take the opportunity to invite you again to "heavy test" the candidate object allocator -- obmalloc.c -- which, in my educated understanding of the current C implementation, is the only piece of code capable of increasing Python's overall performance by > 10%, assuming that your script involves object allocations. It is my strong belief that this performance comes for free, by instrumenting Python at its innermost internals, not on top of it! Of course, improvements to this code are welcome and I'd be happy to discuss them in this forum.

-- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
participants (2)

- Thomas Wouters
- Vladimir.Marangozov@inrialpes.fr