FYI. It would be interesting to collect some feedback on these ideas for popular combos. Who knows... Forwarded message:
From: morton@dennisinter.com (Damien Morton)
To: <Vladimir.Marangozov@inrialpes.fr>
Subject: further optimising the micro-optimisations for cache locality
Date: Fri, 28 Jul 2000 14:34:29 -0400
Folded version:
case BINARY_ADD:
        w = POP();
        v = POP();
        x = PyNumber_Add(v, w);
        goto end_binop;

case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        goto end_binop;

...
end_binop:
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;
Further folded version:
case BINARY_ADD:
        f = PyNumber_Add;
        goto do_binop;

case BINARY_MULTIPLY:
        f = PyNumber_Multiply;
        goto do_binop;

...
do_binop:
        w = POP();
        v = POP();
        x = f(v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;
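Note that PyNumber_Power takes a third argument, so BINARY_POWER cannot share `f` directly with the two-argument calls above; the further-folded version below handles that with a separate DO_BINOP2 op kind. As a minimal sketch of another way to keep it on the same do_binop path, one could use a small adapter with the two-argument shape -- the adapter name here is purely illustrative and not part of ceval.c:

#include <Python.h>

/* hypothetical adapter: gives the ternary PyNumber_Power the same
   two-argument shape as the other binary operations */
static PyObject *
power_as_binop(PyObject *v, PyObject *w)
{
        return PyNumber_Power(v, w, Py_None);   /* no modulus argument */
}

/* ...
case BINARY_POWER:
        f = power_as_binop;
        goto do_binop;
... */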
Further further folded version (I haven't looked at the main-loop code, but can guess its structure):
I assume 'op' is the opcode we are switching on.
// the function to call for this op
PyObject *(*op_func[])() = { ..., PyNumber_Add, PyNumber_Multiply, ... };

// the kind of op this function implements
unsigned char op_type[] = { ..., DO_BINOP1, DO_BINOP2, DO_UNOP, ... };

// These two tables will stay cached because of how frequently they are accessed.
// I've used a table of pointers and a table of bytes to reduce the memory required.
// Because the tables are small, locality within and between the tables isn't important,
// though it might be an idea to force the tables into contiguous memory somehow.
// I could also have used a table of structs, but alignment would increase memory usage.
unsigned char op;
switch (op_type[op]) {

case DO_BINOP1:
        w = POP();
        v = POP();
        x = op_func[op](v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

case DO_BINOP2:
        w = POP();
        v = POP();
        x = op_func[op](v, w, Py_None);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

case DO_UNOP:
        v = POP();
        x = op_func[op](v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;
}
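For concreteness, here is a minimal, self-contained sketch of the same table-driven dispatch idea, written as a toy stack machine on plain C longs rather than against ceval.c; the toy opcodes and helper functions are made up for the example, only the op_func/op_type shape comes from the mail above:

#include <stdio.h>

enum { OP_ADD, OP_MUL, OP_NEG, OP_HALT, NUM_OPS };   /* toy opcodes */
enum { DO_BINOP, DO_UNOP, DO_HALT };                 /* op kinds */

typedef void (*anyfn)(void);            /* generic function pointer */
typedef long (*binop_t)(long, long);
typedef long (*unop_t)(long);

static long add_op(long a, long b) { return a + b; }
static long mul_op(long a, long b) { return a * b; }
static long neg_op(long a)         { return -a; }

/* one table for the function to call, one byte-sized table for its kind */
static anyfn op_func[NUM_OPS] = { (anyfn)add_op, (anyfn)mul_op, (anyfn)neg_op, NULL };
static unsigned char op_type[NUM_OPS] = { DO_BINOP, DO_BINOP, DO_UNOP, DO_HALT };

int main(void)
{
        long stack[16] = { 3, 4, 5 };       /* pre-loaded operands */
        int sp = 3;                         /* next free stack slot */
        unsigned char code[] = { OP_ADD, OP_MUL, OP_NEG, OP_HALT };

        for (int pc = 0; ; pc++) {
                unsigned char op = code[pc];
                switch (op_type[op]) {
                case DO_BINOP: {
                        long w = stack[--sp], v = stack[--sp];
                        stack[sp++] = ((binop_t)op_func[op])(v, w);
                        continue;
                }
                case DO_UNOP: {
                        long v = stack[--sp];
                        stack[sp++] = ((unop_t)op_func[op])(v);
                        continue;
                }
                case DO_HALT:
                        printf("result: %ld\n", stack[sp - 1]);   /* -(3 * (4 + 5)) = -27 */
                        return 0;
                }
        }
}

Whether the extra indirect call beats duplicated switch bodies depends on the compiler and the CPU's caches, which is exactly the trade-off discussed below.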
===================== original message ================================= Guido van Rossum wrote:
[me, on micro-optimizing the main loop]
Fair enough (for one particular CPU at least).
Since we're at it, it's worth mentioning another conclusion we came across at the time: the cache effects in the main loop are significant, so it is more important to keep the main loop small enough that those effects are minimized.
An experiment I did at the time gave some delta speedup: I folded most of the UNOP & BINOP opcodes, since they differ only by the function they call and otherwise duplicate most of the opcode body. Like this:
Original:

case BINARY_POWER:
        w = POP();
        v = POP();
        x = PyNumber_Power(v, w, Py_None);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

...
Folded version:
case BINARY_POWER:
        w = POP();
        v = POP();
        x = PyNumber_Power(v, w, Py_None);
        goto end_binop;

case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        goto end_binop;

...
end_binop:
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;
This reduced the code size of ceval.c, which resulted in fewer cache misses and more speed, despite the additional jumps. It possibly results in fewer page faults too, although that is unlikely to matter.
Which makes me think that, if we want to do something about cache effects, it is probably not a bad idea to just "reorder" the bytecodes in the big switch by decreasing frequency (we have some stats on this -- I believe Skip and MAL have discussed the opcodes' frequencies, and the charts lie somewhere in the archives). I remember Marc-Andre had done something in this direction and reported some performance improvements too. Since reordering the opcodes doesn't really hurt, if I do anything with the main loop, it will be only this.
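As a rough sketch of how such frequency data could be gathered (this is not code from ceval.c or from the archived charts; the counter array, the stand-in bytecode stream, and the dump logic are all made up for illustration):

#include <stdio.h>
#include <stdlib.h>

#define NUM_OPCODES 256

/* hypothetical counters; in ceval.c the increment would sit right
   after the opcode fetch in the main loop */
static unsigned long opcode_count[NUM_OPCODES];

static int by_count_desc(const void *a, const void *b)
{
        unsigned long ca = opcode_count[*(const unsigned char *)a];
        unsigned long cb = opcode_count[*(const unsigned char *)b];
        return (ca < cb) - (ca > cb);       /* larger counts sort first */
}

int main(void)
{
        /* stand-in "bytecode" stream, just to exercise the counters */
        unsigned char code[] = { 124, 124, 23, 124, 106, 23, 83 };
        for (size_t i = 0; i < sizeof code; i++)
                opcode_count[code[i]]++;

        unsigned char order[NUM_OPCODES];
        for (int op = 0; op < NUM_OPCODES; op++)
                order[op] = (unsigned char)op;
        qsort(order, NUM_OPCODES, 1, by_count_desc);

        /* print opcodes by decreasing frequency -- the order in which the
           switch cases could then be laid out */
        for (int i = 0; i < NUM_OPCODES && opcode_count[order[i]]; i++)
                printf("opcode %3d: %lu\n", order[i], opcode_count[order[i]]);
        return 0;
}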
Micro-optimization isn't worth the effort. As to the 2-byte arg limit, I think I'd be happy if the compiler raised an exception whenever this limit is exceeded.
That would be a good first step indeed!
I'll try to have a look, if someone else is not more reactive than me.
-- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252