FYI. It would be interesting to collect some feedback on these ideas for popular combos. Who knows... Forwarded message:
From: morton@dennisinter.com (Damien Morton)
To: <Vladimir.Marangozov@inrialpes.fr>
Subject: further optimising the micro-optimisations for cache locality
Date: Fri, 28 Jul 2000 14:34:29 -0400
Folded version:
case BINARY_ADD:
        w = POP();
        v = POP();
        x = PyNumber_Add(v, w);
        goto end_binop;

case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        goto end_binop;

...
end_binop:
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;
Further folded version:
case BINARY_ADD:
        f = PyNumber_Add;
        goto do_binop;

case BINARY_MULTIPLY:
        f = PyNumber_Multiply;
        goto do_binop;

...
do_binop:
        w = POP();
        v = POP();
        x = f(v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;
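Note that PyNumber_Power takes a third argument, so BINARY_POWER cannot share `f` directly with the two-argument calls above; the further-folded version below handles that with a separate DO_BINOP2 op kind. As a minimal sketch of another way to keep it on the same do_binop path, one could use a small adapter with the two-argument shape -- the adapter name here is purely illustrative and not part of ceval.c:

#include <Python.h>

/* hypothetical adapter: gives the ternary PyNumber_Power the same
   two-argument shape as the other binary operations */
static PyObject *
power_as_binop(PyObject *v, PyObject *w)
{
        return PyNumber_Power(v, w, Py_None);   /* no modulus argument */
}

/* ...
case BINARY_POWER:
        f = power_as_binop;
        goto do_binop;
... */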
Further further folded version (I haven't looked at the main-loop code, but can guess its structure):
I assume 'op' is the opcode we are switching on.
// the function to call for this op
PyObject *(*op_func[])() = { ..., PyNumber_Add, PyNumber_Multiply, ... };

// the kind of op this function implements
unsigned char op_type[] = { ..., DO_BINOP1, DO_BINOP2, DO_UNOP, ... };

// These two tables will stay cached because of how frequently they are accessed.
// I've used a table of pointers and a table of bytes to reduce the memory required.
// Because the tables are small, locality within and between the tables isn't important,
// though it might be an idea to force the tables into contiguous memory somehow.
// I could also have used a table of structs, but alignment would increase memory usage.
unsigned char op;
switch (op_type[op]) {

case DO_BINOP1:
        w = POP();
        v = POP();
        x = op_func[op](v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

case DO_BINOP2:
        w = POP();
        v = POP();
        x = op_func[op](v, w, Py_None);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

case DO_UNOP:
        v = POP();
        x = op_func[op](v);
        Py_DECREF(v);
        PUSH(x);
        if (x != NULL) continue;
        break;
}
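For concreteness, here is a minimal, self-contained sketch of the same table-driven dispatch idea, written as a toy stack machine on plain C longs rather than against ceval.c; the toy opcodes and helper functions are made up for the example, only the op_func/op_type shape comes from the mail above:

#include <stdio.h>

enum { OP_ADD, OP_MUL, OP_NEG, OP_HALT, NUM_OPS };   /* toy opcodes */
enum { DO_BINOP, DO_UNOP, DO_HALT };                 /* op kinds */

typedef void (*anyfn)(void);            /* generic function pointer */
typedef long (*binop_t)(long, long);
typedef long (*unop_t)(long);

static long add_op(long a, long b) { return a + b; }
static long mul_op(long a, long b) { return a * b; }
static long neg_op(long a)         { return -a; }

/* one table for the function to call, one byte-sized table for its kind */
static anyfn op_func[NUM_OPS] = { (anyfn)add_op, (anyfn)mul_op, (anyfn)neg_op, NULL };
static unsigned char op_type[NUM_OPS] = { DO_BINOP, DO_BINOP, DO_UNOP, DO_HALT };

int main(void)
{
        long stack[16] = { 3, 4, 5 };       /* pre-loaded operands */
        int sp = 3;                         /* next free stack slot */
        unsigned char code[] = { OP_ADD, OP_MUL, OP_NEG, OP_HALT };

        for (int pc = 0; ; pc++) {
                unsigned char op = code[pc];
                switch (op_type[op]) {
                case DO_BINOP: {
                        long w = stack[--sp], v = stack[--sp];
                        stack[sp++] = ((binop_t)op_func[op])(v, w);
                        continue;
                }
                case DO_UNOP: {
                        long v = stack[--sp];
                        stack[sp++] = ((unop_t)op_func[op])(v);
                        continue;
                }
                case DO_HALT:
                        printf("result: %ld\n", stack[sp - 1]);   /* -(3 * (4 + 5)) = -27 */
                        return 0;
                }
        }
}

Whether the extra indirect call beats duplicated switch bodies depends on the compiler and the CPU's caches, which is exactly the trade-off discussed below.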
===================== original message ================================= Guido van Rossum wrote:
[me, on micro-optimizing the main loop]
Fair enough (for one particular CPU at least).
Since we're at it, it's worth mentioning another conclusion we came across at the time: the cache effects in the main loop are significant, so it is more important to keep the main loop small enough that those effects are minimized.
An experiment I did at the time gave some delta speedup: I folded most of the UNOP & BINOP opcodes, since they differ only by the function they call and otherwise duplicate most of the opcode body. Like this:
Original:

case BINARY_POWER:
        w = POP();
        v = POP();
        x = PyNumber_Power(v, w, Py_None);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

...
Folded version:
case BINARY_POWER:
        w = POP();
        v = POP();
        x = PyNumber_Power(v, w, Py_None);
        goto end_binop;

case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        goto end_binop;

...
end_binop:
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;
This reduced the code size of ceval.c, which resulted in fewer cache misses and more speed, despite the additional jumps. It possibly results in fewer page faults too, although that is unlikely to matter.
Which makes me think that, if we want to do something about cache effects, it is probably not a bad idea to just "reorder" the bytecodes in the big switch by decreasing frequency (we have some stats on this -- I believe Skip and MAL have discussed the opcodes' frequencies, and the charts lie somewhere in the archives). I remember Marc-Andre had done something in this direction and reported some performance improvements too. Since reordering the opcodes doesn't really hurt, if I do anything with the main loop, it will be only this.
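As a rough sketch of how such frequency data could be gathered (this is not code from ceval.c or from the archived charts; the counter array, the stand-in bytecode stream, and the dump logic are all made up for illustration):

#include <stdio.h>
#include <stdlib.h>

#define NUM_OPCODES 256

/* hypothetical counters; in ceval.c the increment would sit right
   after the opcode fetch in the main loop */
static unsigned long opcode_count[NUM_OPCODES];

static int by_count_desc(const void *a, const void *b)
{
        unsigned long ca = opcode_count[*(const unsigned char *)a];
        unsigned long cb = opcode_count[*(const unsigned char *)b];
        return (ca < cb) - (ca > cb);       /* larger counts sort first */
}

int main(void)
{
        /* stand-in "bytecode" stream, just to exercise the counters */
        unsigned char code[] = { 124, 124, 23, 124, 106, 23, 83 };
        for (size_t i = 0; i < sizeof code; i++)
                opcode_count[code[i]]++;

        unsigned char order[NUM_OPCODES];
        for (int op = 0; op < NUM_OPCODES; op++)
                order[op] = (unsigned char)op;
        qsort(order, NUM_OPCODES, 1, by_count_desc);

        /* print opcodes by decreasing frequency -- the order in which the
           switch cases could then be laid out */
        for (int i = 0; i < NUM_OPCODES && opcode_count[order[i]]; i++)
                printf("opcode %3d: %lu\n", order[i], opcode_count[order[i]]);
        return 0;
}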
Micro-optimization isn't worth the effort. As to the 2-byte arg limit, I think I'd be happy if the compiler raised an exception whenever this limit is exceeded.
That would be a good first step indeed!
I'll try to have a look, if someone else is not more reactive than me.
-- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252