
Guido van Rossum wrote:
Since we're at it, it's worth mentioning another conclusion we came across at the time: the cache effects in the main loop are significant -- it is important to keep the main loop small enough that those effects are minimized.
Yes, that's what Tim keeps hammering on too.
+1 from here ;-) I have done quite a bit of testing with the old 1.5 ceval loop (patch still available in case someone wants to look at it) and found these things:

1. It pays off to special-case LOAD_FAST by handling it before going into the switch at all, since it is the most often executed opcode in Python.

2. Reordering the opcodes has some noticeable effect on performance too, even though it is very sensitive to cache lines.

3. Splitting the big switch in two, with the first one holding most of the often-used opcodes while the second one takes care of the more esoteric ones, helps keep the inner loop in the CPU cache and thus increases performance.