I optimised the layout of the Python opcodes using a simulated annealing process that scored adjacent opcodes according to their frequency of co-occurrence. This raised my PyStone benchmark from 22100 to 22700, a gain of about 3%.

I've been using Skip's DXP server to gather statistics, but there isn't much data there yet. I should be able to achieve better results if more people contribute stats to his server. More information about it can be found here:

    http://manatee.mojam.com/~skip/python/

The process of laying out the opcodes and switch cases has largely been automated, so generating new layouts is relatively quick and painless. Please do contribute stats for 2.3a2 to Skip's DXP server.

I also implemented a LOAD_FASTER opcode, with the argument encoded into the opcode itself. This raised my PyStone benchmark from 22700 to 23150, for a total gain of about 5% over the original 22100.

The main switch loop now looks like this:

    if (opcode >= LOAD_FASTER) {
        load_fast(opcode - LOAD_FASTER);
        ...
        goto fast_next_opcode;
    }
    switch (opcode) {
    case LOAD_ATTR:
        oparg = NEXTARG();
        w = GETITEM(names, oparg);
        ...
        break;
    ...
    }

Each opcode case now loads its own argument as necessary, so the main loop no longer tests HAVE_ARGUMENT on every instruction. That test is now implemented as a lookup in an array of bytes indexed by opcode. It happens very infrequently, so any performance loss is negligible.

    const char HASARG[] = {
        0,  /* STOP_CODE */
        1,  /* LOAD_ATTR */
        1,  /* CALL_FUNCTION */
        1,  /* STORE_FAST */
        0,  /* BINARY_ADD */
        0,  /* SLICE+0 */
        0,  /* SLICE+1 */
        0,  /* SLICE+2 */
        ...
    };
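
To make the dispatch scheme described above concrete, here is a minimal, self-contained sketch in C. It is not the patch itself. The toy opcode numbering, the OP_ names, the one-byte arguments, and the run() and count_instructions() helpers are all assumptions made up for illustration, and none of them match CPython's real opcode values or ceval.c. It only shows the three ideas together: a range test before the switch for opcodes that carry their argument in the opcode, cases that fetch their own argument, and a byte table consulted only by code that has to step over instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcode set; the numbering is illustrative only. */
enum {
    OP_STOP        = 0,   /* no argument */
    OP_ADD         = 1,   /* no argument: pop two values, push their sum */
    OP_LOAD_CONST  = 2,   /* one-byte argument: index into consts */
    OP_STORE_FAST  = 3,   /* one-byte argument: local slot */
    OP_LOAD_FASTER = 4    /* 4..7: local slot is opcode - OP_LOAD_FASTER */
};

#define NUM_FAST_SLOTS 4
#define STACK_SIZE     16

/* 1 if the opcode is followed by an argument byte, indexed by opcode.
   Only code that walks the bytecode needs it, not the hot loop. */
static const char HASARG[OP_LOAD_FASTER] = {
    0,  /* OP_STOP */
    0,  /* OP_ADD */
    1,  /* OP_LOAD_CONST */
    1,  /* OP_STORE_FAST */
};

/* Count instructions, using HASARG to step over argument bytes. */
size_t count_instructions(const unsigned char *code, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        unsigned char op = code[i++];
        if (op < OP_LOAD_FASTER && HASARG[op])
            i++;                      /* skip the argument byte */
        n++;
    }
    return n;
}

/* Run the bytecode and return the top of the stack at OP_STOP. */
long run(const unsigned char *code, const long *consts)
{
    long stack[STACK_SIZE];
    long locals[NUM_FAST_SLOTS] = {0};
    int sp = 0;
    const unsigned char *pc = code;

    for (;;) {
        unsigned char opcode = *pc++;

        /* Fast path, tested before the switch: the argument is
           encoded in the opcode, so no argument byte is fetched. */
        if (opcode >= OP_LOAD_FASTER) {
            int slot = opcode - OP_LOAD_FASTER;
            assert(slot < NUM_FAST_SLOTS && sp < STACK_SIZE);
            stack[sp++] = locals[slot];
            continue;
        }

        switch (opcode) {
        case OP_LOAD_CONST:           /* each case fetches its own argument */
            assert(sp < STACK_SIZE);
            stack[sp++] = consts[*pc++];
            break;
        case OP_STORE_FAST: {
            unsigned char slot = *pc++;
            assert(slot < NUM_FAST_SLOTS && sp > 0);
            locals[slot] = stack[--sp];
            break;
        }
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_STOP:
        default:
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
}
```

In this sketch a caller would build an unsigned char array of opcodes and a long array of constants and pass both to run(). Placing the range test ahead of the switch means the commonest load pays for one comparison and no argument fetch, which is the trade the LOAD_FASTER change makes.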