Python and the need for speed
python at lucidity.plus.com
Tue Apr 18 20:07:44 EDT 2017
On 19/04/17 00:33, bartc wrote:
> So that's 'label-pointers' which I assume must correspond to computed
Yes - just different terminology. Being able to take the address of a
label and "goto address" rather than "goto label".
> (I don't know why they should be faster than a switch; they just
In C, the code generated for a switch() can do almost anything. It might
cache the expression in a temporary and generate a linear "if .. else if
.. else if ... else if" sequence (which is probably quite common for a
sparsely populated set of values, after an initial range check is done)
or it might generate a jump table (similar to the computed gotos) (which
is probably quite common for a densely populated set of values, after an
initial range check is done and an adjustment to make the index
zero-based). It could also generate code that is some sort of hybrid of
the two (for example, a switch with several densely-populated areas in
an otherwise sparse set of values might do linear range-checks to find
the right jump-table to use for each dense area).
What the computed goto stuff does is effectively reduce all of that to a
single jump table with a known set of indices - there are 256 opcodes
starting from 0 and each has an entry in an array of code pointers. To
jump to the handler for an opcode, just 'goto handlers[op]'. No range
checking, no index adjustment, no linear test sequences. This is what
makes the dispatch fast.
> With the sort of lower level programs I write (in another dynamic
> language not Python), such an assembly layer improved performance 2-3
> times over using 100% HLL compiled using C and gcc-O3.
Did you give the C compiler enough hints though? Much like the above
computed-goto stuff (which is less of a hint and more of an instruction
that the compiler should be using jump tables and nothing else for that
bit of code) there are lots of other ways of spelling things differently
in C that can give better performance than what you might get by default
(and I'm not even talking about compiler-specific #pragmas or whatever).
Also, remember that -O3 might (and by that I mean probably will! ;))
make your code larger. If you have some specific core areas of your
interpreter that are now large enough to cause instruction cache misses
then a smaller -O2 (or even -Os) compiled version might perform better
on your hardware.
More information about the Python-list