[Python-Dev] Reordering opcodes (PEP 203 Augmented Assignment)

Tim Peters tim_one@email.msn.com
Sun, 30 Jul 2000 18:41:42 -0400

[Vladimir Marangozov]
> ...
> Which shows, among others, that important cache effects are
> still here, bacause "python-top5-load_fast" is slower than
> "python-top5" alone...

I would rather call them "second-order effects" (SOE), as we can actually
have no idea what causes them without a whale of a lot of work.  They
*could* be unfortunate accidents in the data or instruction caches, or
*could* be that rearranging the code caused the compiler to spill a variable
to memory that would have been better off in a register, or *could* be that
the compiler optimizer runs out of steam in a different place than before
and so generates sub-optimal code in any number of other respects, or
*could* be that the sequence of jumps just happened to change in a way that
causes one to get mis-predicted by the HW, or causes the HW prediction cache
to flush at a bad moment, or etc etc etc.

A gloss:  almost all interesting optimizations a compiler performs, from
register assignment to instruction scheduling, take exponential time to
optimize "for real".  No compiler can afford that much time (although there
are some platform-specific "superoptimizers" out there that you can run for
days over small pieces of code), so in practice compilers string together a
long chain of fast but more-or-less fragile heuristic approaches, or even
resort to general "mystery schemes" (like simulated annealing).  Predicting
what pops out the other end generally can't be done easier than by running
the optimizer and *seeing* what pops out!  So tiny changes can have
wonderfully <wink> unexpected effects, and in general you cannot out-think
them in advance.  And as CPU HW has gotten ever-more complex, SOE has spread
to cover a much larger territory than it used to (note that Intel doesn't
even pretend to document the actual P3 timing rules anymore -- they've
gotten too complicated to explain!  compilers only work from an
*approximation* to HW truth now).

That said, there's good cross-platform reason to believe that Vladimir's
"top 5" rearrangement is a helpful change, and more helpful than sucking out
one popular opcode.  It also has less potential to trigger harmful SOE,
because it doesn't change the basic-block structure of the code (optimizers
crawl over a graph of the basic blocks, and changes to the *topology* of the
graph can cause heuristics to "get stuck" in new & exciting places <0.9
wink>; this is important in ceval because the loop is soooooo big --
optimizers are likely to fall into a state of unstable equilibrium when
chewing over this much code).

+1 on top5 because it's an easy change that's likely to help and unlikely to
hurt.  -0 on peeing away more time on micro-optimization right now.

    work-on-python...<wink>-ly y'rs  - tim