
[Vladimir Marangozov]
... Which shows, among other things, that important cache effects are still here, because "python-top5-load_fast" is slower than "python-top5" alone...
I would rather call them "second-order effects" (SOE), as we really can have no idea what causes them without a whale of a lot of work. They *could* be unfortunate accidents in the data or instruction caches, or *could* be that rearranging the code caused the compiler to spill a variable to memory that would have been better off in a register, or *could* be that the compiler's optimizer runs out of steam in a different place than before and so generates sub-optimal code in any number of other respects, or *could* be that the sequence of jumps just happened to change in a way that causes one to get mis-predicted by the HW, or causes the HW prediction cache to flush at a bad moment, or etc etc etc.

A gloss: almost all interesting optimizations a compiler performs, from register assignment to instruction scheduling, take exponential time to do "for real". No compiler can afford that much time (although there are some platform-specific "superoptimizers" out there that you can run for days over small pieces of code), so in practice compilers string together a long chain of fast but more-or-less fragile heuristic approaches, or even resort to general "mystery schemes" (like simulated annealing). Predicting what pops out the other end generally can't be done any more easily than by running the optimizer and *seeing* what pops out! So tiny changes can have wonderfully <wink> unexpected effects, and in general you cannot out-think them in advance. And as CPU HW has gotten ever-more complex, SOE has spread to cover a much larger territory than it used to (note that Intel doesn't even pretend to document the actual P3 timing rules anymore -- they've gotten too complicated to explain! compilers only work from an *approximation* to HW truth now).

That said, there's good cross-platform reason to believe that Vladimir's "top 5" rearrangement is a helpful change, and more helpful than sucking out one popular opcode. It also has less potential to trigger harmful SOE, because it doesn't change the basic-block structure of the code (optimizers crawl over a graph of the basic blocks, and changes to the *topology* of the graph can cause heuristics to "get stuck" in new & exciting places <0.9 wink>; this is important in ceval because the loop is soooooo big -- optimizers are likely to fall into a state of unstable equilibrium when chewing over this much code).

+1 on top5 because it's an easy change that's likely to help and unlikely to hurt.

-0 on peeing away more time on micro-optimization right now.

then-again-if-it's-the-only-way-guido-can-take-time-to-actually-
    work-on-python...<wink>-ly y'rs  - tim
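
[For readers without ceval.c at hand, here is a minimal sketch of the kind of "top 5" rearrangement under discussion -- a toy dispatch loop, not the actual ceval code. The opcodes, stack, and bytecode below are invented for illustration; the point is only that moving the hottest cases to the front of the switch reorders case labels inside the one big loop without adding new control flow, so the basic-block graph the optimizer crawls over keeps its shape.]

    /* Toy bytecode interpreter: the case labels for the most frequently
     * executed opcodes are placed first in the dispatch switch, per
     * (hypothetical) dynamic frequency counts.  NOT the real ceval.c. */
    #include <stdio.h>

    enum { OP_LOAD_CONST, OP_ADD, OP_PRINT, OP_HALT };  /* invented opcodes */

    int main(void)
    {
        /* toy "bytecode": push 2, push 3, add, print, halt */
        int code[] = { OP_LOAD_CONST, 2, OP_LOAD_CONST, 3,
                       OP_ADD, OP_PRINT, OP_HALT };
        int stack[16], sp = 0, pc = 0;

        for (;;) {
            switch (code[pc++]) {
            /* hottest opcodes first -- reordering these cases doesn't
             * change the loop's basic-block topology */
            case OP_LOAD_CONST:
                stack[sp++] = code[pc++];
                continue;
            case OP_ADD:
                sp--;
                stack[sp - 1] += stack[sp];
                continue;
            /* colder opcodes follow in their original order */
            case OP_PRINT:
                printf("%d\n", stack[--sp]);   /* prints 5 */
                continue;
            case OP_HALT:
                return 0;
            }
        }
    }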