getting rid of data movement instructions

Tim Peters tim.one at home.com
Sun Aug 19 15:16:20 EDT 2001


[Skip Montanaro]
> ...
> The line in MAL's table that has me scratching my head is the one that
> indicates about 8.5% of all instructions executed in his mix are
> LOAD_NAMEs.
> My understanding is that LOAD_NAME is only generated in the presence of
> "from m import *" or "exec".
> Either the code that MAL profiled was unusual in this respect or
> there's a lot more of this lookup-busting code being executed than I
> thought.

Quite a while ago Python also generated LOAD_NAME when it *could* have been
generating LOAD_GLOBAL; that was repaired quite a while ago too.  In any
case, it's a waste of time to worry about optimizing LOAD_NAME now ("import
*" isn't even legal except at module scope (see the Ref Man -- it's been
officially undefined since the first Python release!), and "exec" w/o
explicitly naming a dict is at best sloppy practice -- no need to cater to
either).
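You can see the split for yourself with the dis module (opcode names here are from a modern CPython; details vary across releases).  Inside a function, locals compile to LOAD_FAST and globals to LOAD_GLOBAL; only module-level (or exec'd) code still pays for the fully dynamic LOAD_NAME:

```python
import dis

def f():
    x = len      # global reference -> LOAD_GLOBAL
    return x     # local reference  -> LOAD_FAST

func_ops = {ins.opname for ins in dis.get_instructions(f)}
mod_ops = {ins.opname
           for ins in dis.get_instructions(compile("y = len", "<s>", "exec"))}

# LOAD_NAME shows up only in the module-scope code, never in f's body.
```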

> ...
>   10915 LOAD_FAST (18.52%)                 3024 LOAD_ATTR (5.13%)
>    6825 POP_TOP (11.58%)                   2615 JUMP_IF_FALSE (4.44%)
>    6704 LOAD_CONST (11.37%)                2430 JUMP_FORWARD (4.12%)
>    4236 LOAD_GLOBAL (7.19%)                2139 RETURN_VALUE (3.63%)
>    4235 CALL_FUNCTION (7.18%)              1753 COMPARE_OP (2.97%)
>    4218 STORE_FAST (7.16%)                  988 BINARY_ADD (1.68%)
>
> I'm not quite sure what to make of the relatively large number of JUMP
> instructions in the static frequencies, but the more frequently
> they occur, the smaller the size of the average basic block will be
> (which limits the range of instructions over which many optimizations
> can be made).

Welcome to the real world <sigh>.  Python actually-- and these stats show
it --has an exceptionally *low* dynamic frequency of jumps, compared to
almost anything else.  The primary cause, of course, is that it has so many
opcodes that barely do anything (POP_TOP indeed <wink>).  Reduce the number
of do-little opcodes we need to execute, and the relative frequency of
branches necessarily rises.  OTOH, Python also generates jumps to jumps, and
such chains can be collapsed with relative ease.
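Collapsing such chains is a textbook jump-threading pass.  A toy sketch, over an invented (opname, target-index) instruction list rather than real CPython bytecode:

```python
def thread_jumps(code):
    """Retarget each JUMP past any chain of jumps-to-jumps.

    `code` is a list of (opname, arg) pairs; a JUMP's arg is the
    index of its target instruction.
    """
    out = []
    for op, arg in code:
        if op == "JUMP":
            seen = set()
            # Follow the chain to its final non-jump target,
            # guarding against cycles.
            while code[arg][0] == "JUMP" and arg not in seen:
                seen.add(arg)
                arg = code[arg][1]
        out.append((op, arg))
    return out

prog = [("LOAD", 0), ("JUMP", 3), ("LOAD", 1),
        ("JUMP", 5), ("LOAD", 2), ("NOP", 0)]
# The jump at index 1 targets the jump at index 3, which targets
# index 5; after threading, index 1 jumps straight to 5.
```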

"Common wisdom" in the compiler biz is that the average size of a basic
block is about 5(!) machine instructions, including the branch that ends it.
So if you raise the dynamic frequency of Python jumps to 20%, you'd be doing
great.  BTW, Intel has produced some nice slides for Itanium summarizing
many of the ways SW and HW can cooperate to effectively increase the basic
block size (register renaming, predicated execution, speculative execution,
loop unrolling & pipelining, and branch prediction are biggies).
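For a rough feel of where a given function sits, you can count its branch opcodes statically with dis -- a crude proxy, since the interesting number is the *dynamic* frequency, and the "JUMP-in-the-name" heuristic below is an assumption about opcode naming, not a guarantee:

```python
import dis

def jump_fraction(func):
    """Static fraction of opcodes that look like branches
    (any opcode with "JUMP" in its name)."""
    ops = [ins.opname for ins in dis.get_instructions(func)]
    return sum("JUMP" in op for op in ops) / len(ops)

def example(n):
    total = 0
    for i in range(n):
        if i % 2:
            total += i
    return total

frac = jump_fraction(example)
```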

This reminds me of the definition of a supercomputer commonly heard in the
80s:  a supercomputer is a box that turns your CPU-bound problem into an
I/O-bound problem.  IOW-- and take this to heart! --no matter what you
optimize, if you do a good job, one consequence is that "the bottleneck"
moves to something you weren't even thinking about.  Or, as Gordon Bell's
First Law of supercomputer design put it, Everything Counts.

so-pick-the-cherries-and-declare-victory<wink>-ly y'rs  - tim






More information about the Python-list mailing list