[pypy-dev] LLAbstractInterpreter

Wed Dec 14 21:01:19 CET 2005

Hi all,

As you know, the JIT work has started during the sprint.  Most of the
work went into pypy/jit/llabstractinterpreter.py.  As it's still an
obscure piece of code, I will try to describe it a bit in this e-mail
in a documentation-ish way.  Sorry, it's very draftish.

-~-~-

The LLAbstractInterpreter is a tool that takes low-level flow graphs and
converts them into some more low-level flow graphs, by propagating
constants and "virtualizing" structures (more about it later).  Consider
the example of pypy/jit/tl.py.  It is a small interpreter for the "Toy
Language", a simple stack machine.  Here is how the main loop of the
interpreter looks like:

    def interp(code):    # 'code' is a bytecode string
        pc = 0           # position counter
        code_len = len(code)
        stack = []
        while pc < code_len:
            opcode = ord(code[pc])
            pc += 1
            if opcode == PUSH:
                stack.append( ... )
                pc += 1
            elif opcode == POP:
                stack.pop()
            elif ...

I let you imagine how the low-level flow graph of this function looks
like: an 'int_lt' for the loop condition, a 'getarrayitem followed by a
'cast_char_to_int' for reading the opcode, an 'int_add' for incrementing
the pc, and a lot of comparisons and branches for the long if/elif/...
part.

The LLAbstractInterpreter takes this graph as input and follows it
operation by operation.  While doing so, it regenerates a new graph,
similar to the input one, by putting there a copy of the operations it
encounters (similar to the flow object space).  For some of the input
graph's variables the LLAbstractInterpreter propagates constant values;
it performs the operations between such constants directly, instead of
putting them into the new generated graph (again, this is similar to the
flow object space).

~-~-

Consider the Toy Language example again.  We use the
LLAbstractInterpreter on the low-level graph of the interp() function
above, providing it with a constant Toy Language code string.  The
LLAbstractInterpreter must be written in a completely general way -- it
should not depend on the details of the interp() function, but just use
it as input.  (It is allowed to give a few hints to guide the process,
though.)  The goal is to obtain the following effect: 'code', 'code_len'
and 'pc' should all be constants, and operations on them should be
constant-folded.  On the first iteration through the loop, 'opcode' will
thus be a constant -- the value of the first opcode.  All the
if/elif/... branching can also be constant-folded, because the 'opcode'
is known.  All that's left to be written in the new graph is the actual
content of the selected 'if', i.e. the actual bytecode implementation.

Then 'pc+1' is a constant too, so on the second iteration through the
loop, 'opcode' will again be a constant, and the if/elif/... branching
will melt again -- with only the implementation of the second opcode
left in the new graph.  And so on, iteration after iteration.  The loop
is unrolled, in the sense that it's not present in the new graph any
more, but each unrolled version of the loop is much smaller than the
original code because all the dispatching logic disappeared and all the
opcode implementations but one are ignored.

The delicate part -- the difference with the flow space -- is that to
obtain this effect we need more precise control over what occurs when
the input graph loops back over itself, or generally when two control
flow paths are merged into the same target block.  Consider what occurs
after one iteration through the 'while' loop in interp().  The value of
'pc' is different: it was 0 at the start of the first iteration, and it
is 1 at the start of the second iteration.  In this case, the intended
goal is to unroll the loop, i.e. follow the same loop body again with a
different constant for 'pc'.

However, we don't want to unroll *all* loops of the input graph.  There
are cases where we need to ignore the different constants we get, and
generalize them to a variable, so that the graph we generate can contain
a loop similar to the original one.  For example, if the Toy Language
had a FACTORIAL bytecode, it would probably be written as:

    ...elif opcode == FACTORIAL:
           n = stack.pop()
           result = 1
           i = 2
           while i <= n:
               result *= i
               i += 1
           stack.append(result)

Here, 'i' is 2 at the start of the first iteration, 3 at the start of
the next one, then 4, and so on.  But still, unrolling this loop makes
no sense -- particularly if 'n' is unknown!

So there is some delicate adjusting to do.  For the short-term future,
this will require hints in the input low-level flow graphs (for example,
by putting calls to a dummy hint() function in the source code of
interp()).  What we have so far is two kinds of constants: "concrete"
constants and so-called "little" constants.  Only the "concrete"
constants force duplication of blocks of code; if two different "little"
constants reach the same input block, they are merged into a variable.
There are heuristics like "when an operation is constant-folded, it
produces a concrete constant if at least one of the input arguments was
already a concrete constant, and a little constant otherwise".  The
problem with that approach is that it doesn't really work :-)  We can't
seem to get the correct effect already in the case of the Toy Language.
I will propose a plan for this in the next e-mail, to avoid mixing
speculations with the presentation of what we have so far.

~-~

There are two more points to discuss about the current
LLAbstractInterpreter.  A bit of terminology first.  The generated graph
is called the "residual" graph, containing "residual operations" --
operations left over by the process.  We call "LLAbstractValue" what the
LLAbstractInterpreter propagates.  Concrete and little constants are
LLAbstractValues; so are the residual variables that will be put in the
residual graph.

A "virtual structure" is another type of LLAbstractValue.  When a
'malloc' operation is seen, we propagate for its result a "virtual
structure" instead of a variable holding the pointer.  The virtual
structure describes the constent of all the fields.  It is updated when
we see 'setfield' operations, and read from when we see 'getfield'
operations.  Each field is described by a further LLAbstractValue.  The
goal is similar to (but more powerful than) the malloc removal back-end
optimization.  Malloc removal can already cope with code such as:

    def f(n):
        x = X()
        x.n = 42
        return x.n

The 'x' instance does not have to be allocated at all in the residual
graph; we can just return 42.  But this optimization is particularly
interesting in the LLAbstractInterpreter because the loop unrolling and
general constant propagation tends to bring together pieces of code that
were originally distant.  This is what occurs in interp() with the
'stack'.  In its original source code, we have no choice but use a
memory-allocated list to propagate the values from one bytecode to the
next.  However, after constant propgatation (say, for 'code=[PUSH N,
PUSH M, ADD]'), the residual code looks like:

    stack = []
    stack.append(N)
    stack.append(M)
    stack.append(stack.pop()+stack.pop())
    return stack.pop()

Of course we would like this to become:

    return N+M

We could use malloc removal on the result, but doing it directly in the
LLAbstractInterpreter is more "natural" (and more powerful).  All we
have to do is propagate LLAbstractValues for the stack that describe its
the current state:

    stack = []          # stack: LLVirtualArray([])
    stack.append(N)     # stack: LLVirtualArray([LLVar(N)])
    stack.append(M)     # stack: LLVirtualArray([LLVar(N), LLVar(M)])
    y = stack.pop()     # stack: LLVirtualArray([LLVar(N)])        y: LLVar(M)
    x = stack.pop()     # stack: LLVirtualArray([])   x: LLVar(N)  y: LLVar(M)
    z = x + y           # residual operation 'Z=int_add(N,M)'
                        # stack: LLVirtualArray([])   z: LLVar(Z)
    stack.append(z)     # stack: LLVirtualArray([LLVar(Z)])
    res = stack.pop()   # stack: LLVirtualArray([])   res: LLVar(Z)
    return res          # 'return Z'

Note that this example is not really accurate.  In fact, both normal
constant propagation and virtualization occurs at the same time, so that
the code in the left column is not actually generated at all in the case
of the Toy Language interpreter: the real left column we would see is
the whole interp() function, with its loop unrolled.  The states on the
right column contain more constants, like the 'pc' and the 'opcode'.  So
during each iteration through the main 'while' loop, we get a new state
that differs from the previous one in more than one way: not only do the
'pc' and 'opcode' constants have different values, but the shape and
content of the 'stack' virtual array is different.  The point of doing
both at the same time is that the virtual structure or array could
contain constants that are later read out and used for more constant
propagation.  We cannot do that with a separate post-processing malloc
removal phase.

(Also note that RPython lists become actually GcStructs containing a
GcArray in the low-level graphs, so the 'stack' is actually a
LLVirtualStruct whose 'items' field is itself described as a
LLVirtualArray.)

-~-~

Phew!  A short note about the last noteworthy point of the
LLAbstractInterpreter: function calls.  Most function calls are actually
inlined.  The reason is that LLVirtualStructs and LLVirtualArrays are
difficult to track across function calls and returns.  Indeed, in code
like that:

    def f():
        lst = [1]
        do_something(lst)
        return lst[0]

it could be the case that do_something() actually modifies the list.
Propagating information about what is in the list is very difficult
across a call.  To solve this problem, we just follow the call and
continue to look at the operations in do_something(), while still
generating code in the same residual graph.  When do_something()
returns, we come back to the calling point in the input graph, and we
continue to - still - generate residual operations in the same, unique
residual graph.

A call that only passes variables is not inlined nor further analysed,
because there would be no point - no constant to propagate.  In this
case, we just write a residual call operation to the original function.
For example, if the hypothetical factorial bytecode were implemented as:

    ...elif opcode == FACTORIAL:
           n = stack.pop()
           stack.append(factorial(n))

Then the residual graph would just contain a direct_call to the
unmodified graph of the factorial() function.  This is how the
LLAbstractInterpreter can hope to cope with very large programs like
PyPy: starting from the interpreter main loop, at some point, it just
stops following calls because there is nothing more to propagate.  The
residual graph will just contain calls to the original input graphs
(i.e. to "the rest" of PyPy).

~-~-~

A bientot,

Armin