[pypy-dev] Rethinking genc with graph rewriting

Armin Rigo arigo at tunes.org
Mon Sep 27 18:10:24 CEST 2004


Hi,

I thought a bit more about how to generate the low-level translation of
RPython code.  It might indeed be possible to do it by rewriting the flow
graph until it contains mostly lower-level operations, and then translating it
to C code straightforwardly, provided we insert an intermediate optimization
phase between those two steps.  Here are some more details.

I'm not sure in which order to present the ideas.  I apologize for the lengthy
e-mail, it's still all a bit too early to write down as a formal ReST
documentation...

===Motivation===

See for example how W_ListObject is implemented in the stdobjspace (which is
RPython, too).  It stores its size in 'ob_size' and its list of items in
'ob_item'.
The naive implementation of a list in RPython is with a PyListObject*, i.e. as
a pointer to a structure containing (1) the length and (2) another pointer to
the actual items.  If we do it this way, then W_ListObject will end up being
implemented badly:

struct W_BadListObject {
    ...header and refcount...
    int ob_size;                      // for the ob_size field
    PySomeKindOfListObject* ob_item;  // for the ob_item field
};
struct PySomeKindOfListObject {
    int len;
    PyObject** items;
};

So it takes two indirections from a W_ListObject to its array of items,
although it only takes one in CPython's own lists.  Why?  Because the
'ob_item' list inside W_ListObjects could, maybe, escape the W_ListObject
instances and outlive them, or be modified from somewhere else.  If you think
about it this way then the extra indirection is needed.  But by looking
closely at the source of listobject.py (or its flow graph version), an
automatic analysis can figure out that this particular 'ob_item' field never
escapes the control of W_ListObject.  Thus we don't need the first indirection
in this case.  The PySomeKindOfListObject can be *inlined* into the structure
implementing W_ListObject:

struct W_ListObject {
    ...type and refcount headers...
    int ob_size;
    PySomeKindOfListObject ob_item;   // not a pointer any more!
};

Now, it looks a lot like CPython's PyListObject structure, with 'ob_item.len'
playing the role of the 'allocated' field of CPython.

I believe that this kind of "structure-inlining" optimization is important.  
It is something that cannot easily be done in the current genc_* files.  
That's why I started thinking along the lines I will describe below in more
detail.  I like this idea because it also works with simpler types like
integers, not just lists:  we don't have to do all the type-juggling in genc_*
any more; instead, we consider integers as PyIntObject*, and by the same
inlining mechanisms replace them with PyIntObject only -- even better,
PyIntObject without the type and refcount headers.  What remains?  The ob_ival
field only.  In other words, by inlining a PyIntObject* declaration we get a
structure with only one C "long".  (From there it is easy to actually get rid
of the struct and replace it with its single field.)
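The chain of reductions just described (inline the pointed-to struct into its parent, then collapse a single-field struct to its field) can be sketched as a toy Python pass.  This is not PyPy code; the struct-layout representation and both function names are made up for illustration:

```python
# Struct layouts are modelled as {field_name: type} dicts;
# a "*Name" type means a pointer to struct Name.
STRUCTS = {
    "PyIntObject": {"ob_ival": "long"},   # type/refcount headers already dropped
}

def inline_pointer_fields(layout):
    """Replace '*Name' pointer fields with the target struct inlined in-place."""
    return {name: STRUCTS[t[1:]] if t.startswith("*") else t
            for name, t in layout.items()}

def collapse_single_field(layout):
    """A struct with exactly one field is replaced by that field's type."""
    while isinstance(layout, dict) and len(layout) == 1:
        layout = next(iter(layout.values()))
    return layout

w = {"x": "*PyIntObject"}
w = inline_pointer_fields(w)                          # {'x': {'ob_ival': 'long'}}
w = {k: collapse_single_field(v) for k, v in w.items()}
print(w)                                              # {'x': 'long'}
```

A PyIntObject* field ends up as a bare C "long", exactly the end point described above.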


===In more details===

Let's say we start with a flow graph.  For this example, let's consider the
flow graph generated for the function 'def f(x): return x+1'.


block1(x):
  add(x, 1) -> y
  goto return_block(y)
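As a minimal, hypothetical model of this flow graph, a block can be represented as a list of (opname, args, result) operations; interpreting the block for f(x) = x + 1 then looks like this (the representation is mine, not the actual pypy.objspace.flow one):

```python
# block1 from the example: add(x, 1) -> y
block1 = [("add", ("x", 1), "y")]

def run_block(block, env):
    """Interpret each operation; constants pass through env.get unchanged."""
    for opname, args, result in block:
        values = [env.get(a, a) for a in args]
        if opname == "add":
            env[result] = values[0] + values[1]
    return env

env = run_block(block1, {"x": 41})
print(env["y"])   # 42
```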


The annotation code as it is today will infer that y is a SomeInteger() if we
tell it that x is a SomeInteger().  Moreover, the '1' is also a SomeInteger()
with the attribute 'const' set to 1.
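A sketch of that inference, loosely modelled on the annotation code (the class and function here are simplified stand-ins, not the real pypy.annotation interface):

```python
class SomeInteger:
    def __init__(self, const=None):
        self.const = const   # set for literals like '1', None for variables

def annotate_block(block, bindings):
    """Walk (opname, args, result) operations and bind result annotations."""
    for opname, args, result in block:
        s_args = [SomeInteger(const=a) if isinstance(a, int) else bindings[a]
                  for a in args]
        # the rule for 'add': int + int -> int
        if opname == "add" and all(isinstance(s, SomeInteger) for s in s_args):
            bindings[result] = SomeInteger()
    return bindings

b = annotate_block([("add", ("x", 1), "y")], {"x": SomeInteger()})
print(type(b["y"]).__name__)   # SomeInteger
```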

What I propose is that we add a set of rewrite rules, which could be freely
applied.  The rules say essentially that whenever we see a given operation
with the given annotations, it can be rewritten into one (or possibly several)  
other operations.  For example, a long rule could be:


method_extend(lst1: SomeList, lst2: SomeList) -> result: SomeList
-----------------------------------------------------------------
  len(lst2) -> c
  growlist(lst1, c)
  goto loop[i=0]
loop:
  lt(i, c) -> cond
  switch cond:
    case False: goto sequel
    case True:  goto body
body:
  getitem(lst2, i) -> x
  fastappend(lst1, x)
  add(i, 1) -> i1
  goto loop[i=i1]
sequel:


Maybe rules should be put in a separate file, not in Python syntax, and parsed
by the rule-applying code.  I'm not sure if the above pseudo-syntax would do,
but
maybe something better along these lines would work.  Anyway, there would also
be simpler rules like:


add(int1: SomeInteger, int2: SomeInteger) -> result: SomeInteger
----------------------------------------------------------------
add_i(int1.ob_ival, int2.ob_ival) -> result.ob_ival


Note the reference to the field ob_ival of PyIntObjects.  The above line says
that the operation 'add' (which is PyNumber_Add()), although fine for any kind
of object, could be optimized if we knew that all three involved objects
follow the "structure" of SomeInteger.  In this case we can perform it using
the new operation 'add_i' on the ob_ival fields of the structures.  Once more
the intended meaning of the rule is: the operation above the line is perfectly
valid on its own -- it would produce a call to PyNumber_Add() in C -- but for
efficiency it can be replaced with the operation below the line.
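The rule-applying pass itself could be as simple as a table lookup keyed on the operation name and the argument annotations; operations with no matching rule are kept as-is and fall back to PyNumber_Add() in the generated C.  A hedged sketch (table format and names are assumptions of mine):

```python
# Each rule maps (opname, argument annotation names) to the list of
# lower-level operations replacing it.  Only the add -> add_i rule here.
RULES = {
    ("add", ("SomeInteger", "SomeInteger")):
        lambda args, result: [("add_i", args, result)],
}

def rewrite_block(block, annotation_of):
    out = []
    for opname, args, result in block:
        key = (opname, tuple(annotation_of(a) for a in args))
        if key in RULES:
            out.extend(RULES[key](args, result))
        else:
            out.append((opname, args, result))   # keep: becomes PyNumber_Add()
    return out

ann = lambda a: "SomeInteger"     # pretend everything is annotated as an integer
print(rewrite_block([("add", ("x", 1), "y")], ann))
# [('add_i', ('x', 1), 'y')]
```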

The idea is thus to start from the basic C code generator that just writes
PyNumber_Add() and similar calls for any operation, and gradually move to
"inlined" operation like add_i which correspond just to "+" in C.  If we only
do that, the C code would still manipulate heap-built objects only, i.e.  
PyIntObjects;  we would just have replaced a call to PyNumber_Add() with
inlined code like:

   result = ...create the PyIntObject...;
   ((PyIntObject*) result)->ob_ival =
       ((PyIntObject*) int1)->ob_ival +
       ((PyIntObject*) int2)->ob_ival;

The idea is then to use structure inlining to detect and get rid of the
PyIntObject (and other heap structures).  This requires some kind of
whole-program analysis.  This analysis could be done as an intermediate phase,
between the rule-rewriting and the C generation phase.  It would look at how
each field of each object is used, and based on this deduce how each structure
is best implemented.  What I'm thinking about is something like this:

* look if an object's structure is mutable or not, i.e. if we ever write to
  its fields.  (This should distinguish initialization from later rewrites.)

* look if an object is shared or not: if references to an object don't escape
  too far into various parts of the program then we don't need to refcount it.
  More precisely, we might be able to assign to the object a "parent" object,
  a container which is guaranteed to exist at least as long as its child.
  Thus the "parent" has got the only reference (no refcount needed) and others
  can borrow it.  The parent can be the frame of a function, too, for objects
  that never outlive the function in which they were created.
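The sharing analysis of the second point can be sketched as a single walk over the flow-graph operations: a variable "escapes" if it is stored into another object or passed to an unknown call, otherwise its parent can be the creating frame.  The operation names below ('setfield', 'call_unknown') are made up for illustration:

```python
def escaping_vars(block):
    """Return the set of variables whose references escape the frame."""
    escaped = set()
    for opname, args, result in block:
        if opname == "setfield":          # setfield(obj, fieldname, value)
            escaped.add(args[2])          # value is now reachable from obj
        elif opname == "call_unknown":    # conservatively: all args escape
            escaped.update(a for a in args if isinstance(a, str))
    return escaped

block = [
    ("newlist", (), "lst"),
    ("setfield", ("w_list", "ob_item", "lst"), None),
    ("add", ("i", 1), "i1"),
]
print(escaping_vars(block))   # {'lst'}
```

Here 'lst' escapes into 'w_list', so 'w_list' becomes its parent; 'i1' never escapes, so its parent is the frame and no refcount is needed.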

The point is that both non-shared and shared but immutable objects can be
inlined into their parent, so that we can get rid of the heap allocation and
the access indirection.  For example, as PyIntObjects are all immutable, they
can all be inlined: any "PyObject*" field or variable known to point to a
PyIntObject can be replaced with a headerless PyIntObject structure in-place,
or even directly with its single "long" field.

For the above example we'd get code like this:

   result.ob_ival = int1.ob_ival + int2.ob_ival;

Or after inlining the single field:

   result__ob_ival = int1__ob_ival + int2__ob_ival;

Apart from the long variable names (ugh! __ob_ival after each RPython variable
containing an integer :-), this is good!


Sorry for the long message; I hope I could get some of my motivations
through.

Armin
