[pypy-dev] List dormant?

Armin Rigo arigo at tunes.org
Sat Sep 25 12:57:53 CEST 2004


Hi Terry, hi Holger, hi everyone else,

On Fri, Sep 24, 2004 at 09:25:17PM +0200, holger krekel wrote:
> he is happy to answer questions or talk about the state
> of how this is going if someone asks him.  Hum, hey 
> Armin, what is the current state of the C backend? :-) 

Hum, it is a piece of experimental code that has grown quite large.  Here is a
large e-mail to explain it and why I'd rather like it to be smaller.

It produces a C extension module.  The functions' body are, literally, basic
blocks in the C sense, with labels, and jumps to each other, following
directly the control flow graph's structure.  Additionally, the end of each
function contains code that decrefs the variables that need to be decrefed in
case of error; each operation that can fail will, in case of failure, jump to
some label there.

The kind of objects supported is enumerated in genc_repr.py: ints, generic
PyObjects, tuples, classes, instances, lists, function pointers, method
pointers.  It all works reasonably well.  For example, instances can have both
instance and (read-only) class attributes.  Methods are just a particular case
of class attributes, as in regular Python.  (Class attributes are difficult to
do in Pyrex; it was one of the motivations to switch to C.)

But at the same time I am not fully satisfied with the C backend.  I have both
small and big concerns.

The biggest concern is about its overall structure.  It is really like a
classical compiler, taking graph blocks, analysing the operations inside,
using the annotations computed previously as a guide; then it produces
lower-level operations with explicit conversions, and finally a separate pass
turns these into real C code.

Here is where all this occurs:

* typer.py          turns the block's SpaceOperations into low-level ops
* genc_typeset.py   defines which low-level ops exist
* genc_op.py        classes to write the C code for each low-level op
* genc_repr.py      maps block's Variables to C types
* genc.py           calls all other modules; writes the C module
* classtyper.py     maps user-defined classes to structs

Additionally, genc.h is inserted verbatim at the beginning of the C module; it
contains macros to do the common operations.  These macro definitions are also
parsed (!) by genc_typeset.py, so that typer.py can know about their
existence.

In some sense it is fragile: it depends on the annotations being generated
exactly as expected.  The (earlier) annotation phase analyses the
SpaceOperations and somehow promizes that there is a way to do the given
operation and guarantee that it produces some results; e.g. an 'add' when
applied to two SomeInteger()s gives another SomeInteger().  But then
genc_typeset.py or genc.h must make sure that it actually implements an
operation with that signature.  There is some duplication there.

For example the backend currently expects that a list object is never
converted, i.e. if an object is created as a "list of ints" then it will
remain a "list of ints" for its whole life.  The annotations currently
produced have this property, but it's kind of accidental.  For the C backend
itself it is an implicit external assumption.  We could devise by hand
annotations that crash the C backend although they are a priori reasonable.

Well, another concern is that typer.py is quite confusing and was difficult to
get right.  I'm not too sure that the code that generates the Py_DECREF()
after an error will decref all the correct variables in all cases.

Something I don't even dare speak about too much is the way a Variable in the
flow graphs maps to potentially a list of unrelated C variables (or fields in
a struct).  For example, a Variable annotated as a SomeTuple() of two integers
will become two distinct C variables.  And Variables that are sufficiently
constant become zero C variables!  This is a nice idea but it makes typer.py,
genc_typeset.py and genc_op.py all the more obscure.

Of course, I'm always thinking about reorganizing it all...  The latest such
idea was inspired by Seo who, for the Lisp backend, put in transform.py some
code that actually modifies the control flow graph before it is passed to the
backend.  He replaces the 'newlist' and 'mul' operations corresponding to an
expression '[a] * b' with a single custom operation, 'alloc_and_set'.  After a
lot of unsuccessful efforts in implementing 'list += list' in the C backend I
remembered this idea and now transform.py turns a list-based 'inplace_add'
into a whole new bunch of control flow blocks:

#    a = inplace_add(b, c)
# becomes the following graph:
#
#  clen = len(c)
#  growlist(b, clen)     # ensure there is enough space for clen new items
#        |
#        |  (pass all variables to next block, plus i=0)
#        V
#  ,--> z = lt(i, clen)
#  |    exitswitch(z):
#  |     |          |        False
#  |     | True     `------------------>  ...sequel...
#  |     V
#  |    x = getitem(c, i)
#  |    fastappend(b, x)
#  |    i1 = add(i, 1)
#  |     |
#  `-----'  (pass all variables, with i=i1)

So now the backend only has to worry about the two simple operations
'growlist' and 'fastappend'.  The latter is an append where we can assume that
there is already enough preallocated space.

The graph transformation code itself is quite verbose, but with more developed
utility routines it could be made simpler.  The point is that it looks like a
good idea to perform as many optimizations and transformations on the flow
graph itself before passing it to the backend.  (With hindsight it's obvious.)  
So I'm now thinking about how more of the typer.py mess could be moved there.

One extreme idea would be to say that the flow graph should be transformed
much more, step by step, until it eventually contains only operations that
have an obvious direct C equivalent.  This would make the C backend much
simpler again (and also make simple non-C backends fun and quick to
implement).  This would include typing the operations: the 'add' would be
reserved for 'add two PyObject*'; another 'add_i' would add two integers.  
Conversion operations would be inserted in the flow graph as needed.

So typing would be more tightly coupled with the annotation phase, which I
think is a good idea.  Essentially, the same code that says that adding two
SomeIntegers() produces a SomeInteger() would say that to do so the correct
operation is 'add_i'.  And the code that says that by default 'add' produces a
SomeObject() would say that this requires the two inputs to be converted to
PyObject*.


That's all for now...


A bientot,

Armin



More information about the Pypy-dev mailing list