[pypy-dev] Next step: gen???.py

Wed Mar 30 19:25:50 CEST 2005

Hi there,

This is a status report and a "what to do next" summary.

The status is easily summarized as follows: the PyPy interpreter is quite
complete and highly compliant with CPython, minus a number of dark corners
that are only likely to bite user programs using the most introspective
features of the standard library (most notably pickling).  The flow graph and
annotation subsystems work quite nicely too; we can successfully annotate more
or less all of PyPy, or at least we are getting close to this goal.

The "next" step is low-level code generation.  Here, we have a rather large
number of prototypes all around.  The most complete one is genc.py, which is
able to produce code for roughly the complete PyPy interpreter, but without
using the annotations -- i.e. it is very slow code, and it is essentially
dependent on CPython.  Approaching the same goal but from the opposite
direction, we have genllvm.py, which is only able to translate fully annotated
graphs and doesn't have any means of fall-back for things like faked objects
which cannot be annotated.  Finally there are other very incomplete or
deprecated or planning-only back-ends: gencl, genpyrex, genjava.  Not to
mention geninterplevel, whose goal is different again.

The question is which line of work to focus on right now.  All of these
back-ends are interesting and worthwhile in the long run but we need to select
a first one.  There are basically 4 reasonable options:

* Enhance genc.py.  This is a step-by-step process, and any intermediate
version can still be tested against the whole PyPy and against the snippet
examples.  Another advantage of genc is that it is the only option that
doesn't depend on any external tool other than a C compiler.

* Enhance genc.py using the C++ facility of function overloading for
simplicity (basically, we would generate "z=add(x,y)" in the file and let the
C++ compiler decide which version of add() to call based on the declared types
of x and y).  This might well be the easiest solution.  A minor drawback is
that it requires a C++ compiler.  A possibly larger drawback is that C++
compilation might take considerably longer, even for similar-looking code.
(Having to know C++ in the first place shouldn't be that big a drawback if we
don't use fancy C++ features.)

* Go for genllvm.py.  An obvious drawback is that we'd all have to install
LLVM.  The problem with genllvm right now is that it cannot make sense of
unannotated code (or code containing the SomeObject annotation).  We don't
know yet for sure the quantity of such SomeObjects in the annotated PyPy
source code, but a guess is that they occur mainly for "fake" stuff (file,
long, unicode...).  If so, there is one way around this problem.  Carl pointed
out that it *might* be easy to link the LLVM compiler output with CPython,
possibly making a C extension module for CPython.  If so, then we would add in
genllvm support for "black box" PyObject pointers, and use a few functions
from the CPython C-API to manipulate them.  The goal here would be to modify
the source code of the interpreter/module/objspace to reduce the number of
operations that need to be performed on these "black boxes".  For example, we
could possibly reduce all these operations to method calls.  In other words,
we would say that using CPython objects like "long" at interp-level is
temporarily OK but only if they are manipulated via method calls.  This would
make the genllvm support for them much easier.

* genjava.py could be another option.  Java has a simpler type system, which
matches ours quite well, but genjava doesn't exist yet at all (the one in the
java/ subdirectory had a different goal in mind).  We get memory management
for free.  If we add the requirement to compile with GCJ it could be easy to
make a CPython extension module too, with the same problems and solutions
about SomeObject as above.
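
To make the second option (C++ overloading) a bit more concrete, here is a
minimal sketch of the idea, not actual genc output.  The function names
(generated_int_func and friends, check_examples) and the choice of long, double
and std::string as low-level types are assumptions made up for illustration.
The point is only that the generated statement "z = add(x, y)" is textually the
same in every function, and the C++ compiler picks the version of add() from
the declared types of x and y.

```cpp
#include <cassert>
#include <string>

// Hypothetical hand-written support code: one add() per low-level type.
inline long add(long x, long y) { return x + y; }
inline double add(double x, double y) { return x + y; }
inline std::string add(const std::string& x, const std::string& y) {
    return x + y;
}

// What a generator might emit for a graph whose variables are annotated
// as integers.  Only the declarations depend on the annotations.
long generated_int_func(long x, long y) {
    long z;
    z = add(x, y);   // resolves to add(long, long)
    return z;
}

// Same operation, variables annotated as floats.
double generated_float_func(double x, double y) {
    double z;
    z = add(x, y);   // resolves to add(double, double)
    return z;
}

// Same operation, variables annotated as strings.
std::string generated_str_func(const std::string& x, const std::string& y) {
    std::string z;
    z = add(x, y);   // resolves to the std::string overload
    return z;
}

// Small self-check of the three generated functions.
void check_examples() {
    assert(generated_int_func(2, 3) == 5);
    assert(generated_float_func(1.5, 2.25) == 3.75);
    assert(generated_str_func("ab", "cd") == "abcd");
}
```

With this approach the generator would only need to get the variable
declarations right; the per-type dispatch is left to the compiler, which is
also where the extra compilation time mentioned above would come from.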

All in all, all 4 options are equally possible.  If I have to pick one, I guess
the first 2 options pass the additional criterion of "very good confidence that
it will work within a couple of months".  This is definitely biased by the
fact that I'm not fluent in Java and know very little about LLVM, but also by
my lack of knowledge about the ease of installation and integration of the
corresponding tools.

Then to pick one of the first two options, the second one (allowing some C++
facilities to sneak in) is my favourite.

Comments and feed-back are welcome!

Armin