[pypy-dev] From PyPy to Psyco
arigo at tunes.org
Thu Jan 16 16:46:45 CET 2003
Ok, sorry about not posting earlier on this list. Sorry too for the long
e-mail; I wanted to reply to individual e-mails, but it seems that the
subjects always overlap. So I'm just dropping my thoughts here.
Bootstrapping issues were quite widely discussed. There are several
valid approaches in my opinion. I would say that we should currently
stick to "Python in Python"'s first goal: to have a Python interpreter up
and running, written entirely in Python.
(1) write a Python interpreter in Python, keeping (2) in mind, using any
recent version of CPython to test it. Include at least the bytecode
interpreter and redefinition of the basic data structures (tuple, lists,
integers, frames...) as classes. Optionally add a
tokenizer-parser-compiler to generate bytecode from source (for the first
tests, using the underlying compile() function is fine).
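As a rough sketch of what the heart of (1) looks like (the tiny opcode set and the `run` function here are invented for illustration; the real interpreter would of course mirror CPython's bytecode set):

```python
# A toy stack-based bytecode interpreter, illustrating the structure
# of step (1).  The opcodes are made up for this example.

LOAD_CONST, BINARY_ADD, RETURN_VALUE = range(3)

def run(code, consts):
    """Interpret a list of (opcode, arg) pairs with a value stack."""
    stack = []
    for opcode, arg in code:
        if opcode == LOAD_CONST:
            stack.append(consts[arg])
        elif opcode == BINARY_ADD:
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif opcode == RETURN_VALUE:
            return stack.pop()
    raise RuntimeError("fell off the end of the bytecode")

# computes 2 + 3
result = run([(LOAD_CONST, 0), (LOAD_CONST, 1),
              (BINARY_ADD, None), (RETURN_VALUE, None)],
             consts=[2, 3])
```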
Only when this is done can we invent nice ways to do interesting things on
this interpreter. The approach I prefer is:
(2) have a tool that can perform a static analysis of the core of (1).
To make this possible this core has to be written using some restricted
coding style, using simple constructs (and not full of Python tricks and
optimizations; something like Rocco describes).
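To give an idea of the kind of restriction meant here (the exact rules are still to be defined; both functions below are invented examples), compare a one-liner full of dynamic tricks with a version a static analyser could handle:

```python
# Unrestricted Python: hard to analyse statically -- the types flowing
# through the generator expression and the dynamic getattr are opaque.
def sum_lengths_tricky(items):
    return sum(len(getattr(x, 'data', x)) for x in items)

# Restricted style: explicit loop, one static type per variable,
# simple constructs only.  A tool like the one in (2) can infer that
# 'total' and 'i' are always integers.
def sum_lengths_restricted(items):
    total = 0
    i = 0
    while i < len(items):
        item = items[i]
        total = total + len(item)
        i = i + 1
    return total
```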
(3) this tool compiles the core (via C code) into several different things:
(3a) the classical approach (like e.g. in Smalltalk) is to emit C code
that is very similar to the original Python code, much like Pyrex does.
We obtain a stand-alone minimal interpreter. By interpreting the rest of
(1), we get an already nice result: a stand-alone complete Python
interpreter with a small amount of C -- and, more interestingly, almost
all this C code was itself automatically generated from Python code. To
bootstrap the tokenizer-parser-compiler written in unrestricted Python,
simply provide a bytecode-precompiled version of it along with the C file.
Keep in mind that both the C files and the precompiled .pyc files are
intermediate files, automatically regenerated as often as needed, and only
distributed for bootstrapping purposes if we want to be independent from CPython.
(3b) among the other things that can be generated from (2) is a bytecode
checker, which CPython currently lacks and Guido sometimes thinks would be
a nice addition to prevent 'new.code()' from crashing the interpreter.
(I already experimented with this, it can work.) (Note that we are not
forced a priori to choose the same set of bytecodes as CPython, but doing
so is probably a good idea at this stage.)
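A bytecode checker essentially simulates the stack effect of each instruction and rejects sequences that would underflow the stack. A toy version (the opcode table below is a made-up fragment, not CPython's real stack-effect table) could look like:

```python
# Toy bytecode checker: verify that a sequence of opcodes never pops
# from an empty stack and ends with exactly one value produced.
STACK_EFFECT = {
    'LOAD_CONST': +1,
    'BINARY_ADD': -1,   # pops two, pushes one
    'RETURN_VALUE': -1,
}

def check(opcodes):
    depth = 0
    for op in opcodes:
        if op not in STACK_EFFECT:
            return False
        # make sure the operands being popped are actually there
        if op == 'BINARY_ADD' and depth < 2:
            return False
        if op == 'RETURN_VALUE' and depth < 1:
            return False
        depth += STACK_EFFECT[op]
    return depth == 0

ok = check(['LOAD_CONST', 'LOAD_CONST', 'BINARY_ADD', 'RETURN_VALUE'])
bad = check(['BINARY_ADD'])   # pops with nothing on the stack
```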
(3c) Psyco stuff: given a few extra hints, the code generated from (2)
can be Psyco itself. I mean, I am not thinking about introducing the
current C-based Psyco into the play to execute (1) faster. This would at
best give the same performance as CPython --- and I seriously doubt it can
be as fast as that, given that Psyco's limited back-end cannot compete
with serious C compilers. Instead, we can generate C code which
constitutes a new Psyco for the language that (1) is an interpreter for.
In short, we would have translated our interpreter into a specializing
compiler. What is nice, of course, is that the same would also work if
(1) were an interpreter for a different language. This is not magic;
specialization is known to be a tool that can translate any interpreter
into a compiler. What's new here is the dynamic nature of the choice of
what is compile-time and what is run-time.
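The idea that specializing an interpreter with respect to a program yields a compiled version of that program (the first Futamura projection) can be shown in miniature with closures. This is a hand-written toy, nothing like the code (3c) would actually generate:

```python
# Toy illustration of interpreter specialization.  'specialize' walks
# the program once, at "compile time", and returns a closure in which
# all the interpretive dispatch on opcodes is gone.
def specialize(program):
    """program is a list of ('add', n) / ('mul', n) instructions."""
    steps = []
    for op, n in program:
        if op == 'add':
            steps.append(lambda x, n=n: x + n)
        elif op == 'mul':
            steps.append(lambda x, n=n: x * n)
        else:
            raise ValueError(op)

    def compiled(x):
        # no opcode dispatch left: just run the pre-selected steps
        for step in steps:
            x = step(x)
        return x
    return compiled

double_plus_one = specialize([('mul', 2), ('add', 1)])
```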
For (3b) and (3c) I am thinking about emitting C code, but this is not a
requirement: it would be possible to emit Python code too, for example to
build a Python-based bytecode checker. But if we did the static type
analysis in (2), we might just as well emit C code to keep the discovered
static type declarations.
* Platform-specific bootstrapping tools. Looks like I favour the idea
that everything is managed by the tool in (3a) (which is itself in Python,
of course). This tool could emit platform-specific C code when possible,
or (for redistribution) a generic low-quality platform-independent version
that would suffice to run (3a) again. As the C code is not really meant
to be compiled more than once we can avoid 'make' tools --- for all I care
it could even be a single large C file, as it is not manually managed.
* Representation of data structures. Use Python classes, e.g. integers
are implemented using a "class PyIntObject" with an ob_ival attribute (or
property). These classes are straightforwardly translated into a C struct
by (3a). The structure can be made compatible with CPython's PyIntObject,
or alternatively can be built for a GC-based interpreter with no reference
counting, if we wanted to try such a thing.
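For instance (the attribute name `ob_ival` is modelled on CPython's PyIntObject struct field; the methods are only illustrative):

```python
# Sketch of representing interpreter-level integers as a plain Python
# class, as described above.  A tool like (3a) would translate this
# straightforwardly into a C struct.
class PyIntObject:
    def __init__(self, ival):
        self.ob_ival = ival

    def add(self, other):
        # interpreter-level addition, returning a new wrapped integer
        return PyIntObject(self.ob_ival + other.ob_ival)

    def repr(self):
        return str(self.ob_ival)

three = PyIntObject(1).add(PyIntObject(2))
```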
* Foreign Function Interface (a.k.a. calling external C functions). Two
approaches here. I believe that in a first stage it is sufficient to emit
the real calls in our C code, by translating some static Python
declaration with (3a). All such callable functions must be pre-declared
to get compiled into the stand-alone core interpreter. This doesn't give
us dynamic call features like those offered by some existing CPython C
extensions. But a form of dynamic calls is necessary for Psyco --- it
must at the very least be able to emit machine code that calls arbitrary C
functions. So I'm trying to push this whole issue into the context of
emitting machine code by Psyco: when this works it should not be a problem
to expose some of the techniques in a built-in extension module to let end
users call arbitrary C functions. Before that I don't think there is a
need for enhanced 'struct', 'calldll' and 'ctypes' modules.
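What such a static declaration could look like (the `declare_c_function` helper and the signature notation are invented for this example; the real form is still to be designed):

```python
# Sketch of static declarations that a translator like (3a) could turn
# into real prototypes and call sites in the emitted C code.
C_DECLARATIONS = {}

def declare_c_function(name, argtypes, restype):
    """Record an external C function so the translator can emit the
    corresponding prototype and a direct call in the generated C."""
    C_DECLARATIONS[name] = (argtypes, restype)

# every callable external function is pre-declared, as described above
declare_c_function('time', argtypes=['long*'], restype='long')
declare_c_function('getenv', argtypes=['char*'], restype='char*')
```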
* Psyco as a compiler. If no C compiler is available, Psyco can be used
to emit statically-known code; I guess we can bootstrap a whole Python
interpreter without even leaving CPython, just by having Psyco's back-end
do the (3a) or even (3c). That's nice, although I cannot think of a real
use for this :-)
* CPython. The above plan only uses CPython for its ability to run Python
programs and for the hopefully shrinking number of features that are not
re-implemented in (1). The CPython source code is used for inspiration.
* Python Virtual Machine. In the above plan there is no need for a small
Python VM in C for bootstrapping.
* Assembly Virtual Machine. Something else that Christian mentioned was
a low-level VM that would provide a cool target for emitted machine code.
For all static stuff I see C as a nicer low-level target, but for Psyco it
might be interesting to have a general platform-independent target.
Another cool Psyco thing would be to never actually emit code in a first
phase, but only gather statistics that would later let a specialized C
version of parts of the program be written and compiled statically.
There are so many cool things to experiment with, I can't wait to have (1) and
(2) ready --- but I guess it's the same for all of us :-)