Python Compiler

Eric Jacobs none at none
Mon May 1 00:46:44 EDT 2000


In article <8F26DEAB4joshtinamecom at 209.208.0.7>, josht at iname.com (Josh
Tompkins) wrote:
> This is maybe a dumb question, but here goes: (donning asbestos
> underwear)
> 
> What would be the difficulty of writing a real python compiler?  Or
> maybe a compiler/interpreter similar to the one in VB?  Leaving
> platform independence and other similar concerns aside, what are the
> technical hurdles?  Is it hard to write a compiler?  Given proper
> pointers to tutorials and references I'd give it a shot, if it wasn't
> above my (low) level of programming skill...

No need for asbestos here. That's a very valid question, one that I have
been thinking about some.

There are two main approaches to compiling Python to machine code:

A) Work through the bytecode as the interpreter does, compiling each
bytecode instruction to the library function that the interpreter would
call. For example, BINARY_ADD would become a PyObject_Add() call.

JPython uses a similar technique to compile to Java bytecode. However,
because all variables are completely polymorphic (i.e., nothing is known
about their types), even the simplest operations end up going through the
abstraction mechanism. That BINARY_ADD, for example, might still have to
allocate a new integer object and deallocate the old ones, even where a
simple machine "ADD 1 TO REGISTER" instruction would do.

The result is that the program is in machine code, but it still runs like
it's in an interpreter. Cutting out the fetch-decode-dispatch sequence is
really only the tip of the iceberg.
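
To see the instruction stream such a compiler would be translating one
opcode at a time, the standard dis module disassembles a function into its
bytecode. A sketch (the exact opcode names, like BINARY_ADD, vary across
CPython versions):

```python
import dis

def add(a, b):
    # Under approach A, each bytecode below would be compiled to a
    # call into the interpreter's runtime library -- the binary-add
    # instruction becomes a PyObject_Add()-style call, still fully
    # polymorphic over the types of a and b.
    return a + b

dis.dis(add)  # LOAD a, LOAD b, a binary-add opcode, RETURN
```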


B) Implement a fully optimizing compiler that does global data flow
analysis and type inference to determine how the code can be improved.
Candidate optimizations include: basic type replacement (for example,
using an integer instead of a pointer to an integer object);
virtualization elimination (for example, calling PyList_Append and
PyList_GET_ITEM instead of PyObject_CallMethod and PyObject_GetItem);
traditional techniques such as common subexpression elimination, code
motion, constant folding, and dead code elimination; and more.
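
Constant folding, at least, is cheap enough to do even at the bytecode
level. A sketch (assuming a CPython recent enough that its peephole
optimizer folds constant expressions):

```python
import dis

# Constant folding in action: the expression 2 * 3 is reduced to
# the single constant 6 at compile time, so no multiplication
# survives into the bytecode.
code = compile("2 * 3", "<example>", "eval")
dis.dis(code)               # a lone LOAD_CONST of 6
assert 6 in code.co_consts  # the folded result is stored directly
```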

Ideally, this would make the compiled form look more like what someone
might write if they were translating the Python to C by hand. However, the
primary goal of any compiler is to produce correct code, and this often
means optimizing conservatively; that is, using a slower but correct
transformation whenever a faster one can't be proven 100% correct.

Unfortunately, the myriad ways that Python's dynamic mechanisms can be
used (and abused) to achieve strange and wonderful things can really work
against compile-time optimization. What we certainly would not want is to
build a fancy optimizing compiler and, upon feeding it a non-trivial
program, have it proclaim, correctly, that none of its optimizations can
safely be applied.

Here's an example to give you an idea of the depth of the problem:

     while i < len(x):
         ...

The expression len(x) looks like a candidate for code motion. That is, we
would like to hoist the expression above the loop, computing it once
instead of redundantly computing the same value on every iteration. In
order to do that, we must prove 1) that len(x) computes the same value
every time it is evaluated with the same arguments (it is purely
functional), and 2) that evaluating len(x) causes no change in the value
of any other expression (it is side-effect free).
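
Concretely, the transformation we'd like to make is the following (a
hand-applied sketch; the tuple x and the loop body are stand-ins of my
own):

```python
# Original: len(x) is recomputed on every iteration.
x = (10, 20, 30)
i = 0
total = 0
while i < len(x):
    total += x[i]
    i += 1

# Hoisted form the compiler would like to emit -- valid only if
# len(x) is provably purely functional and side-effect free here:
n = len(x)          # computed once, moved out of the loop
i = 0
total2 = 0
while i < n:
    total2 += x[i]
    i += 1

assert total == total2 == 60
```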

First, we must check the body of the loop to make sure that no
assignments are made to x, so that we know x always refers to the same
object. If that check fails, we can't optimize.

len(x) is purely functional with respect to x's value. But we are only
given x's reference, not its value. If we can infer that x is a tuple, for
example, that's a huge leap, because we know the reference always refers
to an object with the same value (immutability). If x is a list, and
x.append() is called in the loop, we're out. If x is something else, and
some method x.foo() is called, what do we do?

Even that's not safe. We're assuming we know the behavior of len(x); but
in Python, that can easily be changed. A class might be defined with the
method:

            def __len__(self):
                self.silly = self.silly + 1
                return self.silly

which would obviously make len(x) non-functional in the general case. So
we would also need to infer that x does not refer to an object defining a
__len__ method that incorporates or causes side effects.
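
Fleshed out into a runnable form (the class name and starting value are my
own invention), the point is easy to demonstrate:

```python
class Silly:
    # A __len__ with a side effect: each call returns a different
    # value, so len(x) is not purely functional and cannot be
    # hoisted out of a loop.
    def __init__(self):
        self.silly = 0

    def __len__(self):
        self.silly = self.silly + 1
        return self.silly

x = Silly()
assert len(x) == 1
assert len(x) == 2   # same argument, different result
```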

But even that's not enough, because in Python the name len itself is
late-bound. Thus we do not even know which function we're calling! Someone
might come along and bind a global variable called "len" to whatever
they'd like, at any time.

So in addition, we'd have to track not only assignments to global
variables within the module, but also any statement in any module of the
form a.len = c or setattr(a, "len", c), verifying that a does not refer to
our module or to the builtins module. Assuming that setattr itself refers
to what we think it does, of course. And on top of that, any statement of
the form a[b] = c would have to be checked to verify that a does not refer
to the _dictionary_ of our module or of the builtins module!

At some point, an optimizing compiler will be unable to realistically keep
track of all of these complexities and will be forced to discard some
information, in a conservative manner. The big question is how much
information can be discarded while still ending up with reasonably
efficient machine code.


