[pypy-dev] How to translate 300000 lines of C

Christian Tismer tismer at tismer.com
Mon Jan 20 03:27:31 CET 2003


Dear list,

I already announced some concern in a recent message.

[Edward, I need you for this, at least for advice!]

Part One: Making you frightened about the code size
---------------------------------------------------

Running the following command over the current Python CVS
src/dist directory:

  wc $(find . -name '*.c' -or -name '*.h')

gives this result today (January 20, 2003, 2:31 (GMT+01:00)):

  319282 1132750 9397985 total

Ok, this is about everything in the core distribution, whether
it is needed for Minimal Python (whatever that is) or not.
Let's roughly shrink it down to 150.000 lines.

This is 150.000 lines of well-written, tested, evolved,
really good C code.

Now, a crowd of maybe 5-10 people is going to meet in a
sprint by end of February, trying to translate a relevant
amount of this mountain of code into Python?
Really working Python? Won't they get bored?

I can do 1000 to 3000 lines per day, when re-coding into
Python as a prototype. With debugging and code quality
stuff, I'm down to 500 or less. Let's assume we have 10
people of comparable caliber.
Given that we work for 10 days full-time, nobody being
ill, not accounting for the parties we probably will have,
everything being perfect, then we *might* have 50.000 lines
of quality code done in that period.
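
As a quick check of that arithmetic, here is a small sketch. The
numbers are the ones given above; the variable names are only
illustrative.

```python
# Back-of-the-envelope estimate using the figures from the text.
people = 10            # developers of comparable caliber
days = 10              # full-time sprint days
lines_per_day = 500    # quality lines per person per day (the upper figure)

sprint_output = people * days * lines_per_day
target = 150_000       # rough size of the relevant C code

print(sprint_output)             # 50000
print(target / sprint_output)    # 3.0
```

So even in the perfect case, one such sprint covers about a third
of the 150.000 lines.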

I don't really believe in such a great success, it will
probably be much less, since programming in groups does
not scale well, sorry. There will be lots of overhead,
discussions, misunderstandings, personal problems, I will
probably get shot, so let's expect 10.000 to 20.000 of
good Python program lines.

Now, think of code like ceval.c, which alone is 3900 lines
of code, and not the simplest code.

Of course, we can create a serious new interpreter, with
all "borrowed" objects wrapped in a proper way quite
quickly, and I still think this is a good idea.
But I think we can do much better.
And most probably, people will not get bored doing the
implementation, see part three.

Part Two: Making you frightened about the C code
------------------------------------------------

No offense to the python-dev people (also, since I do
belong to this group a little bit), the C code base
is absolutely great. As a code base written in C, of course.

But I would like to encourage everybody to pick some
medium-sized C source file and try to translate it into
Python. It is possible, and it isn't too difficult.
But it makes you stumble and stumble and stumble.
The more you look at it, the more you recognize that it is
quite close to assembly language. Everything is written
down, expanded in some rather efficient way, there
is not much abstraction. There is no inheritance,
but there are lots of repetitions of similar but
not identical code.

You are confronted with exceptions which need some mapping.
You see all the primitive types being used all the time,
and you'll wonder how to map them.
(Yes, we can set up general directives for how to do that.)
You will also find lots of structures which need to be
implemented.
Finally, you find myriads of builtin Py...stuff...()
runtime functions which you need to emulate somehow.

Then, looking at the frequency of python-checkins, you
will find that your translation work will be voided
in some near future. python-dev is improving things all
the time, and you will be kept busy for a lifetime
adjusting your Python version.
This might come to an end if the core developers
finally decide to drop the C implementation in favor of
our new project.
But this can only happen if we are fast enough!

Part Three: Proposing A Radical Consequence
-------------------------------------------

I see no point in wasting man-years of coding to re-invent
the wheel by translating C code into Python code piece
by piece.
For sure, there are some very relevant modules which might
need to be hand-coded.
But, and this is driven by the summary of what I thought
to re-code by hand today:
I believe that it is possible to automate this translation
process!
We can set up some default mappings for the most frequent C
constructs.
There are a number of free-ware C compilers around, and also
some C interpreters.
My vision since today is now to augment such a compiler
to become a Python extension, and then run this compiler
over all the C code.
The Python extension should then try to provide a re-write
of the C code in Python!
There are some simple rules to be obeyed, which come off
the top of my head and can be changed as needed, just to
give an example:

For every structure that appears in the source, emit an
appropriate class definition, based upon a base class that
is designed to handle structures.
For every switch statement, create a corresponding number of
local functions (indeed making use of the new scopes), and
prepare a dispatcher table for all the functions.
For every simple for loop, create an xrange construct.
For every non-simple for loop, create a while loop with
a break condition.
For every simple type instantiation, create a similar object
that derives from a class that describes such simple types.
For every macro constant, use a constant notation.
For every macro function, provide a Python function.
Remove every Py_INCREF and every Py_DECREF.
Instead, let's automate that, using more reference counts
than necessary, since this can be deduced by a good
code generator, later. The interpreter doesn't do it
differently, anyway.
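
To make the structure and switch rules a bit more concrete, here is
a minimal sketch of what such generated Python might look like. It
is only an illustration under assumptions: the names Struct, Point
and binary_op are made up, the C fragments in the comments are toy
examples rather than code from the Python sources, and fall-through
between switch cases is not handled.

```python
# Hypothetical output of the imagined translator; all names are illustrative.

class Struct:
    """Base class for translated C structs: named fields with defaults."""
    _fields = ()          # sequence of (name, default) pairs

    def __init__(self, **values):
        for name, default in self._fields:
            setattr(self, name, values.pop(name, default))
        if values:
            raise TypeError("unknown fields: %s" % ", ".join(sorted(values)))


# C:  struct point { int x; int y; };
class Point(Struct):
    _fields = (("x", 0), ("y", 0))


# C:  switch (op) { case 0: r = a + b; break;
#                   case 1: r = a - b; break;
#                   default: r = 0; }
def binary_op(op, a, b):
    # One local function per case; nested scopes let them see a and b.
    def case_add():
        return a + b

    def case_sub():
        return a - b

    def case_default():
        return 0

    # The dispatcher table replaces the switch statement.
    dispatch = {0: case_add, 1: case_sub}
    return dispatch.get(op, case_default)()


p = Point(x=3)
print(p.x, p.y)              # 3 0
print(binary_op(0, 2, 5))    # 7
print(binary_op(1, 2, 5))    # -3
print(binary_op(9, 2, 5))    # 0
```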

This list is far from complete.
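
The loop and macro rules above can be sketched the same way. Again
this is a hedged illustration, not the output of any existing tool:
MAXSIZE, MAX, sum_below and collatz_steps are invented names, and
the C fragments in the comments are toy examples. In current Python,
range plays the role that xrange played in 2003.

```python
# C:  #define MAXSIZE 16
MAXSIZE = 16


# C:  #define MAX(a, b) ((a) > (b) ? (a) : (b))
# Unlike the C macro, the function evaluates each argument exactly once.
def MAX(a, b):
    return a if a > b else b


# C (simple loop):  for (i = 0; i < n; i++) total += i;
def sum_below(n):
    total = 0
    for i in range(n):      # xrange in the Python of 2003
        total += i
    return total


# C (not simple):
#   for (i = n; i > 1; i = (i % 2) ? 3*i + 1 : i / 2) steps++;
def collatz_steps(n):
    steps = 0
    i = n                           # init clause
    while True:
        if not (i > 1):             # loop condition becomes a break
            break
        steps += 1                  # loop body
        i = 3 * i + 1 if i % 2 else i // 2   # step clause
    return steps


print(MAX(3, 7))           # 7
print(sum_below(5))        # 10
print(collatz_steps(6))    # 8
```

A real translator would also have to run the step clause before
every C "continue", which this sketch does not show.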

Addition:
For every C module, provide an extra Python module that is
able to override some of the automatic decisions above.
Special example:
For ceval.c, override all the specialized opcode implementations
which try to optimize integer operations. These should not
be written by hand any longer, but they are the objective of
Psyco's specializing features.
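
One possible shape for such an override module, as a rough sketch
only: the translator looks up each construct by name and prefers the
hand-written version if one exists. Everything here is hypothetical,
including the function names default_translation and translate, the
table CEVAL_OVERRIDES, and the construct names used as keys.

```python
# Hypothetical sketch: none of these names come from an existing tool.

def default_translation(c_name):
    """Stand-in for whatever the automatic translator would emit."""
    return "def %s(*args):\n    raise NotImplementedError\n" % c_name


# Imagined contents of an override module for ceval.c: it maps the name
# of a C-level construct (made-up names here) to hand-written Python
# source that replaces the automatic output.
CEVAL_OVERRIDES = {
    "int_add_fast_path": "def int_add_fast_path(a, b):\n    return a + b\n",
}


def translate(c_names, overrides):
    """Use the override where one exists, the automatic output otherwise."""
    return {
        name: overrides.get(name, default_translation(name))
        for name in c_names
    }


result = translate(["int_add_fast_path", "some_helper"], CEVAL_OVERRIDES)
print(result["int_add_fast_path"])   # the hand-written version
print(result["some_helper"])         # the automatic placeholder
```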

That's what I'm saying today:
Make the move from C to Python automatic, by 95 percent.
Let's modify a C compiler to do most of the tedious tasks
for us. Try to use pattern matching to remove more of the
specializations done for the sake of C.
Remove C-specific optimizations and optimize for
abstractions.
Then, we can try to re-target, create C code or assembly
from that.
My proposal right now is: Let's write (or change) such a
compiler which emits fairly good scripts, and then let's
add modifications which make these into really good scripts.
With some luck, these will withstand the high frequency
of python-dev's code changes, too.

Wow, this was a lot of storm in my brain for today.

I hope it made some sense -- cheers - chris


