How to translate 300000 lines of C
Dear list,

I already announced some concern in a recent message.

[Edward, I need you for this, at least for advice!]

Part One: Making you frightened about the code size
---------------------------------------------------

Running the following command over the current Python CVS src/dist directory:

wc $(find . -name '*.c' -or -name '*.h')

gives this result today (January 20, 2003, 2:31 (GMT+01:00)):

319282 1132750 9397985 total

OK, this is about everything in the core distribution, whether it is needed for Minimal Python (whatever it is) or not. Let's roughly shrink it down to 150.000 lines.

This is 150.000 lines of well-written, tested, evolved, really good C code. Now, a crowd of maybe 5-10 people is going to meet in a sprint by end of February, trying to translate a relevant amount of this mountain of code into Python? Really working Python? Won't they get bored?

I can do 1000 to 3000 lines per day, when re-coding into Python as a prototype. With debugging and code quality stuff, I'm down to 500 or less. Let's assume we have 10 people of comparable caliber. Given that we work for 10 days full-time, nobody being ill, not accounting for the parties we probably will have, everything being perfect, then we *might* have 50.000 lines of quality code done in that period.

I don't really believe in such a great success. It will probably be much less, since programming in groups does not scale well, sorry. There will be lots of overhead, discussions, misunderstandings, personal problems, I will probably get shot, so let's expect 10.000 to 20.000 lines of good Python code.

Now, think of code like ceval.c, which alone is 3900 lines of code, and not the simplest code.

Of course, we can create a serious new interpreter, with all "borrowed" objects wrapped in a proper way quite quickly, and I still think this is a good idea.

But I think we can do much better. And most probably, people will not get bored doing the implementation, see part three.
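The estimate in Part One can be written out as a few lines of Python. The figures are the ones quoted in the message (10 people, 10 days, about 500 debugged lines per person per day, a 150.000-line target); the variable names are only illustrative, and comparing Python lines against C lines is at best a rough guide.

```python
# Sprint estimate from Part One, using the figures quoted in the message.
people = 10
days = 10
lines_per_day = 500        # debugged, quality code per person per day
target_lines = 150_000     # rough size of the C code considered relevant

best_case = people * days * lines_per_day
print(best_case)           # 50000

# The message expects only 10.000 to 20.000 lines in practice.
for expected in (10_000, 20_000):
    share = expected / target_lines
    print(f"{expected} lines is {share:.0%} of the target")
```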
Part Two: Making you frightened about the C code
------------------------------------------------

No offense to the python-dev people (also, since I do belong to this group a little bit), the C code base is absolutely great. As a code base written in C, of course.

But I would like to encourage everybody to pick some medium-sized C source file and try to translate it into Python. It is possible, and it isn't too difficult. But it makes you stumble and stumble and stumble.

The more you look at it, the more you recognize that it is quite near to assembly language. Everything is written down, expanded in some rather efficient way, and there is not much abstraction. There is no inheritance, but there are lots of repetitions of similar but not identical code. You are confronted with exceptions which need some mapping. You see all the primitive types being used all the time, and you'll wonder how to map them. (Yes, we can set up general directives for how to do that.) You will also find lots of structures which need to be implemented. Finally, you find myriads of builtin Py...stuff...() runtime functions which you need to emulate somehow.

Then, looking at the frequency of python-checkins, you will find that your translation work will be voided in the near future. python-dev is improving things all the time, and you will be kept busy for a lifetime adjusting your Python version.

This might come to an end if the core developers finally decide to drop the C implementation in favor of our new project. But this can only happen if we are fast enough!

Part Three: Proposing A Radical Consequence
-------------------------------------------

I see no point in wasting man-years of coding to re-invent the wheel by assembling piece-to-piece from C code to Python code. For sure, there are some very relevant modules which might need to be hand-coded.
But, and this is driven by the summary of what I thought to re-code by hand today: I believe that it is possible to automate this translation process! We can set up some default mappings for the most frequent C constructs.

There are a number of freeware C compilers around, and also some C interpreters. My vision since today is to augment such a compiler to become a Python extension, and then run this compiler over all the C code. The Python extension should then try to provide a re-write of the C code in Python!

There are some simple rules to be obeyed, which come off the top of my head and can be changed as needed, just to give an example:

For every structure that appears in the source, emit an appropriate class definition, based upon a base class that is designed to handle structures.

For every switch statement, create a corresponding number of local functions (indeed making use of the new scopes), and prepare a dispatcher table for all the functions.

For every simple for loop, create an xrange construct.

For every non-simple for loop, create a while loop with a break condition.

For every simple type instantiation, create a similar object that derives from a class that describes such simple types.

For every macro constant, use a constant notation.

For every macro function, provide a Python function.

Remove every Py_INCREF and every Py_DECREF. Instead, let's automate that, using more reference counts than necessary, since this can be deduced by a good code generator, later. The interpreter doesn't do it differently, anyway.

This list is far from complete.

Addition: For every C module, provide an extra Python module that is able to override some of the automatic decisions above.

Special example: For ceval.c, override all the specialized opcode implementations which try to optimize integer operations. These should not be written by hand any longer, but they are the objective of Psyco's specializing features.
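To make two of these rules concrete, here is a minimal sketch of what emitted code might look like for a struct and for a simple for loop. Everything in it is an assumption made for illustration: the `Struct` base class, its `_fields_` attribute, the `point` struct and the `sum_below` function are hypothetical and not taken from CPython. It uses `range`, where the Python 2 of the time would have used `xrange`.

```python
class Struct:
    """Hypothetical base class for classes emitted from C structs."""

    _fields_ = ()  # sequence of (name, default) pairs, filled in by subclasses

    def __init__(self, **values):
        for name, default in self._fields_:
            setattr(self, name, values.pop(name, default))
        if values:
            raise TypeError(f"unknown fields: {sorted(values)}")

    def __repr__(self):
        body = ", ".join(f"{n}={getattr(self, n)!r}" for n, _ in self._fields_)
        return f"{type(self).__name__}({body})"


# What a translator might emit for the (made-up) C declaration:
#     struct point { int x; int y; };
class point(Struct):
    _fields_ = (("x", 0), ("y", 0))


# What it might emit for the simple loop:
#     for (i = 0; i < n; i++) total += i;
def sum_below(n):
    total = 0
    for i in range(n):  # xrange in Python 2
        total += i
    return total
```

With these definitions, `point(x=3)` gives an object whose `y` falls back to the default, and an unknown field name is rejected rather than silently accepted.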
That's what I'm saying today: Make the move from C to Python automatic, by 95 percent. Let's modify a C compiler to do most of the tedious tasks for us. Try to use pattern matching to remove more of the specializations done for the sake of C. Remove C-specific optimizations and optimize for abstractions. Then, we can try to re-target, creating C code or assembly from that.

My proposal right now is: Let's write (or change) such a compiler which emits fairly good scripts, and then let's add modifications which make these into really good scripts. With some luck, these will withstand the high frequency of python-dev's code changes, too.

Wow, this was a lot of storm in my brain for today. I hope it made some sense.

cheers - chris
As a possibly relevant data point: current Jython CVS (2.2 functionality, minus the type/class unification that is missing at the moment, plus Java-specific integration code) is around 60000 lines of Java.
[Christian Tismer Mon, Jan 20, 2003 at 03:27:31AM +0100]
Running the following command over the current Python CVS src/dist directory:
wc $(find . -name '*.c' -or -name '*.h')
gives this result today (January 20, 2003, 2:31 (GMT+01:00)):
319282 1132750 9397985 total
OK, this is about everything in the core distribution, whether it is needed for Minimal Python (whatever it is) or not. Let's roughly shrink it down to 150.000 lines.
I am not sure what this number (300.000) really means at all. For example, 100.000 lines of it are in the Modules directory (not counting their .h files). And getting a running Python-Python-Interpreter doesn't require rewriting all C-stuff.
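To see where the lines actually live, the `wc` total can be broken down by top-level directory. The following is a minimal sketch, assuming it is pointed at a checkout of the source tree; the function name `count_c_lines` is made up here. Note that it counts a final line without a trailing newline, which `wc` does not, so totals may differ slightly.

```python
import os
from collections import Counter


def count_c_lines(root):
    """Count lines in *.c and *.h files, grouped by top-level directory."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Files directly in root are grouped under "."
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            if name.endswith((".c", ".h")):
                with open(os.path.join(dirpath, name), "rb") as f:
                    totals[top] += sum(1 for _ in f)
    return totals
```

Calling `count_c_lines(".")` in the checkout and printing `totals.most_common()` would show how much of the total sits in a directory such as Modules.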
Now, think of code like ceval.c which is alone 3900 lines of code, and not the most simple code.
I would be surprised if this didn't translate to fewer than 1000 lines of nice Python code.
Of course, we can create a serious new interpreter, with all "borrowed" objects wrapped in a proper way quite quickly, and I still think this is a good idea.
good :-)
But I think we can do much better. And most probably, people will not get bored to do the implementation, see part three.
Part Two: Making you frightened about the C code
------------------------------------------------
No offense to the python-dev people (also, since I do belong to this group a little bit), the C code base is absolutely great. As a code base written in C, of course.
But I would like to encourage everybody to pick some medium-sized C source file and try to translate it into Python. It is possible, and it isn't too difficult. But it makes you stumble and stumble and stumble.
btw, I would think there are two orders of magnitude more Python code out there than Python C extensions.
[more analysis of how much good C-code there is]
Part Three: Proposing A Radical Consequence
-------------------------------------------
I see no point in wasting man-years of coding to re-invent the wheel by assembling piece-to-piece from C code to Python code. For sure, there are some very relevant modules which might need to be hand-coded. But, and this is driven by the summary of what I thought to re-code by hand today: I believe that it is possible to automate this translation process! We can set up some default mappings for the most frequent C constructs. There are a number of freeware C compilers around, and also some C interpreters. My vision since today is to augment such a compiler to become a Python extension, and then run this compiler over all the C code. The Python extension should then try to provide a re-write of the C code in Python! ... That's what I'm saying today: Make the move from C to Python automatic, by 95 percent.
Now *this* seems like a huge undertaking, which requires dealing with C parsers for starters. Don't get me wrong, but I don't believe in this route, yet. But I will do as you say and spend more time recoding some stuff in Python and trying to get it to work. Maybe you are right.

cheers, holger
I already announced some concern in a recent message.
[Edward, I need you for this, at least for advice!]
Thanks for the vote of confidence. Whenever I am confronted with a problem that seems big, ugly and messy I devote myself to trying to find a way of avoiding it altogether. I suggest you do the same, focusing on what it is that Psyco does at "runtime" to discover optimizations.

Sorry if this sounds a bit like the Delphic oracle. The only advice I can give is to change your point of view so that a) somehow the problem goes away or b) somehow you can use already existing tools to make it much easier. Actually, you've done both in your original post. I would simply encourage you to keep doing that :-)

Edward
--------------------------------------------------------------------
Edward K. Ream email: edream@tds.net
Leo: Literate Editor with Outlines
Leo: http://personalpages.tds.net/~edream/front.html
--------------------------------------------------------------------
Hm, maybe people could take responsibility for one or two modules and rewrite them in Python? I would try the structmodule, for example...

Thomas
[Thomas Heller Mon, Jan 20, 2003 at 10:09:16AM +0100]
Hm, maybe people could take responsibility for one or two modules and rewrite them in Python?
I think this is a good idea. Christian's idea of automatically translating CPython source to Python code sure makes sense. But there will be quite a few modules which we have to code manually, and the structmodule is certainly among them.

Regarding the automatic translation: At the last FOSDEM conference I talked to Richard Dale, who wrote a tool (IIRC "Koala") which automatically generates 5 or 6 bindings for the KDE/Qt library directly from the C++ code, for Java, Objective C and others. Although he was very experienced, it took him a year or so to get it going. But maybe "translating" CPython code is considerably simpler than this.

Anyway, I think we can start from 'both sides': manually rewriting stuff in Python, and working on a C-to-Python translator.

cheers, holger
holger krekel wrote: ...
Anyway, i think we can start from 'both sides'. Manually rewriting stuff in python and work on a C-to-Python translator.
Absolutely, everybody should work on the ends she sees fit. To get something quickly, we cannot rely on C2P right now. To get everything translated, we cannot wait for the manpower to do it by hand.

Furthermore, for certain modules it makes very much sense to write them by hand, *and* we probably need some as templates, reference implementations for targeting the C2P processor.

Plenty to do; trying all paths in parallel will move the project forward. It also cannot all be planned in advance. I'm thinking more of evolution and Extreme Programming style than of classical design.

will-be-extreme-programming-fun -y y'rs chris

--
Christian Tismer :^) <mailto:tismer@tismer.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
Hello Christian, On Mon, Jan 20, 2003 at 03:27:31AM +0100, Christian Tismer wrote:
I believe that it is possible to automate this translation process!
Yes! I think it is a very good idea. I would certainly be much more happy with keeping a reasonable-sized translator up-to-date than having to do so with the huge C code base. Let's be clear, we cannot automate the whole translation process, and setting this up might take as long as manually translating most of CPython, but I am confident that it will be a big win afterwards (and I am sure you know it better than me, having discovered it the hard way). The point is not to blindly translate the C code into Python code that is guaranteed to do the same thing. Instead, we need to discover the high-level structure of the C code and map this to Python. It should be relatively easy given that the whole CPython code follows consistent style guidelines. All we need is a C parser; translation could be done from the resulting syntax tree.
For every switch statement, create a corresponding number of local functions (indeed making use of the new scopes), and prepare a dispatcher table for all the functions.
Maybe just write a chain of if:elif:. New scopes are not completely sufficient because they won't let us modify a variable from the parent scope.
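The two translations of a switch statement can be compared in a small sketch. The operation names `add` and `neg` are made up for illustration. The scoping problem described here was real for the Python of 2003, where a nested function could read but not rebind a variable of its enclosing function; the sketch runs on Python 3, where the later `nonlocal` statement lifts that restriction.

```python
def run_with_elif(ops):
    """Switch translated as an if/elif chain: 'acc' stays an ordinary local."""
    acc = 0
    for op, arg in ops:
        if op == "add":
            acc += arg
        elif op == "neg":
            acc = -acc
        else:
            raise ValueError(f"unknown op {op!r}")
    return acc


def run_with_table(ops):
    """Switch translated as local functions plus a dispatch table."""
    acc = 0

    def do_add(arg):
        # Without 'nonlocal', this assignment would make 'acc' local to
        # do_add and raise UnboundLocalError when the function is called.
        nonlocal acc
        acc += arg

    def do_neg(arg):
        nonlocal acc
        acc = -acc

    table = {"add": do_add, "neg": do_neg}
    for op, arg in ops:
        table[op](arg)
    return acc
```

Both functions compute the same result for the same input; the table version only became this simple once `nonlocal` existed.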
For every macro constant, use a constant notation. For every macro function, provide a Python function.
Yes. In no case should we preprocess the C code to replace the macros by their definitions. This would lose essential high-level information.
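As a rough illustration of keeping macros as named entities instead of expanding them, here is a toy translator for single `#define` lines. It is only a sketch under strong assumptions: it handles one-line macros whose body happens to be a valid Python expression, and the names `translate_define`, `MAX_ITEMS` and `SQUARE` are invented for the example. A real tool would work from a parser rather than regular expressions.

```python
import re

# Function-like macro: the name is immediately followed by "(", as in C.
FUNC = re.compile(r"#define\s+(\w+)\(([^)]*)\)\s+(.+)")
# Object-like macro: a name followed by a replacement text.
CONST = re.compile(r"#define\s+(\w+)\s+(.+)")


def translate_define(line):
    """Translate one simple #define line into Python source, or return None."""
    line = line.strip()
    m = FUNC.fullmatch(line)
    if m:
        name, params, body = m.groups()
        return f"def {name}({params}):\n    return {body}"
    m = CONST.fullmatch(line)
    if m:
        name, value = m.groups()
        return f"{name} = {value}"
    return None  # not a #define this sketch understands
```

For example, `translate_define("#define MAX_ITEMS 20")` yields the source text `MAX_ITEMS = 20`, so the constant keeps its name in the emitted Python.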
Addition: For every C module, provide an extra Python module that is able to override some of the automatic decisions above.
Yes. Never change the emitted Python code directly; it would prevent us from keeping up-to-date with CPython. We need some way to give hints to the translator (small hints or whole hand-tuned versions of some functions). Attaching a Python module to each C source file looks like a good way to do it, although we might also consider adding the hints directly into the C source at the point where they apply, as C comments (or #ifdef'ed-away lines). An advantage of this is that CVS will warn us in case of conflicts between our hints and CPython updates. Well, maybe there is a need for both inline hints and attached Python modules.
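One possible shape for the attached hint module is a set of hand-written definitions that replace the emitted ones by name. The sketch below assumes both sides are available as plain dictionaries mapping names to objects; `build_module` is a hypothetical helper, not part of any existing tool. It refuses overrides for names the translator no longer emits, which is a crude stand-in for the conflict warning mentioned above.

```python
def build_module(emitted, overrides):
    """Combine automatically emitted definitions with hand-written overrides."""
    unknown = set(overrides) - set(emitted)
    if unknown:
        # An override without a matching emitted name probably means the
        # C source changed underneath the hint module.
        raise KeyError(f"overrides for unknown names: {sorted(unknown)}")
    merged = dict(emitted)
    merged.update(overrides)  # hand-tuned versions win
    return merged
```

The emitted code itself is never edited; regenerating it and calling `build_module` again reapplies every hint.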
For ceval.c, override all the specialized opcode implementations which try to optimize integer operations. These should not be written by hand any longer, but they are the objective of Psyco's specializing features.
Yes, although I would say that the main loop deserves some special treatment. There is no need, for example, to copy the code that calls Py_MakePendingCalls() every _Py_CheckInterval bytecode instructions. This is a parallel aspect that we might want to add or not later, like reference counting. The big switch should be special-cased into a bundle of frame methods with the dispatch table. The Python-in-Python interpreter main loop should be hand-written. Each opcode function is itself produced by the C-to-Python translator unless otherwise specified.
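A toy version of "frame methods with the dispatch table" might look like the following. It is only a sketch: the `Frame` class and its encoding of code as a list of (opcode name, argument) pairs are invented for illustration and bear no relation to real bytecode, although the three opcode names are borrowed from CPython. In the scheme described above, the opcode methods would come from the translator and only `run` would be written by hand.

```python
class Frame:
    """Toy frame: a value stack plus one method per opcode."""

    def __init__(self, code):
        self.code = code      # list of (opcode_name, argument) pairs
        self.stack = []
        self.pc = 0
        self.result = None
        self.done = False

    def LOAD_CONST(self, arg):
        self.stack.append(arg)

    def BINARY_ADD(self, arg):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)

    def RETURN_VALUE(self, arg):
        self.result = self.stack.pop()
        self.done = True

    # Dispatch table built from the plain functions defined above.
    dispatch = {
        "LOAD_CONST": LOAD_CONST,
        "BINARY_ADD": BINARY_ADD,
        "RETURN_VALUE": RETURN_VALUE,
    }

    def run(self):
        # Hand-written main loop; periodic checks and reference counting
        # are deliberately left out.
        while not self.done:
            name, arg = self.code[self.pc]
            self.pc += 1
            self.dispatch[name](self, arg)
        return self.result
```

Running a frame built from two `LOAD_CONST` entries, a `BINARY_ADD` and a `RETURN_VALUE` returns the sum of the two constants.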
My proposal right now is: Let's write (or change) such a compiler which emits fairly good scripts, and then let's add modifications which make these into really good scripts.
I believe you are absolutely right. A bientot, Armin.
On Mon, Jan 20, 2003 at 03:27:31AM +0100, Christian Tismer wrote:
I believe that it is possible to automate this translation process!
Yes! I think it is a very good idea. I would certainly be much more happy with keeping a reasonable-sized translator up-to-date than having to do so with the huge C code base.
In a private email to Christian I suggested making this whole problem go away by changing the name of this project from minimalPython to psycoticPython :-)

Whether automated or not, translating tested C code to Python seems extremely difficult and risky. It is risky because it implies one of two speculative assumptions:

1. The Python library will eventually outperform the C library, or:
2. Guido will at some point approve supporting _two_ versions of the same library.

I view assumption 2 as having almost zero probability, though of course I don't speak for Guido in any way. The reason is plain: it is odious to keep two sets of source code in synch.

That leaves assumption 1. No point in arguing over the probabilities of it now: let's assume it will be proved correct. I would be inclined to pick _one_ module to work on as a test bed. Translation can be done by hand. We can then test assumption 1.

The bigger translation problem becomes real only if assumption 1 is proved to be true. Even then, I would imagine a _lengthy_ probationary period for each translated module before it becomes accepted into the library. So it isn't so important how long translation takes; the translation process is much less important than the testing process.

My script c2py.py works only on translating C to Python syntax. It's complex enough. The history of machine translation of natural languages is littered with initial failure, in some cases with limited success after decades of work. Myself, I wouldn't invest any time at all in automatically translating C semantics to Python semantics. YMMV.

Edward
Edward K. Ream wrote:
On Mon, Jan 20, 2003 at 03:27:31AM +0100, Christian Tismer wrote:
I believe that it is possible to automate this translation process!
Yes! I think it is a very good idea. I would certainly be much more happy with keeping a reasonable-sized translator up-to-date than having to do so with the huge C code base.
In a private email to Christian I suggested making this whole problem go away by changing the name of this project from minimalPython to psycoticPython :-)
Oh, I didn't get that until now. :-)
Whether automated or not, translating tested C code to Python seems extremely difficult and risky. It is risky because it implies one of two speculative assumptions:
1. The Python library will eventually outperform the C library, or:
2. Guido will at some point approve supporting _two_ versions of the same library.
I view assumption 2 as having almost zero probability, though of course I don't speak for Guido in any way. The reason is plain: it is odious to keep two sets of source code in synch.
That leaves assumption 1. No point in arguing over the probabilities of it now: let's assume it will be proved correct. I would be inclined to pick _one_ module to work on as a test bed. Translation can be done by hand. We can then test assumption 1.
Fine with me.
The bigger translation problem becomes real only if assumption 1 is proved to be true. Even then, I would imagine a _lengthy_ probationary period for each translated module before it becomes accepted into the library. So it isn't so important how long translation takes; the translation process is much less important than the testing process.
That's very true. The testing process will probably take longer than one or two new Python releases. We have to run in parallel for a reasonable period. That's why we need a semi-automated process that is easy to use on a changed code base.
My script c2py.py works only on translating C to Python syntax. It's complex enough. The history of machine translation of natural languages is littered with initial failure, in some cases with limited success after decades of work. Myself, I wouldn't invest any time at all in automatically translating C semantics to Python semantics. YMMV.
Well, it is not C, it is Pythonic C already. That's much simpler than C. (Which means, it doesn't use each and every possible trick in C, it has cleanly separated statements, very little usage of macros, and all ambiguous-looking constructs are well-embraced.)

I also don't intend to automatically translate the whole bunch without looking into the output. Instead, I think of a C parser which emits a series of tokens, or maybe AST objects, which is then fed into a Python code generator. This generator should only provide some common rules for how to map certain constructs. It should stop in a situation it cannot handle. The porting work is to write configuration scripts for that, which control what to map how. I think this is quite an interactive process, but with the benefit that it is most probably repeatable for a slightly changed new Python version.

There are also common patterns which should be replaced by some more abstract Python functions, which describe *what* is happening, instead of always telling *how* to do it, in an inlined way. This is what I call "uplifting".

This is of course no quick process. The automated tool will help us to avoid tedious work, and to avoid errors by systematic mappings. And we can play with that and configure and fine-tune, until the result looks as we like it. Not to mention all the new ideas which we will have while we're at it.

Right now, everything is an oracle.

ciao - chris
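The pipeline sketched in this message (syntax tree in, Python source out, stop on anything unknown, extend through configuration) can be illustrated with a very small generator. Everything here is an assumption made for the example: the tuple-based node format, the rule names and the `CannotTranslate` exception are invented, and a real C parser would produce far richer trees.

```python
class CannotTranslate(Exception):
    """Raised when the generator meets a construct it has no rule for."""


def emit(node, rules):
    """Emit Python source for one node, using a table of mapping rules."""
    kind = node[0]
    try:
        rule = rules[kind]
    except KeyError:
        raise CannotTranslate(f"no rule for {kind!r} node") from None
    return rule(node, rules)


def emit_num(node, rules):
    return str(node[1])


def emit_name(node, rules):
    return node[1]


def emit_binop(node, rules):
    _, op, left, right = node
    return f"({emit(left, rules)} {op} {emit(right, rules)})"


def emit_assign(node, rules):
    _, target, value = node
    return f"{target} = {emit(value, rules)}"


DEFAULT_RULES = {
    "num": emit_num,
    "name": emit_name,
    "binop": emit_binop,
    "assign": emit_assign,
}
```

A configuration script in this picture would be little more than `dict(DEFAULT_RULES, some_kind=some_rule)`: adding or replacing rules without touching the generator or its output.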
Hello, On Mon, Jan 20, 2003 at 03:16:00PM +0100, Christian Tismer wrote:
Whether automated or not, translating tested C code to Python seems extremely difficult and risky. It is risky because it implies one of two speculative assumptions:
1. The Python library will eventually outperform the C library, or:
2. Guido will at some point approve supporting _two_ versions of the same library.
I'm not sure these are the fundamental assumptions. The goal we have here is to write Python in Python. The translator we are debating is only a tool to achieve this goal in a way that greatly helps keep our source in sync with CPython's. In fact, it is precisely because we don't want to support two versions of the same code that we need some help from such a tool.

Again, it is out of the question to design a tool that reliably translates arbitrary C code to Python. The goal is to use simple rules and hand-made patterns to emit Python code, and then check *all* the emitted Python code and fine-tune it if needed --- with configuration scripts fed to the translator, not by directly changing the emitted Python code.

What we can then do with such a Python-in-Python interpreter (e.g. emit good C code again) is another story.

A bientôt, Armin.
participants (6): Armin Rigo, Christian Tismer, Edward K. Ream, holger krekel, Samuele Pedroni, Thomas Heller