From PyPy to Psyco
Hello everybody,

Ok, sorry about not posting earlier on this list. Sorry too for the long e-mail; I wanted to reply to individual e-mails, but it seems that the subjects always overlap. So I'm just dropping my thoughts here.

Bootstrapping issues were quite widely discussed. There are several valid approaches in my opinion. I would say that we should currently stick to "Python in Python"'s first goal: to have a Python interpreter up and running, written entirely in Python.

(1) write a Python interpreter in Python, keeping (2) in mind, using any recent version of CPython to test it. Include at least the bytecode interpreter and redefinition of the basic data structures (tuples, lists, integers, frames...) as classes. Optionally add a tokenizer-parser-compiler to generate bytecode from source (for the first tests, using the underlying compile() function is fine).

Only when this is done can we invent nice ways to do interesting things on this interpreter. The approach I prefer is:

(2) have a tool that can perform a static analysis of the core of (1). To make this possible, this core has to be written in a restricted coding style, using simple constructs (and not full of Python tricks and optimizations; something like what Rocco describes).

(3) this tool compiles the core (via C code) into several different things:

(3a) the classical approach (like e.g. in Smalltalk) is to emit C code that is very similar to the original Python code, much like Pyrex does. We obtain a stand-alone minimal interpreter. By interpreting the rest of (1), we get an already nice result: a stand-alone complete Python interpreter with a small amount of C -- and, more interestingly, almost all this C code was itself automatically generated from Python code. To bootstrap the tokenizer-parser-compiler written in unrestricted Python, simply provide a bytecode-precompiled version of it along with the C file.
Keep in mind that both the C files and the precompiled .pyc files are intermediate files, automatically regenerated as often as needed, and only distributed for bootstrapping purposes if we want to be independent from CPython.

(3b) among other things that can be generated from (2) is a bytecode checker, which CPython currently lacks and which Guido sometimes thinks would be a nice addition to prevent 'new.code()' from crashing the interpreter. (I already experimented with this, it can work.) (Note that we are not forced a priori to choose the same set of bytecodes as CPython, but doing so is probably a good idea at this stage.)

(3c) Psyco stuff: given a few extra hints, the code generated from (2) can be Psyco itself. I mean, I am not thinking about bringing the current C-based Psyco into play to execute (1) faster. This would at best give the same performance as CPython --- and I seriously doubt it can be as fast as that, given that Psyco's limited back-end cannot compete with serious C compilers. Instead, we can generate C code which constitutes a new Psyco for the language that (1) is an interpreter for. In short, we would have translated our interpreter into a specializing compiler. What is nice, of course, is that the same would also work if (1) were an interpreter for a different language. This is not magic; specialization is known to be a tool that can translate any interpreter into a compiler. What's new here is the dynamic nature of the choice of what is compile-time and what is run-time.

For (3b) and (3c) I am thinking about emitting C code, but this is not a requirement: it would be possible to emit Python code too, for example to build a Python-based bytecode checker. But if we did the static type analysis in (2), we might just as well emit C code to keep the discovered static type declarations.

Related points:

* Platform-specific bootstrapping tools.
Looks like I favour the idea that everything is managed by the tool in (3a) (which is itself in Python, of course). This tool could emit platform-specific C code when possible, or (for redistribution) a generic low-quality platform-independent version that would suffice to run (3a) again. As the C code is not really meant to be compiled more than once, we can avoid 'make' tools --- for all I care it could even be a single large C file, as it is not manually managed anyway.

* Representation of data structures. Use Python classes, e.g. integers are implemented using a "class PyIntObject" with an ob_ival attribute (or property). These classes are straightforwardly translated into a C struct by (3a). The structure can be made compatible with CPython's PyIntObject, or alternatively can be built for a GC-based interpreter with no reference counting, if we wanted to try such a thing.

* Foreign Function Interface (a.k.a. calling external C functions). Two approaches here. I believe that in a first stage it is sufficient to emit the real calls in our C code, by translating some static Python declaration with (3a). All such callable functions must be pre-declared to get compiled into the stand-alone core interpreter. This doesn't give us dynamic call features like those offered by some existing CPython C extensions. But a form of dynamic calls is necessary for Psyco --- it must at the very least be able to emit machine code that calls arbitrary C functions. So I'm trying to push this whole issue into the context of emitting machine code by Psyco: when this works it should not be a problem to expose some of the techniques in a built-in extension module to let end users call arbitrary C functions. Before that I don't think there is a need for enhanced 'struct', 'calldll' and 'ctypes' modules.

* Psyco as a compiler.
If no C compiler is available, Psyco can be used to emit statically-known code; I guess we can bootstrap a whole Python interpreter without even leaving CPython, just by having Psyco's back-end do the (3a) or even (3c). That's nice, although I cannot think of a real use for this :-)

* CPython. The above plan only uses CPython for its ability to run Python programs and for the hopefully shrinking number of features that are not re-implemented in (1). The CPython source code is used for inspiration, not compilation.

* Python Virtual Machine. In the above plan there is no need for a small Python VM in C for bootstrapping.

* Assembly Virtual Machine. Something else that Christian mentioned was a low-level VM that would provide a cool target for emitted machine code. For all static stuff I see C as a nicer low-level target, but for Psyco it might be interesting to have a general platform-independent target.

Another cool Psyco thing would be to never actually emit code in a first phase, but only gather statistics that would later let a specialized C version of parts of the program be written and compiled statically.

There are so many cool things to experiment with, I can't wait to have (1) and (2) ready --- but I guess it's the same for all of us :-)

A bientot,

Armin.
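To make the "Representation of data structures" point concrete, here is a minimal sketch of how a class written in a restricted style could be turned into a C struct declaration. Only PyIntObject and ob_ival come from the message above; the `fields` table, the `C_TYPES` mapping, `int_add` and `emit_c_struct` are hypothetical names invented for illustration, and a real tool as described in (2) and (3a) would infer the types by static analysis rather than read them from a hand-written table.

```python
# Hypothetical sketch: only PyIntObject and ob_ival appear in the message;
# every other name here is an assumption made for illustration.

C_TYPES = {int: "long", float: "double"}  # assumed Python-to-C type mapping


class PyIntObject:
    """Interpreter-level integer written in a restricted coding style."""

    # Static field declaration that a translation tool could read directly.
    fields = (("ob_ival", int),)

    def __init__(self, ob_ival):
        self.ob_ival = ob_ival


def int_add(a, b):
    """Restricted-style operation: plain attribute reads, no Python tricks."""
    return PyIntObject(a.ob_ival + b.ob_ival)


def emit_c_struct(cls):
    """Return the text of a C struct declaration for a class with `fields`."""
    lines = ["typedef struct {"]
    for name, pytype in cls.fields:
        lines.append("    %s %s;" % (C_TYPES[pytype], name))
    lines.append("} %s;" % cls.__name__)
    return "\n".join(lines)


c_source = emit_c_struct(PyIntObject)
print(c_source)
```

The same class runs unchanged on top of CPython for testing, which is the property the plan relies on: the Python source is the reference and the C text is a regenerated intermediate file. A layout compatible with CPython's own PyIntObject would also need the object header fields, which this sketch leaves out.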
Bootstrapping issues were quite widely discussed. There are several valid approaches in my opinion. I would say that we should currently stick to "Python in Python"'s first goal: to have a Python interpreter up and running, written entirely in Python.
[snip]
There are so many cool things to experiment with, I can't wait to have (1) and (2) ready --- but I guess it's the same for all of us :-)
Actually, I believe this _last_ paragraph is the heart of the matter, not the traditional bootstrapping issues.

Cockpit warning sounds: Whoop, whoop: war story! Whoop, whoop: war story! :-)

By far the most important moment in Leo's 7-year history was the moment that I saw how to begin to use Leo without actually having it. Leo is a combination of an outliner and traditional programming techniques. I had a vague notion that the combination was going to be effective, but I was stuck: building an outliner is a _big_ task, and I wasn't sure exactly what kind of outliner would work well with the programming constructs.

I was talking to Rebecca on the way back from a one-day ski outing, vaguely mulling over the problems (she's not a programmer, and she is a great listener :-) when it suddenly struck me that I could use the MORE outliner as an "instant prototype". I would just embed my experimental code in the MORE outline. I would then copy the outline to the clipboard by hand using MORE's copy command. Finally, I would write a little program (M2C for More to C) to take the stuff off the clipboard and create proper C source code that I could then compile.

Naturally, the first "outline-oriented" program I wrote in MORE was M2C. This took a few hours. I then simulated by hand the output of M2C on M2C. The result was the C code for M2C. Once I debugged M2C I was in business. It all took less than 2 days.

The point is this: I could use MORE _immediately_, even without actually having M2C, and certainly without writing something as complex as the MORE outliner. As soon as I shifted my point of view I was able, within seconds, to experiment with the combination of outlines and literate programming. Within minutes all my doubts about the combination of the two techniques vanished. Within an hour I evolved a new kind of programming style that has remained remarkably constant for over 7 years. Within a few days I had a working prototyping system.
(end of war story: transcript of cabin voice recorder ends)

I believe something this good can be done with psyco. My ideas:

1. We now have some "safety proofs" in place that show that there is absolutely no need to worry about performance during the initial experimentation/prototyping phase of this project.

2. We already have a superb language tool, namely Python. We must exploit Python to the fullest.

3. We want a bootstrapping scheme that gets us (or rather Armin :-) going _now_: preferably within hours or days, and at most within a week.

Putting these ideas together, I suggest the following:

1. Ignore all issues relating to the ultimate target language. In other words, use Python as the target language.

2. Ignore all issues relating to speed. Focus instead on the algorithms that psyco will use and all the nifty experiments that Armin wants to run yesterday. Many of these experiments will involve looking at the target code that gets produced from particular programs/byte codes.

3. Modify Python's logic (it may be possible to do with a simple patch written in Python) so that Python looks for .pyp files and loads them as needed before looking for .pyc files. I believe this can be done very quickly.

4. Put nothing but Python code and data into the .pyp files! The "bootstrap loader" is the code that loads .pyp files. It does one of the following:

   a. an import of the .pyp file (changing its type temporarily to .py presumably)

   b. an exec on the entire contents of the .pyp file.

   In either case, some cleverness will be needed so that the import or exec will execute psyco with the proper data. This cleverness is the province of the code emitters...

5. Modify psyco so it outputs Python code, not C or machine code. The "code emitters" write _whatever is useful_ to the .pyp file. The code emitters might use str(x) to dump psyco's x data structure.
At worst (if str could not be used), the code emitters would write the Python data structures used by psyco to the .pyp file _as python code and data_. As I said before, some cleverness may be needed so that the Python code in the .pyp file ends up executing psyco again, but this is "routine cleverness".

Armin is free to dump whatever Python code he wants into the .pyp file. There is no need for formal specifications and no need for the Python code to have a consistent format. Just blast away. Presumably, Armin will design the .pyp file so that it is easy to see the results of his experiments.

The advantages are these:

- This can all be done within hours--days at the most.

- There may be no need for further group design work.

- This ignores everything that should be ignored, namely all implementation details.

- We get the highest-level, most flexible framework for experimentation, namely the Python code and data in .pyp files. This Python code is the highest-level representation of the generated code, and it is the clearest possible way to see the results of experimentation.

- It is an immediate path to psyco in python.

- There is little or no need to create an interp in Python.

HTH :-)

Edward

P.S. Yes, the results of experimentation will be Python code. Yes, the experimental code will run slower (maybe much slower) than .pyc files given to the C interp. That doesn't matter. What _does_ matter is that Armin will be up and running quickly with an extremely clear, powerful and flexible experimental environment.
For example, the code given in another thread:

    PyObject* my_function(PyObject* a, PyObject* b, PyObject* c)
    {
        int r1, r2, r3;
        if (a->ob_type != &PyInt_Type) goto uncommon_case;
        if (b->ob_type != &PyInt_Type) goto uncommon_case;
        if (c->ob_type != &PyInt_Type) goto uncommon_case;
        r1 = ((PyIntObject*) a)->ob_ival;
        r2 = ((PyIntObject*) b)->ob_ival;
        r3 = ((PyIntObject*) c)->ob_ival;
        return PyInt_FromLong(r1+r2+r3);
    }

will appear in the .pyp file as something like this:

    def my_function__(a,b,c):
        if a.ob_type__ != PyInt_Type__: do_uncommon_case__()
        if b.ob_type__ != PyInt_Type__: do_uncommon_case__()
        if c.ob_type__ != PyInt_Type__: do_uncommon_case__()
        r1 = a.ob_ival__
        r2 = b.ob_ival__
        r3 = c.ob_ival__
        return PyInt_FromLong__(r1+r2+r3)

I've added trailing double underscores throughout just to indicate that I don't understand any of the implementation details of psyco in psyco.

Presumably the generated prototype Python code will gather lots of statistics. The statistics _themselves_ can be written to the .pyp file as plain Python data structures.

EKR
--------------------------------------------------------------------
Edward K. Ream email: edream@tds.net
Leo: Literate Editor with Outlines
Leo: http://personalpages.tds.net/~edream/front.html
--------------------------------------------------------------------
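The .pyp round trip proposed in this message (a code emitter writes Python code and data, a bootstrap loader execs the whole file) can be sketched as follows. This is only an illustration under stated assumptions: `emit_pyp`, `load_pyp`, `IntBox__`, `STATS__`, the file name and the statistics key are all hypothetical, the generated function is a simplified stand-in for my_function__, and the sketch uses repr() rather than str() so that the dumped data reads back as valid Python.

```python
import os
import tempfile


def emit_pyp(path, source, stats):
    """Hypothetical code emitter: write generated code plus statistics as Python text."""
    with open(path, "w") as f:
        f.write(source)
        # repr() of plain data structures is itself valid Python source.
        f.write("\nSTATS__ = %r\n" % (stats,))


def load_pyp(path):
    """Hypothetical bootstrap loader: exec the entire file in a fresh namespace."""
    with open(path) as f:
        text = f.read()
    namespace = {}
    exec(compile(text, path, "exec"), namespace)
    return namespace


# Stand-in for emitted code; IntBox__ plays the role of an integer object.
GENERATED = '''
class IntBox__:
    def __init__(self, ob_ival__):
        self.ob_ival__ = ob_ival__

def my_function__(a, b, c):
    for x in (a, b, c):
        if type(x) is not IntBox__:
            raise TypeError("uncommon case")
    return IntBox__(a.ob_ival__ + b.ob_ival__ + c.ob_ival__)
'''

with tempfile.TemporaryDirectory() as tmp:
    pyp_path = os.path.join(tmp, "example.pyp")
    emit_pyp(pyp_path, GENERATED, {"int_fast_path_hits": 0})
    ns = load_pyp(pyp_path)

box = ns["IntBox__"]
result = ns["my_function__"](box(1), box(2), box(3))
print(result.ob_ival__, ns["STATS__"])
```

This corresponds to option (b) of the bootstrap loader, the exec on the file contents. Making the interpreter look for .pyp files before .pyc files would additionally need a hook into the import machinery, which is not shown here.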
Hello Edward,

On Sun, Jan 19, 2003 at 09:36:22AM -0600, Edward K. Ream wrote:
3. Modify Python's logic (it may be possible to do with a simple patch written in Python) so that Python looks for .pyp files and loads them as needed before looking for .pyc files. I believe this can be done very quickly.
I am not convinced by the whole .pyp idea, but it can certainly be experimented with. I regard it as an optimization only --- which might be very worthwhile, but that's not the point. We should avoid any kind of optimization at all for the current Python-in-Python interpreter. Similarly, there is no need to support .pyc files to claim CPython compatibility.

Armin
participants (2)
-
Armin Rigo -
Edward K. Ream