
Hi there,

This is a status report and a "what to do next" summary.

The status is easily summarized as follows: the PyPy interpreter is quite complete and highly compliant with CPython, minus a number of dark corners that are only likely to bite user programs using the most introspective features of the standard library (most notably pickling). The flow graph and annotation subsystems work quite nicely too; we can successfully annotate more or less all of PyPy, or at least we are getting close to this goal.

The "next" step is low-level code generation. Here, we have a rather large number of prototypes all around. The most complete one is genc.py, which is able to produce code for roughly the complete PyPy interpreter, but without using the annotations -- i.e. it is very slow code, and it is essentially dependent on CPython. Approaching the same goal but from the opposite direction, we have genllvm.py, which is only able to translate fully annotated graphs and doesn't have any means of fall-back for things like faked objects which cannot be annotated. Finally there are other very incomplete or deprecated or planning-only back-ends: gencl, genpyrex, genjava. Not to mention geninterplevel, whose goal is still different.

The question is which line of work to focus on right now. All of these back-ends are interesting and worthwhile in the long run but we need to select a first one. There are basically 4 reasonable options:

* Enhance genc.py. This is a step-by-step process, and any intermediate version can still be tested against the whole PyPy and against the snippet examples. Another advantage of genc is that it is the only option that doesn't depend on any external tool other than a C compiler.

* Enhance genc.py using the C++ facility of function overloading for simplicity (basically, we would generate "z=add(x,y)" in the file and let the C++ compiler decide which version of add() to call based on the declared types of x and y). This might well be the easiest solution. A minor drawback is to require a C++ compiler. A possibly larger drawback is that the C++ compilation time might be considerably longer, even for similar-looking code. (Having to know C++ in the first place shouldn't be that big a drawback if we don't use fancy C++ features.)

* Go for genllvm.py. An obvious drawback is that we'd all have to install LLVM. The problem with genllvm right now is that it cannot make sense of unannotated code (or code containing the SomeObject annotation). We don't know yet for sure the quantity of such SomeObjects in the annotated PyPy source code, but a guess is that they occur mainly for "fake" stuff (file, long, unicode...). If so, there is one way around this problem.

  Carl pointed out that it *might* be easy to link the LLVM compiler output with CPython, possibly making a C extension module for CPython. If so, then we would add in genllvm support for "black box" PyObject pointers, and use a few functions from the CPython C-API to manipulate them. The goal here would be to modify the source code of the interpreter/module/objspace to reduce the number of operations that need to be performed on these "black boxes". For example, we could possibly reduce all these operations to method calls. In other words, we would say that using CPython objects like "long" at interp-level is temporarily OK but only if they are manipulated via method calls. This would make the genllvm support for them much easier.

* genjava.py could be another option. It has a simpler type system, which matches ours quite well, but genjava doesn't exist yet at all (the one in the java/ subdirectory had a different goal in mind). We get memory management for free. If we add the requirement to compile with GCJ it could be easy to make a CPython extension module too, with the same problems and solutions about SomeObject as above.

All in all, each of the 4 options is equally possible. If I have to pick one, I guess the first 2 options pass the additional criterion of "very good confidence that it will work within a couple of months". This is definitely biased by the fact that I'm not fluent in Java and know very little about LLVM, but also by my lack of knowledge about the ease of installation and integration of the corresponding tools.

Then to pick one of the first two options, the second one (allowing some C++ facilities to sneak in) is my favourite.

Comments and feed-back are welcome!

Armin
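To make the second option a little more concrete, here is a minimal Python sketch that models what the C++ compiler would do with a generated "z=add(x,y)": it looks at the declared types of x and y and selects the matching version of add(). The table of overloads, the type names and the function resolve_add() are all assumptions made up for this illustration; nothing here is taken from genc.py, and implicit conversions between mixed types are deliberately not modelled.

```python
# Hypothetical model of C++ overload resolution for a generated "z=add(x,y)".
# The overload table and all names below are illustrative assumptions.

OVERLOADS = {
    ("int", "int"): "int add(int, int)",
    ("PyObject*", "PyObject*"): "PyObject* add(PyObject*, PyObject*)",
}


def resolve_add(declared_types, x, y):
    """Return the add() overload selected by the declared types of x and y."""
    key = (declared_types[x], declared_types[y])
    if key not in OVERLOADS:
        # A real C++ compiler might apply implicit conversions here;
        # this sketch simply refuses mixed or unknown types.
        raise TypeError("no add() overload for %r" % (key,))
    return OVERLOADS[key]


declared = {"a": "int", "b": "int", "p": "PyObject*", "q": "PyObject*"}
print(resolve_add(declared, "a", "b"))
print(resolve_add(declared, "p", "q"))
```

The point of the option is that the generator itself would not need such a table: it would emit the same "add" call everywhere and leave the selection to the compiler.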

Hi Armin, hi all,

thanks for the good post and the listing of options! Let me add some quick thoughts and comments.

On Wed, Mar 30, 2005 at 18:25 +0100, Armin Rigo wrote:

Wouldn't this mean that we are barred from using "tcc" for testing/debugging purposes? To me the current No. 1 criterion for the choice of backends is development-testing/round-trip speed. And i would guess that LLVM and genc both fare better than C++ in that respect.

Also, it would be interesting to hear from Carl what the current state of the LLVM backend is (all tests pass for me, btw, and Carl seems to have made quite some progress judging from the passing tests and the generated .ll files alone).

cheers,

holger

hi all, On 30 Mar 2005 19:48:43 +0200, Holger Krekel wrote:
Yup, LLVM compiles surprisingly fast though I didn't test any big programs (On the other hand, I tried to compile CPython using the cfrontend of LLVM and it felt at least as fast as GCC, too).
Ok, short list of things that work and things that don't:

things that work:
=================

* ints, bools
* lists (including range, some methods like append, pop, reverse)
* strings (implemented as lists of chars)
* classes with inheritance, basic polymorphism, isinstance checks

things that don't work but should be (relatively) easy:
=======================================================

* exceptions (I'll tackle those next weekend)
* multiple inheritance -- should be easy since only mixins are allowed
* iterators
* class attributes
* floats
* variable argument functions -- haven't thought about those
* dicts -- it might be a bit more difficult to get them fast, though
* SomeObjects: I'm now sure that I can link against CPython so maybe it's actually easy to support SomeObjects

things that don't work and are probably complicated:
====================================================

* garbage collection -- ouch: There are GC hooks in LLVM but there is no GC implemented. Some group of students tried implementing a GC at the end of last year but they haven't written to LLVM-dev since then.
* I'm sure there is more but I can't think of it now

Some more points: LLVM seems to be even more difficult to compile under Windows. I didn't track that in all detail but a while ago there were quite some problems.

Furthermore I'm not sure whether there are that many advantages of LLVM over C (or some sort of minimal C++) at this point. Most if not all of the things I do with genllvm can be done in pretty much the same way with C and just as easily (maybe except the stack unwinding for exceptions), especially if the C++ features Armin mentions are used. LLVM probably optimizes the code better but if that's the issue we could use genc and compile with llvm-gcc. The real strengths of LLVM come into play when the code is generated dynamically -- which is not really the point at the moment, right?

All in all I can't really tell whether it's a good idea to make genllvm the 'first' backend.

Regards,

Carl Friedrich

Hi, On Wed, Mar 30, 2005 at 07:48:43PM +0200, holger krekel wrote:
That's an excellent point. I played with a more promising approach in http://codespeak.net/svn/pypy/dist/pypy/translator/typer.py . Basically, instead of enhancing genc to support all kinds of typed operations and implicit conversions (or relying on C++ to select the operations and insert these conversions automatically), the above module contains code that modifies the flow graph itself to turn it into a "low-level" flow graph. The idea was already floating around here. In short it turns code like z = add(x, y) into z = intadd(x, y) if x and y are SomeIntegers, and it inserts int2obj() and obj2int() operations to convert variables that are SomeIntegers but used in operations that can't be special-cased (most of them, right now).

The idea is then that genc only needs minor updates to give various C types to the variables. The operations like intadd() can be defined as a macro in genc.h, just like all the other operations.

The module is called "typer" because I guess that a clean solution would involve a dict that maps Variables and individual Constants to their C type, instead of relying implicitly on the SomeXxx annotations to mean particular C types.

A bientôt,

Armin
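The graph rewriting described above can be illustrated with a small, self-contained Python sketch. It is not the code of typer.py: the Op class, the SPECIALIZED table and specialize_ops() are hypothetical stand-ins, and a "graph" is reduced to a flat list of operations. Only the operation names add, intadd, int2obj and obj2int come from the message; mod is used as an example of an operation that is not special-cased.

```python
from dataclasses import dataclass

# Hypothetical table of operations that have an int-specialized version.
SPECIALIZED = {"add": "intadd"}


@dataclass
class Op:
    opname: str
    args: tuple
    result: str


def specialize_ops(ops, int_vars):
    """Rewrite ops: use the int version where one exists, otherwise wrap
    int variables in explicit int2obj()/obj2int() conversions."""
    out = []
    fresh = 0  # counter for temporary "object" variables
    for op in ops:
        if op.opname in SPECIALIZED and all(a in int_vars for a in op.args):
            out.append(Op(SPECIALIZED[op.opname], op.args, op.result))
            continue
        # Generic operation: box every int argument first.
        boxed_args = []
        for a in op.args:
            if a in int_vars:
                tmp = "obj%d" % fresh
                fresh += 1
                out.append(Op("int2obj", (a,), tmp))
                boxed_args.append(tmp)
            else:
                boxed_args.append(a)
        if op.result in int_vars:
            # The result is expected as an int: unbox it afterwards.
            tmp = "obj%d" % fresh
            fresh += 1
            out.append(Op(op.opname, tuple(boxed_args), tmp))
            out.append(Op("obj2int", (tmp,), op.result))
        else:
            out.append(Op(op.opname, tuple(boxed_args), op.result))
    return out


graph = [Op("add", ("x", "y"), "z"), Op("mod", ("z", "y"), "w")]
for op in specialize_ops(graph, int_vars={"x", "y", "z", "w"}):
    print("%s = %s(%s)" % (op.result, op.opname, ", ".join(op.args)))
```

In this toy run the add becomes an intadd, while the mod keeps its generic form and gets conversions inserted before and after it, so a back-end would only have to give a C type to each variable.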

Hi Armin, On Thu, Mar 31, 2005 at 09:29 +0100, Armin Rigo wrote:
Actually i woke up this morning with exactly this idea in mind :-) In other words, i agree that this is probably the cleanest/nicest way to go and share code between the backends.
I was wondering if it makes sense for such conversions to be determined at the flowgraph-level and conversion operations to be inserted accordingly? This could be a transformation that is specific to genc, genc++ and genjava with Jython (or even genllvm + cpython-bindings) ... in short almost for all the current or envisioned backends :-)
Yes, and in addition, with the conversions done at the flowgraph level, the genc backend should become pretty simple.
good point. cheers, holger

Hi Holger, On Thu, Mar 31, 2005 at 02:00:28PM +0200, holger krekel wrote:
Actually i woke up this morning with exactly this idea in mind :-)
:-)
I was wondering if it makes sense for such conversions to be determined at the flowgraph-level and conversion operations to be inserted accordingly?
Yes, that's what typer.py does already. Example:

    from pypy.translator.translator import Translator
    from pypy.translator.typer import specialize
    from pypy.translator.test.snippet import my_gcd as fn

    t = Translator(fn)
    t.simplify()
    a = t.annotate([int, int])
    t.checkgraphs()
    specialize(a)
    t.checkgraphs()
    t.view()

This example shows both the is_true() -> int_is_true() transformation, and the mod() operation which isn't recognized yet by specialize() and thus gets explicit conversion operations inserted before and after.

It is indeed back-end specific which operations can be specialized and how, but it is done at the graph level.

A bientot,

Armin

Armin Rigo wrote:
Seems to be a nice idea, because it removes the burden of handling such things from the generators. I always wondered btw., why you have the annotation data elsewhere. I would have augmented it to the block structure. Changing the flow graph to contain the info is even more rude, but very nice! ciao - chris -- Christian Tismer :^) <mailto:tismer@stackless.com> tismerysoft GmbH : Have a break! Take a ride on Python's Johannes-Niemeyer-Weg 9A : *Starship* http://starship.python.net/ 14109 Berlin : PGP key -> http://wwwkeys.pgp.net/ work +49 30 802 86 56 mobile +49 173 24 18 776 fax +49 30 80 90 57 05 PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04 whom do you want to sponsor today? http://www.stackless.com/

Hi Christian, On Thu, Mar 31, 2005 at 06:58:40PM +0200, Christian Tismer wrote:
The annotator's own flow-and-reflow nature makes it best to dissociate the data it (mis-)handles from the immutable inputs it gets, but you're right: the typer.py could probably just stick type information on the Variable and Constant objects of the flow graph. This probably makes even more sense given that there is no sane other way to put type information on constants (e.g. this "1" here is an int but that "1" over there is a PyObject). Armin
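A minimal sketch of what "sticking type information on the Variable and Constant objects" could look like is given below. The classes are simplified stand-ins written for this illustration, not PyPy's flow-graph classes, and the ctype attribute and c_literal() helper are assumptions. PyInt_FromLong() is the CPython 2 C-API call for building an int object.

```python
class Variable:
    """Simplified stand-in for a flow-graph variable."""

    def __init__(self, name, ctype=None):
        self.name = name
        self.ctype = ctype  # hypothetical attribute: C type chosen by the typer


class Constant:
    """Simplified stand-in for a flow-graph constant."""

    def __init__(self, value, ctype=None):
        self.value = value
        self.ctype = ctype  # hypothetical attribute, set per occurrence


def c_literal(const):
    """Return the C expression a back-end might emit for this constant."""
    if const.ctype == "int":
        return str(const.value)
    if const.ctype == "PyObject*":
        return "PyInt_FromLong(%d)" % const.value
    raise ValueError("constant has no known C type: %r" % (const.ctype,))


# The same Python value 1, typed differently at two places in a graph.
one_here = Constant(1, ctype="int")
one_there = Constant(1, ctype="PyObject*")
print(c_literal(one_here))
print(c_literal(one_there))
```

Because the type lives on each Constant object rather than in a table keyed by value, two occurrences of the same value can carry different C types.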

Armin Rigo <arigo@tunes.org> writes:
* Enhance genc.py using the C++ facility of function overloading for simplicity
* Go for genllvm.py.
* genjava.py could be another option.
A question that springs to mind is: How much work do these various options share? (It seems like your typer.py is exactly this kind of work). Also, I think that the benefits of function overloading are pretty minor for the cost of demanding a c++ compiler (although you could probably use cfront and tcc for that :). Cheers, mwh -- Whaaat? That is the most retarded thing I have seen since, oh, yesterday -- Kaz Kylheku, comp.lang.lisp

Michael Hudson wrote:
125 % seconded. -- Christian Tismer

Armin Rigo wrote:
We would get gc, thread support, a runtime and useful libraries (unicode, big integers ...), and an object model for free. How much of that, and with what stability, if we go through gcj is a bit open, although for the target of self-hosting that would be the interesting route.

Java has no gotos, which means that at some point we would have to generate bytecode, which is not too hard, but sometimes making the java verifier happy is harder than it could seem. The type system should match but there are no pointers to functions or delegates, which means some more involved solution to emulate them. We could probably reuse some things or ideas that are in Jython.

Whether the things we would get for free and the type system and basic object model are really a good match for the code we can easily generate is also an open question. So it probably makes sense as a platform to try long term, and surely when we have made even more progress in annotating our codebase. Also because reusing java or jython stuff instead of trying to interface with CPython is probably easier because of ref-counting vs gc issues.

But it is worth keeping in mind as a reserve route, because its trade-offs come together with quite a bit of high level functionality already there.

But I agree that genc, especially with an approach involving incrementally rewriting the graphs, is the most natural route right now.

Samuele.

Hi Günter, On Fri, Apr 01, 2005 at 11:13:26PM +0200, Günter Jantzen wrote:
C# is very similar to Java and supports 'goto'. Could this be an option?
Yes, it is definitely an option, and it seems likely that C# and/or .NET are very good medium- or long-term targets -- but we have too many options for now :-) For the short-term we have settled on gradually making genc.py more type-aware. Armin

participants (8): Armin Rigo, Carl Friedrich Bolz, Christian Tismer, Günter Jantzen, holger krekel, hpk@trillke.net, Michael Hudson, Samuele Pedroni