Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Hi all,

after the completion of our student project, I have enough experience to say something more. We wrote in C an interpreter for a Python subset and we could make it much faster than the Python interpreter (50%-60% faster). That was due to the usage of indirect threading, tagged pointers and unboxed integers, and a real GC (a copying one, which is enough for the current benchmarks we have - a generational GC would be more realistic but we didn't manage to do it).

While we lack some runtime checks, conditional branches are often almost free if you have to do an indirect branch soon and likely flush the pipeline anyway, especially if just one of the possibilities lies on the fast path. That means that runtime checks for invalid code have the cost of a normal instruction and not the cost of a conditional branch (which can be much more expensive). I obviously do not expect everybody to believe me on this. So, if you prefer, ignore this "vaporware" and just consider my proposals.

In fact, we are at least 50% faster on anything we can run, but also on this benchmark, with the usage of unboxed integers and tagged pointers (we tag pointers with 1 and integers with 0, like V8 does and SELF did, so you can add integers without untagging them):

def f():
    y = 0
    x = 10000000
    while x:
        x -= 1
        y += 3
    return y - 29999873

And since we do overflow checking (by checking EFLAGS in assembly, even if the code could be improved even more, probably), I don't think a comparison on this is unfair in any way. Also, when adding overflow checking we didn't notice any important slowdown. I think that's enough to say that we are faster also because of design choices and implementation quality, and the V8 lead developer, Lars Bak, our professor, had a similar feeling.

On Mon, Nov 17, 2008 at 15:05, Antonio Cuni <anto.cuni@gmail.com> wrote:
Paolo Giarrusso wrote:
specialized bytecode can be significant, I guess, only if the interpreter is really fast (either a threaded one, or a code-copying one). Is the PyPy interpreter threaded?
some time ago I tried to measure if/how much we can gain with a threaded interpreter. I manually modified the produced C code to make the main loop threaded, but we didn't gain anything; I think there are three possible reasons:
1) in Python a lot of opcodes are quite complex and time-consuming,
That's wrong for a number of reasons - the most common opcodes are probably loads from the constant pool, and loads and stores to/from the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed the dispatch after the LOAD_FAST opcode. And those opcodes are quite simple.

The hotspot used to be LOAD_CONST, but to optimize it, I just had to save the pointer to the constant pool. Instead of accessing current_function->co_consts, I save it into a local, which can be cached in a register even across function calls:

raw_array *current_function_consts = current_function->co_consts;

That made for a 4% speedup on our benchmarks. And we didn't even write a microbenchmark for LOAD_CONST. This is the handling code:

LABEL_BEGIN(LOAD_CONST);
PUSH(raw_array_skip_header(current_function_consts)[parameter]);
LABEL_END(LOAD_CONST);

(LABEL_BEGIN/LABEL_END bracket all opcode cases; they expand to labels for indirect threading and to the dispatch code itself.) After seeing this, you can see why one extra memory load makes a difference.
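To make the mechanism concrete, here is a minimal, self-contained sketch of how LABEL_BEGIN/LABEL_END-style macros can expand into computed-goto (indirect-threaded) dispatch with GCC; the opcode names and details are made up for illustration, this is not our actual code:

/* Illustrative only: one way LABEL_BEGIN/LABEL_END-style macros can expand
   into GCC computed-goto dispatch.  A real interpreter adds argument
   fetching, stack checks, etc. */
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

static long run(const unsigned char *pc)
{
    /* one label address per opcode; &&label is a GCC extension */
    static void *dispatch[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
    long stack[32], *sp = stack;

#define DISPATCH() goto *dispatch[*pc++]   /* the per-opcode "trampoline" */

    DISPATCH();
op_push1:  *sp++ = 1;                DISPATCH();
op_add:    sp--; sp[-1] += sp[0];    DISPATCH();
op_print:  printf("%ld\n", sp[-1]);  DISPATCH();
op_halt:   return sp > stack ? sp[-1] : 0;
}

int main(void)
{
    unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    return run(prog) == 2 ? 0 : 1;   /* prints 2, exits 0 */
}

Compared to a switch-based loop, every handler ends in its own indirect jump, so the branch predictor can exploit correlations between opcode pairs instead of funnelling everything through a single, hard-to-predict jump.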
so the time spent to dispatch to them is a little percentage of the total time spent for the execution
That's your problem - threading helps when you spend most of the time on dispatch, and efficient interpreters get to that point. Threading helps even Prolog interpreters, like Yap, so it can help even for Python. Also, according to the measurements in "The Structure and Performance of Efficient Interpreters", efficient interpreters tend to execute no more than 30 instructions per opcode on their RISC machine simulator. The Python interpreter could still maybe be considered efficient, since it uses ~30 instructions per opcode, but that's on a CISC processor (i.e. my laptop), so the numbers are not directly comparable. I'll maybe measure the ratio for the interpreters they benchmark again on my laptop, to do a fair comparison.
2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc.
No, you don't need that, and not even CPython does it. For exception handling, just _when an exception is thrown_, you can take the current opcode index and find the last executed line with a lookup table. Their source code has a slowpath for tracing support, but that's just one conditional jump to a slowpath, which is free in any case if the predictor guesses it right (and it should). So, can you give some more examples?
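To illustrate the claim, such a lookup table can be as simple as the following (a made-up stand-in for CPython's line-number table; it is consulted only on the slow path, never in the dispatch loop):

/* Hypothetical offset->line table, kept sorted by bytecode offset. */
#include <stddef.h>

typedef struct {
    int offset;   /* first bytecode offset belonging to the source line */
    int line;     /* the source line number */
} line_entry;

static int addr_to_line(const line_entry *tab, size_t n, int lasti)
{
    int line = -1;
    for (size_t i = 0; i < n && tab[i].offset <= lasti; i++)
        line = tab[i].line;   /* last entry whose offset is <= lasti */
    return line;
}

int main(void)
{
    static const line_entry tab[] = { {0, 10}, {6, 11}, {14, 13} };
    return addr_to_line(tab, 3, 9) == 11 ? 0 : 1;  /* offset 9 -> line 11 */
}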
This means that the trampolines at the end of each opcode contain a lot of code duplication, leading to a bigger main loop, with possibly bad effects on the cache (didn't measure this, though)
If the interpreter loop is able to overflow the Icache, that should be fought through __builtin_expect first, to give hints for jump prediction and lay out slowpaths out-of-line.
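Concretely, the usual wrappers around GCC's __builtin_expect look like this (the macro names are just the conventional ones; this is a generic illustration, not code from our interpreter):

/* Conventional GCC hint macros: mark the expected (fast) outcome so the
   compiler lays the slow path out-of-line. */
#include <stdio.h>
#include <limits.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Toy example: the overflow branch is hinted as cold. */
static long safe_inc(long x)
{
    if (unlikely(x == LONG_MAX)) {
        fprintf(stderr, "overflow\n");   /* cold path, placed out-of-line */
        return 0;
    }
    return x + 1;                        /* hot path, falls through */
}

int main(void)
{
    printf("%ld\n", safe_inc(41));       /* prints 42 */
    return 0;
}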
3) it's possible that I did something wrong, so in that case my measurements are completely useless :-). If anyone wants to try again, it cannot hurt.
Well, my plan is first to try, at some point, to implement threading in the Python interpreter and benchmark the difference - it shouldn't take long but it has a low priority currently. Regards -- Paolo Giarrusso

Hi Paolo,

Paolo Giarrusso wrote:
after the completion of our student project, I have enough experience to say something more. We wrote in C an interpreter for a Python subset and we could make it much faster than the Python interpreter (50%-60% faster). That was due to the usage of indirect threading, tagged pointers and unboxed integers, and a real GC (a copying one, which is enough for the current benchmarks we have - a generational GC would be more realistic but we didn't manage to do it).
Interesting, but it sounds like you are comparing apples to oranges. What sort of subset of Python are you implementing, i.e. what things don't work? It has been shown time and time again that implementing only a subset of Python makes it possible to get interesting speedups compared to CPython. Then, as more and more features are implemented, the difference gets smaller and smaller. This was true for a number of Python implementations (e.g. IronPython).

I think to get really meaningful comparisons it would be good to modify an existing Python implementation and compare that. Yes, I know this can be a lot of work.

On your actual techniques used I don't have an opinion. I am rather sure that a copying GC helped performance – it definitely did for PyPy. Tagged pointers make PyPy slower, but then, we tag integers with 1, not with 0. This could be changed, wouldn't even be too much work.

About better implementations of the bytecode dispatch I am unsure. Note however, that a while ago we did measurements to see how large the bytecode dispatch overhead is. I don't recall the exact number, but I think it was below 10%. That means that even if you somehow manage to reduce that to no overhead at all, you would still only get 10% performance win.

Cheers, Carl Friedrich

On Sun, Dec 21, 2008 at 19:13, Carl Friedrich Bolz <cfbolz@gmx.de> wrote:
Hi Paolo,
Paolo Giarrusso wrote:
after the completion of our student project, I have enough experience to say something more. We wrote in C an interpreter for a Python subset and we could make it much faster than the Python interpreter (50%-60% faster). That was due to the usage of indirect threading, tagged pointers and unboxed integers, and a real GC (a copying one, which is enough for the current benchmarks we have - a generational GC would be more realistic but we didn't manage to do it).
Interesting, but it sounds like you are comparing apples to oranges. What sort of subset of Python are you implementing, i.e. what things don't work? It has been shown time and time again that implementing only a subset of Python makes it possible to get interesting speedups compared to CPython. Then, as more and more features are implemented, the difference gets smaller and smaller. This was true for a number of Python implementations (e.g. IronPython).
Yes, I remember - that's part of the motivation of your project, or something like that. Still, I feel there is nothing left to add for the arithmetic loop example; I forgot to note that, beyond overflow checking, we also support operator overloading; i.e. this is our BINARY_ADD implementation:

LABEL_BEGIN(BINARY_ADD);
{
    tagged_el_t a = POP;
    tagged_el_t b = POP;
    if (!(unlikely(is_pointer(a)) || unlikely(is_pointer(b)))) {
        long sum;
        if (likely(!overflow_checked_add(&sum, TAGGED_EL_RAW_VALUE(a),
                                         TAGGED_EL_RAW_VALUE(b)))) {
            PUSH(raw_long_to_tagged_el(sum));
        } else {
            //XXX box to unlimited precision integers.
            PUSH(long_to_tagged_el(INT_MAX));
        }
    } else {
        Object *bp = get_ptr(b);
        Object *(*add)(Object *, Object *) = bp->type->add;
        if (add != NULL) {
            PUSH(ptr_to_tagged_el(add(get_ptr(b), get_ptr(a))));
        } else {
            assert(false);
        }
    }
}
LABEL_END(BINARY_ADD);

As you can see, the slowpaths still need a lot of implementation work (we don't support __radd__ for instance), but I tried to ensure that nothing is missing from the fastpath I'm benchmarking. That's why overflow checking is implemented even if overflow is not handled properly (and the given example does _not_ overflow).
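For completeness, here is one portable way the overflow_checked_add interface could be implemented; our real version does the add and then inspects EFLAGS in inline assembly, so take this purely as a sketch of the interface:

/* Sketch only: returns true on overflow, otherwise stores a + b in *sum. */
#include <stdio.h>
#include <stdbool.h>
#include <limits.h>

static inline bool overflow_checked_add(long *sum, long a, long b)
{
    if ((b > 0 && a > LONG_MAX - b) || (b < 0 && a < LONG_MIN - b))
        return true;          /* would overflow, *sum left untouched */
    *sum = a + b;
    return false;
}

int main(void)
{
    long s;
    if (!overflow_checked_add(&s, 40, 2))
        printf("%ld\n", s);                        /* prints 42 */
    if (overflow_checked_add(&s, LONG_MAX, 1))
        printf("overflow detected\n");
    return 0;
}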
I think to get really meaningful comparisons it would be good to modify an existing Python implementation and compare that. Yes, I know this can be a lot of work.
Yes, I'm aware of that, and that's why I plan to test threading in Python.
On your actual techniques used I don't have an opinion. I am rather sure that a copying GC helped performance – it definitely did for PyPy. Tagged pointers make PyPy slower, but then, we tag integers with 1, not with 0. This could be changed, wouldn't even be too much work.
The major problem in changing it is maybe converting pointers, since in your representation pointers can be used directly (it depends on your macros). But since offsetting a pointer is directly supported by the processor, it shouldn't have any major cost (except one extra byte of generated code, I guess, for each pointer access).
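To make the trade-off concrete, this is roughly what the two tagging layouts look like in C (simplified, made-up macros; the real ones carry more type information):

/* Simplified, made-up macros contrasting the two layouts. */
#include <stdint.h>

/* Scheme A (ours, like V8/Self): integers tagged 0, pointers tagged 1.
   Tagged integers add directly; pointer access needs a -1 adjustment,
   which folds into the addressing mode. */
#define A_INT_TO_TAGGED(i)  ((intptr_t)(i) << 1)          /* tag bit = 0 */
#define A_TAGGED_TO_INT(t)  ((intptr_t)(t) >> 1)
#define A_PTR_TO_TAGGED(p)  ((intptr_t)(p) | 1)           /* tag bit = 1 */
#define A_TAGGED_TO_PTR(t)  ((void *)((intptr_t)(t) - 1))
#define A_IS_PTR(t)         ((intptr_t)(t) & 1)

/* Scheme B (PyPy's current choice): integers tagged 1, pointers tagged 0.
   Pointers are usable as-is; integer arithmetic must untag and retag. */
#define B_INT_TO_TAGGED(i)  ((((intptr_t)(i)) << 1) | 1)
#define B_TAGGED_TO_INT(t)  ((intptr_t)(t) >> 1)
#define B_TAGGED_TO_PTR(t)  ((void *)(t))
#define B_IS_INT(t)         ((intptr_t)(t) & 1)

int main(void)
{
    /* With scheme A, tagged + tagged is already the tagged sum:
       (x<<1) + (y<<1) == (x+y)<<1, so no untagging on the fast path. */
    intptr_t x = A_INT_TO_TAGGED(20), y = A_INT_TO_TAGGED(22);
    return A_TAGGED_TO_INT(x + y) == 42 ? 0 : 1;
}

With scheme B the roles are swapped: pointer dereferences need no adjustment, but integer arithmetic pays the untag/retag cost, so which choice wins depends on the workload.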
About better implementations of the bytecode dispatch I am unsure. Note however, that a while ago we did measurements to see how large the bytecode dispatch overhead is. I don't recall the exact number, but I think it was below 10%. That means that even if you somehow manage to reduce that to no overhead at all, you would still only get 10% performance win.
Sure - but my point, discussed in the mail, and taken from the paper I mentioned, is that in an efficient interpreter, bytecode dispatch becomes much more important than the other activities. Don't believe me, believe that paper (which is quite respected): they report speedups from threading of up to 2x :-), on efficient interpreters.

Also, Antonio Cuni gave just one example of additional overhead for dispatch: source-line tracking, for exception handling I guess. And as I explained, as far as I understand that's simply a mistake, even compared to CPython - so I think there may be other low-hanging fruit in the interpreter. But I'm still waiting for answers and further explanations on this point.

Regards -- Paolo Giarrusso

Carl Friedrich Bolz wrote:
About better implementations of the bytecode dispatch I am unsure. Note however, that a while ago we did measurements to see how large the bytecode dispatch overhead is. I don't recall the exact number, but I think it was below 10%.
I think it's something more. There is the 'rbench' module that contains geninterpreted versions of both richards and pystone; IIRC, last time I tried they were ~50% faster than their interpreted counterparts, on both pypy-c and pypy-cli. Of course, with geninterp you remove more than just the interpretation overhead, as e.g. locals are stored on the stack instead of in a frame.

ciao, Anto

Paolo Giarrusso wrote:
Hi all,
Hi!
after the completion of our student project, I have enough experience to say something more. We wrote in C an interpreter for a Python subset and we could make it much faster than the Python interpreter (50%-60% faster). That was due to the usage of indirect threading, tagged pointers and unboxed integers, and a real GC (a copying one, which is enough for the current benchmarks we have - a generational GC would be more realistic but we didn't manage to do it).
That's interesting but says little about the benefit of threaded interpretation itself, as the speedup could be given by the other optimizations. For example, I suspect that for the benchmark you showed most of the speedup is because of tagged pointers and the better GC.

Is it possible to make your eval loop non-threaded? Measuring the difference with and without indirect threading could give a good hint of how much you gain with it.

What kind of bytecode do you use? The same as CPython or a custom one? E.g. I found that if we want to handle the EXTENDED_ARG CPython opcode properly, it is necessary to insert a lot of code before jumping to the next opcode.

Moreover, tagging pointers with 1 helps a lot for numerical benchmarks, but it is possible that it causes a slowdown for other kinds of operations. Do you have non-numerical benchmarks? (though I know that it's hard to get a fair comparison, because the Python object model is complex and it's not easy to write a subset of it in a way that is not cheating)

Finally, as Carl said, it would be nice to know which kind of subset it is. E.g. does it support exceptions, sys.settrace() and sys._getframe()?
In fact, we are at least 50% faster on anything we can run, but also on this benchmark, with the usage of unboxed integers and tagged pointers (we tag pointers with 1 and integers with 0, like V8 does and SELF did, so you can add integers without untagging them):
def f():
    y = 0
    x = 10000000
    while x:
        x -= 1
        y += 3
    return y - 29999873
And since we do overflow checking (by checking EFLAGS in assembly, even if the code could be improved even more, probably), I don't think a comparison on this is unfair in any way.
is your subset large enough to handle e.g. pystone? What is the result?
1) in Python a lot of opcodes are quite complex and time-consuming,
That's wrong for a number of reasons - the most common opcodes are probably loads from the constant pool, and loads and stores to/from the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed the dispatch after the LOAD_FAST opcode.
if you do benchmarks like the one shown above, I agree with you. If you consider real-world applications, unfortunately there is more than LOAD_CONST and LOAD_FAST: GETATTR, SETATTR, CALL, etc. are all much more time-consuming than LOAD_{FAST,CONST}
That's your problem - threading helps when you spend most of the time on dispatch, and efficient interpreters get to that point.
the question is: is it possible for a full python interpreter to be "efficient" as you define it?
2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc.
No, you don't need that, and not even CPython does it. For exception handling, just _when an exception is thrown_, [cut]
Sorry, I made a typo: what needs to be remembered is the last *bytecode* executed, not the last line. This is necessary to implement sys.settrace() properly. I never mentioned exception handling; that was your (wrong :-)) guess.
If the interpreter loop is able to overflow the Icache, that should be fought through __builtin_expect first, to give hints for jump prediction and lay out slowpaths out-of-line.
I think that Armin tried once to use __builtin_expect, but I don't remember the outcome. Armin, what was it?
Well, my plan is first to try, at some point, to implement threading in the Python interpreter and benchmark the difference - it shouldn't take long but it has a low priority currently.
That would be cool, tell us when you have done it :-).

On Sun, Dec 21, 2008 at 23:49, Antonio Cuni <anto.cuni@gmail.com> wrote:
Paolo Giarrusso wrote:
Thanks for the interesting mail. It really goes into detail about the issues I raised, and it does suggest what I needed: stuff needing support inside the interpreter fast path, if just for unlikely branches. I started the required coding, but it will take a while (also because I have little time, given Christmas). But I'm answering with some preliminary results. First, I'd answer this question:
That's your problem - threading helps when you spend most of the time on dispatch, and efficient interpreters get to that point.
the question is: is it possible for a full python interpreter to be "efficient" as you define it?
Well, my guess is "if Prolog, Scheme and so on can, why can't Python?" I know this is an interesting question, and my colleague and I tried to get an answer during the past two months. And it wasn't really two months for two people; I would say much less time was actually spent. So, for instance, we don't have support for modules or exception handling, and the basic support for objects (through Self/V8-style maps) is not yet complete (but almost).

Then, for a couple of points, it simply isn't possible. Unboxed integers work better with the Python 3.x specs - if you have them and you still want a 32-bit integer which is boxed to be an int instead of a long, you get some problems. Also, the semantics of Python destructors are too restrictive for a proper GC, and they can create unreclaimable reference cycles (when all the objects involved have destructors). In fact Java, unlike C++, has finalizers instead - the difference is that you can finalize an object after finalizing the objects it points to. That's why I prefer not to say "Python finalizers" like the official docs do. However, in both cases I feel that the semantics are wrong. Even finalizers are a bad idea - it's interesting that .NET is even more misguided than Java about them (and that's the opinion of Hans Boehm, not mine).

So, those semantics only fit reference counting well - and reference counting, in turn, makes support for multithreading impossible (the patches for "free threading", without the global interpreter lock, gave a 2x slowdown I think). Somebody on the python-dev ML realized that CPython simply gets both things wrong (reference counting and threading support), but most people went on to say "oh, but people shouldn't use threading anyway, it's too complicated to get right, they should use multiprocess stuff...". That was one of the most (negatively) astonishing things I saw. Oh no, the worst was something like "oh, we should allow using more interpreters in the same process, but we'd have to move lots of statics to the interpreter context, and it's a lot of work so it's not worth it". Since I've participated in a community where bigger patches are committed every week, this looks like plain laziness; it could be because far fewer people are paid to work on Python, or maybe because the general community mood is just that. In fact, one of the reasons I started this project _at all_ so early is that the Python community clearly shows _ignorance_ of years of VM research and pays little attention to VM performance.

== Benchmarks about dispatch overhead ==

And while you don't look like that, the mention of "tracking the last line executed" seemed quite weird. And even tracking the last bytecode executed looks weird, even if maybe it is not. I'm inspecting CPython's Python/ceval.c, and the overhead for instruction dispatch looks comparable. The only real problem I'm getting right now is committing the last bytecode executed to memory. If I store it into a local, I have no problem at all; if I store it into the interpreter context, it's a store to memory, so it hurts performance a lot - I'm still wondering about the right road to go.

Runtime of the while benchmark, with 10x more iterations, user time:

- baseline (now, with tracking of f_lasti inside a local): ~4.6s
- add ->f_lasti tracking _into memory_: ~5.6s
- disable indirect threading: ~5.6s
- both slowdowns together: ~7s
- Python 2.5.1: ~11.2s

Other benchmarks use fatter opcodes, so the impact is still visible, but it is lower relative to the overall time.
Even if f_lasti must be updated all the time, could it be stored to memory just on some occasions? Is it needed during the call to settrace(), or only by the next opcode after that, i.e. after the return to the interpreter loop? The second case would be enough. Storing the variable to memory just for that would be simply excellent for performance, maybe even in the stock Python interpreter.

Note: all this is on a 64-bit box, with 16 registers; we're also faster on a 32-bit one, and in some cases the difference wrt. Python is bigger on 32-bit (on the GC stress test, for instance, due to smaller pointers to update, and on the recursion test, maybe for the same reason when using the stack, since the stack element size is 64 bits on 64-bit machines), while on the others the extra registers (I guess) give the 64-bit interpreter an advantage wrt. Python (obviously, the comparison is always with Python on the same machine and architecture). See slide 8 of pres.pdf.
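To spell out the pattern I have in mind (field and variable names are made up; this is neither CPython's code nor ours):

/* Sketch: track the last opcode index in a local (i.e. a register) and
   commit it to the heap-allocated frame only on slow paths, where it is
   actually observable. */
#include <stdio.h>

typedef struct {
    int f_lasti;
} frame_t;

static int run(frame_t *f, const unsigned char *code, int len)
{
    int lasti = 0;                  /* lives in a register */
    int acc = 0;
    for (int pc = 0; pc < len; pc++) {
        lasti = pc;                 /* cheap: register-to-register */
        switch (code[pc]) {
        case 0: acc += 1; break;    /* fast opcodes never touch memory */
        case 1:
            f->f_lasti = lasti;     /* slow path: make it observable */
            printf("tracing at opcode %d\n", f->f_lasti);
            break;
        }
    }
    f->f_lasti = lasti;             /* commit once when leaving the loop */
    return acc;
}

int main(void)
{
    frame_t f = { 0 };
    unsigned char code[] = { 0, 0, 1, 0 };
    int acc = run(&f, code, 4);
    printf("acc = %d, f_lasti = %d\n", acc, f.f_lasti);
    return 0;
}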
That's interesting but says little about the benefit of threaded interpretation itself, as the speedup could be given by the other optimizations. For example, I suspect that for the benchmark you showed most of the speedup is because of tagged pointers and the better GC.
Is it possible to make your eval loop non-threaded? Measuring the difference with and without indirect threading could give a good hint of how much you gain with it.
Ok, just done it, the speedup given by indirect threading seems to be about 18% (see also above). More proper benchmarks are needed though. And as you say in the other mail, the overhead given by dispatch is quite more than 50% (maybe).

Am I correct in assuming that "geninterpret"ing _basically_ pastes the opcode handlers together? I guess with your infrastructure, you can even embed easily the opcode parameters inside the handlers, it's just a trivial partial evaluation - I tried to apply code copying of machine code to my interpreter, but I would have had to keep parameter fetching separate (getting the output I needed from GCC was not easy - I could make code copying work just for an early prototype). About keeping variables on the stack instead of in the frame, that's even stranger to me, given this argument.
What kind of bytecode do you use? The same as CPython or a custom one? E.g. I found that if we want to handle the EXTENDED_ARG CPython opcode properly, it is necessary to insert a lot of code before jumping to the next opcode.
We handle a subset of Python's opcodes, but what you say means that we'll have to handle EXTENDED_ARG properly, because of our mission. Oh, done, I would say (tests needed though; a sketch of the idea follows at the end of this message). I think that "lot of code" can only refer to the requirement of tracking the last bytecode executed, but I'll have to look at your sources and CPython's for that.

That seems to have an impact of a few percentage points (like 4-5% slower than before), but I have to do real benchmarking for this stuff. I fear I'll have to calculate confidence intervals to give any meaningful number (I never needed that when comparing us with CPython), because the impact seems to be under the statistical noise. However, just having the code laid out the wrong way can make it much worse. As soon as I wrote it, it almost doubled the runtime of the arithmetic benchmark (while it had a much lower impact on the other). It seems that was because GCC decided that HAS_ARG(opcode) had become unlikely; adding "likely" fixed it back.

In fact, one of the reasons we are so fast is that we started with a small interpreter and tracked down every performance regression as soon as it happened. We lost half of our performance, at some point, just because the address of the program_counter local variable escaped, so that variable had to be allocated on the stack and not in a register.

== Opcode set ==

We really wanted to have our own opcode set, but we didn't manage. In particular, I was pleased when I read about your LOAD_METHOD optimization. Also, invoking __init__ is a real mess; Java handles it much more nicely, and I don't see the point of not using JVM-style 2-step class construction:

new Class
dup
<arguments>
invokespecial Class.<init>

The existence of bound methods prevents simply pushing the receiver as the first argument on the stack; however, I can't see a similar argument for class construction, even if __init__ can be a bound method. Generating similar bytecode would avoid the need to rotate the stack. In Python, one would get something like:

ALLOC_INSTANCE Class  // <--- new opcode
DUP_TOP               // Copy the value to return
LOAD_METHOD __init__
<arguments>
INVOKE_METHOD
POP_TOP               // Remove the None returned by the constructor

However, the dispatch cost of the additional DUP_TOP and POP_TOP might be a problem. I guess such bytecode would for sure be more efficient to JIT-compile, but for actual interpretation, benchmarks would be needed.
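Here is the promised sketch of the EXTENDED_ARG idea (toy opcode numbering and encoding; only the 16-bit little-endian argument width mirrors CPython 2.x; illustration only, not our code):

/* The prefix opcode supplies the high 16 bits of the following opcode's
   16-bit argument. */
#include <stdio.h>

enum { OP_STOP = 0, OP_LOAD_CONST = 100, OP_EXTENDED_ARG = 143 };

static unsigned fetch16(const unsigned char **pc)
{
    unsigned v = (*pc)[0] | ((*pc)[1] << 8);   /* little-endian argument */
    *pc += 2;
    return v;
}

static void run(const unsigned char *pc)
{
    for (;;) {
        unsigned opcode = *pc++;
        if (opcode == OP_STOP)
            return;
        unsigned oparg = fetch16(&pc);
        if (opcode == OP_EXTENDED_ARG) {       /* rare: keep it off the fast path */
            opcode = *pc++;
            oparg = (oparg << 16) | fetch16(&pc);
        }
        printf("opcode %u, arg %u\n", opcode, oparg);
    }
}

int main(void)
{
    /* LOAD_CONST 5, then an EXTENDED_ARG-prefixed LOAD_CONST with arg 0x10002 */
    unsigned char code[] = {
        OP_LOAD_CONST, 5, 0,
        OP_EXTENDED_ARG, 1, 0, OP_LOAD_CONST, 2, 0,
        OP_STOP
    };
    run(code);
    return 0;
}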
Moreover, tagging pointers with 1 helps a lot for numerical benchmarks, but it is possible that it causes a slowdown for other kinds of operations. Do you have non-numerical benchmarks? (though I know that it's hard to get a fair comparison, because the Python object model is complex and it's not easy to write a subset of it in a way that is not cheating)
I agree with that. The choice follows the one made by the V8 people, and it was already tested in the Self system, but I'll answer you with actual tests. We have a couple of recursion tests, with a recursive function and a recursive method (though the latter is unfair). And a test with list access - we are ~50% faster even there.

For the Python object model, we have the opposite problem, because it's much more code to write and optimize: our INPLACE_ADD is equal to BINARY_ADD, so given a list l, a loop with "l += [1]" is able to kill us completely (since for now we create a new list each time). Indeed, "l = l + [1]" is just as slow with our interpreter, but gets much slower in Python. Even here we are still faster, but most of the time is spent adding lists, so we are just 15% faster. I guess we'd get more of an advantage if INPLACE_ADD were implemented properly, and that's not a big deal (a sketch of the idea is below).
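As a sketch of the missing piece (toy code, not our interpreter): the whole point of a real INPLACE_ADD on lists is to extend the left operand in place instead of allocating a new result list on every "l += [1]":

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    long *items;
    size_t len, cap;
} list_t;

static void list_extend(list_t *dst, const list_t *src)   /* l += other */
{
    if (dst->len + src->len > dst->cap) {
        dst->cap = (dst->len + src->len) * 2;
        dst->items = realloc(dst->items, dst->cap * sizeof(long));
    }
    memcpy(dst->items + dst->len, src->items, src->len * sizeof(long));
    dst->len += src->len;
}

int main(void)
{
    long one = 1;
    list_t l = { malloc(4 * sizeof(long)), 0, 4 };
    list_t rhs = { &one, 1, 1 };
    for (int i = 0; i < 5; i++)
        list_extend(&l, &rhs);        /* no new list per iteration */
    printf("len = %zu\n", l.len);     /* prints len = 5 */
    free(l.items);
    return 0;
}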
Finally, as Carl said, it would be nice to know which kind of subset it is.
I just need to find the time to write it down, sorry for the wait. I'm attaching our final presentation, the report we wrote for the course, and the _benchmarks_ we have. But basically, the subset we implemented is _much_ smaller.
E.g. does it support exceptions, sys.settrace() and sys._getframe()?
Exception handling is not supported, but stack unwinding should be possible without impact on the fastpath. We didn't know about the other two. I saw code which I guess is for handling settrace in CPython, but that's on the slowpath, so I think I should be able to easily simulate its impact by adding one unlikely conditional jump. Is it required to track the last bytecode even when tracing is not active, as asked above?

About _getframe(), I'm quite curious about how it works and how your support for it works. From the next mail you sent, it seems that you construct the needed frames during function calls (and I saw some documentation about them). Well, back in the Smalltalk age, developers found that constructing the frame object only when reflective access to the frame is needed is the only reasonable solution if performance is important. _If_ the frame object returned is modifiable, I still think one can intercept field modifications and modify the original stack to track the modified object. In fact, to get the current performance, we don't allocate a locals dictionary unless needed; and currently it seems that doing it only when STORE_NAME is invoked gives the correct semantics (and if that were wrong, computing a NEEDS_LOCAL_DICT flag during compilation would be trivial). Otherwise, the recursion test would be slower than the one in Python.

Also, we didn't bother with Unicode, and that's complicated to handle - even Unicode handling in V8 has problems, and its string operations are slower than in TraceMonkey.
is your subset large enough to handle e.g. pystone? What is the result?
We didn't have the time to even download pystone.
1) in Python a lot of opcodes are quite complex and time-consuming,
That's wrong for a number of reasons - the most common opcodes are probably loads from the constant pool, and loads and stores to/from the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed the dispatch after the LOAD_FAST opcode.
if you do benchmarks like the one shown above, I agree with you. If you consider real-world applications, unfortunately there is more than LOAD_CONST and LOAD_FAST: GETATTR, SETATTR, CALL, etc. are all much more time-consuming than LOAD_{FAST,CONST}
Yep, but well, if you want to implement a low-level algorithm in Python, at some point you'll write such code in an inner loop. But yes, we've also been working on attribute fetch - we've been implementing V8 maps just for that (we worked on it for a couple of days, and I'm fixing the remaining bugs - but it's not really a lot of code). For GET/SET_ATTR, I will need to do "inline caching" of lookup results (even if the term doesn't really apply to an interpreter). Making it faster than a dictionary lookup will be hard, however, since our strings cache their hash code.
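To spell out what "inline caching" means in an interpreter setting, here is a sketch with made-up structures, in the spirit of V8 maps: each GET_ATTR site remembers the last map it saw and the slot it resolved to:

/* Sketch of a per-site attribute cache keyed on the object's map (hidden
   class).  All names are made up; this is the idea, not our code. */
#include <stddef.h>

typedef struct map_s map_t;
struct map_s {
    /* describes the layout of all objects sharing this map; the full
       lookup is the slow path */
    int (*lookup_slot)(const map_t *self, const char *name);
};

typedef struct {
    const map_t *map;
    long slots[8];
} object_t;

typedef struct {             /* one of these per GET_ATTR bytecode site */
    const map_t *cached_map;
    int cached_slot;
} attr_cache_t;

static long get_attr(object_t *obj, const char *name, attr_cache_t *ic)
{
    if (obj->map == ic->cached_map)       /* hit: one compare, one load */
        return obj->slots[ic->cached_slot];
    int slot = obj->map->lookup_slot(obj->map, name);   /* miss: full lookup */
    ic->cached_map = obj->map;            /* refill the (monomorphic) cache */
    ic->cached_slot = slot;
    return obj->slots[slot];
}

static int demo_lookup(const map_t *self, const char *name)
{
    (void)self; (void)name;
    return 3;                             /* pretend "x" lives in slot 3 */
}

int main(void)
{
    map_t m = { demo_lookup };
    object_t o = { &m, { 0, 0, 0, 42 } };
    attr_cache_t ic = { NULL, 0 };
    long a = get_attr(&o, "x", &ic);      /* miss, fills the cache */
    long b = get_attr(&o, "x", &ic);      /* hit */
    return (a == 42 && b == 42) ? 0 : 1;
}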
2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc.
No, you don't need that, and not even CPython does it. For exception handling, just _when an exception is thrown_,
[cut]
Sorry, I made a typo: what needs to be remembered is the last *bytecode* executed, not the last line. This is necessary to implement sys.settrace() properly.
I never mentioned exception handling, that was your (wrong :-)) guess.
Indeed, and I was quite surprised to see something like that. I'm much happier now, but see above for my comment on its impact.
If the interpreter loop is able to overflow the Icache, that should be fought through __builtin_expect first, to give hints for jump prediction and lay out slowpaths out-of-line.
I think that Armin tried once to use __builtin_expect, but I don't remember the outcome. Armin, what was it?
On suitable benchmarks, for me it's easy to see that the wrong expectation can give horrible results (not always, but often). And sometimes GCC does get it wrong. But once, on a short opcode, fixing code laid out the wrong way didn't make any important difference. I guess that when executing one simple opcode one maybe doesn't manage to fill a 20-stage pipeline before the next dispatch, which will flush it. So, if you have 10 or 20 instructions, it doesn't matter that much - 20 cycles are needed anyway. In that case, fixing the layout probably made the same instructions execute faster, but they had to wait longer anyway. Also, I do guess that dynamic branch prediction could fix a wrong code layout (not perfectly though - a predicted taken branch costs more than one not taken).

Note: to give you actual numbers, BINARY_ADD uses 16 assembler instructions, and then 11 are needed for dispatching, plus another 7 if the next opcode has a parameter; the slow paths use some more instructions, in the case of EXTENDED_ARG (but they are laid out out-of-line). DUP_TOP is three assembly instructions, POP_TOP is one, and the stack pointer is kept in RBX.

Note 2: by suitable benchmarks, I don't mean unfair ones. It would be easy (and unfair) to give hints for the exact code you have in your benchmark.
Well, my plan is first to try, at some point, to implement threading in the Python interpreter and benchmark the difference - it shouldn't take long but it has a low priority currently.
That would be cool, tell us when you have done it :-).
For sure I will let you know, just don't hold your breath on that :-). Regards -- Paolo Giarrusso

Hi Paolo,

Just a quick note as an answer to your long and detailed e-mail (thanks for it, btw).

On the whole you're making quite some efforts to get Python fast, starting with a subset of Python and adding feature after feature until it is more or less complete, while benchmarking at every step. This is not a new approach: it has been tried before for Python. Usually, this kind of project ends up being not used and forgotten, because it's "only" 80% or 90% compatible but not 99% -- and people care much more, on average, about 99% compatibility than about 50% performance improvement. PyPy on the other hand starts from the path of 99% compatibility and then tries to improve performance (which started as 10000 times slower... and is now roughly 1.5 or 2 times slower).

Just saying that the approach is completely different... And I have not much interest in it -- because you change the language and have to start again from scratch. A strong point of PyPy is that you don't have to; e.g. we have, in addition to the Python interpreter, a JavaScript, a Smalltalk, etc...

A bientot, Armin.

Hi Armin, first, thanks for your answer. On Tue, Dec 23, 2008 at 16:01, Armin Rigo <arigo@tunes.org> wrote:
On the whole you're making quite some efforts to get Python fast, starting with a subset of Python and adding feature after feature until it is more or less complete, while benchmarking at every step. This is not a new approach: it has been tried before for Python. Usually, this kind of project ends up being not used and forgotten, because it's "only" 80% or 90% compatible but not 99% -- and people care much more, on average, about 99% compatibility than about 50% performance improvement. PyPy on the other hand starts from the path of 99% compatibility and then tries to improve performance (which started as 10000 times slower... and is now roughly 1.5 or 2 times slower).
Just saying that the approach is completely different... And I have not much interest in it -- because you change the language and have to start again from scratch. A strong point of PyPy is that you don't have to; e.g. we have, in addition to the Python interpreter, a JavaScript, a Smalltalk, etc...

I never said we're going to turn this into a full-featured Python interpreter, nor that we'd rewrite all the libraries for it.
So, just a few clarifications:

1) This is a _student project_ which is currently "completed" and has been handed in; it was written by two students and was our first interpreter ever (and for one of us, the first really big C project). I knew that locals are faster than structure fields, but I had absolutely no idea of why, or by how much, before starting to experiment with this.

2) It is intended to be a way to learn how to write such an interpreter, and a proof of concept that Python can be made faster. The first two things I'll try to optimize are the assignment to ->f_lasti and the addition of indirect threading (even if right now I'd guess an impact of around 5%, if anything, because of refcounting). If I want to try something without refcounting, I guess I'd turn to PyPy, but don't hold your breath for that. The fact that indirect threading didn't work, that you're 1.5-2x slower than CPython, and that you store locals in frame objects all show that the abstraction overhead of the interpreter is too high. Since you have different types of frame objects, I guess you might use virtuals to access them (even though I hope not), or that you have some virtuals anyway. And that'd be a problem as well.

3) Still, I do believe that working on it was interesting to get experience about how to optimize an interpreter. And the original idea was to show that real multithreading (without a global interpreter lock) cannot be done in Python just because of the big design mistakes of CPython.

Regards -- Paolo Giarrusso

Hi Paolo, On Thu, Dec 25, 2008 at 12:42:18AM +0100, Paolo Giarrusso wrote:
If I want to try something without refcounting, I guess I'd turn to PyPy, but don't hold your breath for that. The fact that indirect threading didn't work, that you're 1.5-2x slower than CPython, and that you store locals in frame objects all show that the abstraction overhead of the interpreter is too high.
True, but a 1.5x slowdown is not a big deal on many applications; the blocker is mostly elsewhere. And on the other hand, we've been working on the JIT generator -- for a while now, so I cannot make any promises -- and the goal is to turn this slowish interpreter into a good JITting one "for free". As far as I know, this cannot be done so easily if you start from a lower-level interpreter like CPython. This is our "real" goal somehow, or one of our real goals: a JITting virtual machine for any language that we care to write an interpreter for.
3) still, I do believe that working on it was interesting to get experience about how to optimize an interpreter.
Sure, it makes some sense if you think about it in this way. It doesn't if you think about "I want to give the world a fast Python interpreter", but you corrected me already: this is not your goal, so it's fine :-)
And the original idea was to show that real multithreading (without a global interpreter lock) cannot be done in Python just because of the big design mistakes of CPython.
I disagree on this point. Jython shows that, without a redesign of the language, a real multithreading model can be implemented without a GIL.

About PyPy, the lesson that we learned is different: it's that implementing a GIL-free model requires a bit of tweaking of the whole interpreter -- as opposed to doing it in a nicely separable manner. So far we have only implemented threads-with-a-GIL, which can be done in this separable way (i.e. without changing a single line of most of the interpreter). Given enough interest we can implement full GIL-free multithreading the slow way: by going over most features of the interpreter and thinking. Note that it would also be possible to do that in CPython (but it would be even more work). It's just that doing so is not really "in the spirit" of PyPy :-) In our style, we would rather start by thinking harder about some unpolished ideas that float around about doing it in a separable manner anyway -- but this has not been tried at all.

A bientot, Armin.

On Fri, Jan 2, 2009 at 14:23, Armin Rigo <arigo@tunes.org> wrote:
Hi Paolo,
On Thu, Dec 25, 2008 at 12:42:18AM +0100, Paolo Giarrusso wrote:
If I want to try something without refcounting, I guess I'd turn to PyPy, but don't hold your breath for that. The fact that indirect threading didn't work, that you're 1.5-2x slower than CPython, and that you store locals in frame objects all show that the abstraction overhead of the interpreter is too high.
True, but a 1.5x slowdown is not a big deal on many applications; the blocker is mostly elsewhere. And on the other hand, we've been working on the JIT generator -- for a while now, so I cannot make any promises -- and the goal is to turn this slowish interpreter into a good JITting one "for free". As far as I know, this cannot be done so easily if you start from a lower-level interpreter like CPython. This is our "real" goal somehow, or one of our real goals: a JITting virtual machine for any language that we care to write an interpreter for.
Let's say 1.5x * 2x = 3x, since CPython is not as fast as it could be, because of refcounting for instance. The 2x is taken from PyVM performance reports (see http://www.python.org/dev/implementations/). And for PyPy a slower interpreter means a slower JIT output, doesn't it? See below.

Also, according to CS literature, interpreter performance makes more difference than JIT. According to the paper about efficient interpreters I'm mentioning endless times, an inefficient interpreter is 1000x slower than C, while an efficient one is only 10x slower. Python is not that slow, but what you wrote about Psyco seems to imply that there is still lots of room for improvement: "You might get 10x to 100x speed-ups. It is theoretically possible to actually speed up this kind of code up to the performance of C itself."

At some point, I should repeat this comparison with the OCaml interpreter, http://mail.python.org/pipermail/python-list/2000-December/063212.html, to see how much faster it got :-).
And on the other hand, we've been working on the JIT generator -- since a while now, so I cannot make any promise -- and the goal is to turn this slowish interpreter into a good JITting one "for free". As far as I know, this cannot be done so easily if you start from a lower-level interpreter like CPython. This is our "real" goal somehow, or one of our real goals: a JITting virtual machine for any language that we care to write an interpreter for.
That's amazing - I knew that, but your explanation is much nicer, since it puts the emphasis on the right thing.
From a native interpreter you can only do code copying, which removes dispatch costs but is still suboptimal.
And well, the current overhead can be removed with further improvements to the interpreter and/or the RPython translator. And it is important, for two reasons:

1) an interpreter is still needed to profile the code; experience with Java showed that having only a JIT gives too much latency, and people expect even less latency from Python;

2) partially different translators are used, I guess, but since the compiler and the interpreter are built from the same source, part of the 3x slowdown would also affect the JIT-ed code.

In any case, even if it may be unclear, I do like your plans. Believe me, lack of time is the only thing which has prevented me from actually joining you and coding.
And the original idea was to show that real multithreading (without a global interpreter lock) cannot be done in Python just because of the big design mistakes of CPython.
I disagree on this point. Jython shows that without redesign of the language, they can implement a real multithreading model without a GIL.
Ok, first I have to rephrase what I said. Python in itself is a Good Thing. BUT there are some big design mistakes in some semantic details of __del__, which make a difference, for instance, for code using files without calling close(). To be polite, those semantics are totally brain-dead, unless one worships reference counting and considers GC the devil.

Having clarified that, I can state that indeed, Jython may only show the opposite of what you say; indeed, I like to say that they didn't bother to be bug-for-bug compatible with Python's _specification bugs_, and that's why it seems they proved your point. Need I point out that reference cycles can't be reclaimed in Python if any involved object implements __del__? I know the Python gc module allows one to fix that, but in really clumsy ways. I was going to bash the Python specs a lot more, but I'll spare your time for now :-).

Regards -- Paolo Giarrusso

Hi,

Your writing style is too annoying to answer properly :-( I will not try to answer all of the points you mention; instead, let's just point out (as others have done) that the picture is more complex than you suppose, because it includes a lot of aspects that you completely ignore.

I know definitely that with changes to the semantics of the language, Python could be made seriously faster. PyPy is not about this; it's a collection of people who either don't care too much about performance, or believe that even with these semantics Python *can* be made seriously fast -- it's just harder, but not impossible.

A bientot, Armin.

On Sat, Jan 3, 2009 at 15:12, Armin Rigo <arigo@tunes.org> wrote:
Hi,
Your writing style is too annoying to answer properly :-(
Sorry for that, I don't usually do it. I'm trying to advocate my position, but sometimes I realize there's a subtle boundary with flaming; and most of that flaming is actually addressed to CPython developers.
I will not try to answer all of the points you mention; instead, let's just point out (as others have done) that the picture is more complex than you suppose, because it includes a lot of aspects that you completely ignore.
Tradeoffs are always involved, but in lots of cases in the Python community, misplaced concerns about the simplicity of the Python VM, and ignorance of the literature on virtual machines, lead to mistakes. And I think we mostly agree on the facts underlying this statement, even if we express them with different harshness. I don't claim to be an expert in VM development, not at all. But since there is a big disagreement between what I learned during 3 months of lessons (a first course on VM development) and CPython, one of the two is wrong; my professor suggests that the Python side is wrong and that the implementers of Smalltalk, Self, and other languages have a point. And all the evidence I could find (including your project) agrees with that, on topics like:

- threading in the interpreter
- garbage collection vs. reference counting
- tagged pointers
- hidden classes (i.e. V8 maps, which mostly date back to the early '90s).

If anybody is interested, this is the homepage of the course, with the needed references (most of the links to papers are not directly accessible, but the papers can be downloaded from other sources): http://www.daimi.au.dk/~amoeller/VM/

== __DEL__ SEMANTICS ==
I know definitely that with changes to the semantics of the language, Python could be made seriously faster.
PyPy is not about this; it's a collection of people who either don't care too much about performance, or believe that even with these semantics Python *can* be made seriously fast -- it's just harder, but not impossible.
Just to make sure: I don't advocate changing semantics which have value for the programmer; I advocate changing semantics when they have problems anyway, as in this case, or for the int/long distinction, which created problems both for tagged integers and for implementers.

The two posts pointed out by Carl are interesting:
http://morepypy.blogspot.com/2008/02/python-finalizers-semantics-part-1.html
http://morepypy.blogspot.com/2008/02/python-finalizers-semantics-part-2.html

I'm interested in the costs of guaranteeing destructor calls in topological order, and I might reconsider my position on that particular point; however, "topological-order-unless-a-cycle-is-found" is IMHO a worse semantics than "no-topological-order", and that's what actually matters, because it complicates the implementation without being easier to use correctly: using fields requires proving that the object cannot be part of a cycle, and that in turn requires an analysis of the internals of other objects; this shouldn't be required of the programmer, since data hiding mostly forbids relying on it. It would maybe be more useful to implement "reverse topological order" and to clear all pointers to the children of an object, to make sure that nobody tries to rely on undefined behaviour, at least for Java, but maybe also for Python.

== SUMMARY ==

So, the full Python semantics (i.e. __del__ is like a C++ destructor) are surely easier to use than Java/.NET finalizers, and are safe, because one cannot get NULL pointers; but they have the huge problem of memory leaks. I'm undecided about "topological-order-unless-a-cycle-is-found". Java finalizers are something that nobody should use. Python weakrefs still allow resurrection, which causes slowdowns as documented by Boehm. Java weakrefs involve a fair amount of boilerplate code and they also require the programmer to decide where the reference queue should be polled, but at least they are quite difficult (if not impossible) to misuse.

Regards -- Paolo Giarrusso

Hi Paolo,

Happy new year! I am getting back to one of the points that Anto has made which you dismissed. I still think that it is the central one to explain why we don't think that threaded interpretation is the panacea that you think it is.

Paolo Giarrusso wrote:
On Fri, Jan 2, 2009 at 14:23, Armin Rigo <arigo@tunes.org> wrote:
On Thu, Dec 25, 2008 at 12:42:18AM +0100, Paolo Giarrusso wrote:
If I want to try something without refcounting, I guess I'd turn to PyPy, but don't hold your breath for that. The fact that indirect threading didn't work, that you're 1.5-2x slower than CPython, and that you store locals in frame objects all show that the abstraction overhead of the interpreter is too high.
True, but a 1.5x slowdown is not a big deal on many applications; the blocker is mostly elsewhere.
Let's say 1.5x * 2x = 3x, since CPython is not as fast as it could be, because of refcounting for instance. The 2x is taken from PyVM performance reports (see http://www.python.org/dev/implementations/). And for PyPy a slower interpreter means a slower JIT output, doesn't it? See below.
Also, according to CS literature, interpreter performance makes more difference than JIT. According to the paper about efficient interpreters I'm mentioning endless times, an inefficient interpreter is 1000x slower than C, while an efficient one is only 10x slower. Python is not that slow, but what you wrote about Psyco seems to imply that there is still lots of room for improvement.
I finally looked at the paper. I think that one of the basic premises of the paper does not apply to Python. This is the premise that a VM instruction itself is a simple operation. For example the paper states: "Most VM instructions are quite simple (e.g., add the top two numbers on the stack), and can be implemented in a few native instructions plus dispatch code. Complex VM instructions occur relatively rarely, and the setup for them typically takes many simple VM instructions; [...]" This is just nonsense in Python, and indeed in many other dynamic languages as well. Take a simple fibonacci function:
>>> def fibo(n):
...     if n <= 2:
...         return 1
...     return fibo(n - 1) + fibo(n - 2)
...
>>> import dis
>>> dis.dis(fibo)
  2           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               1 (<=)
              9 JUMP_IF_FALSE            8 (to 20)
             12 POP_TOP

  3          13 LOAD_CONST               2 (1)
             16 RETURN_VALUE
             17 JUMP_FORWARD             1 (to 21)
        >>   20 POP_TOP

  4     >>   21 LOAD_GLOBAL              0 (fibo)
             24 LOAD_FAST                0 (n)
             27 LOAD_CONST               2 (1)
             30 BINARY_SUBTRACT
             31 CALL_FUNCTION            1
             34 LOAD_GLOBAL              0 (fibo)
             37 LOAD_FAST                0 (n)
             40 LOAD_CONST               1 (2)
             43 BINARY_SUBTRACT
             44 CALL_FUNCTION            1
             47 BINARY_ADD
             48 RETURN_VALUE

Of the opcodes present here, some are indeed simple (like LOAD_CONST and JUMP_FORWARD, etc.). However, most of them perform potentially complex operations and a disproportionate amount of time is spent in them. E.g. COMPARE_OP, BINARY_SUBTRACT, BINARY_ADD and CALL_FUNCTION are implemented using special methods, so they are essentially method calls, which need a full method lookup (which can take a lot of time, given complex class hierarchies). Then the method can be implemented in Python, which leads to even more overhead (yes, one could try to speed up common cases like integers for binary operations, but those turn out to not be all that common in normal Python code). Even a seemingly harmless opcode like LOAD_GLOBAL is rather complex to implement: first you need to check the global dictionary and then potentially the builtins dictionary.

I guess to be really conclusive one would have to do measurements about which bytecodes take the most time at runtime. However, there is some evidence from PyPy that what I said is right: implementing a method cache gave PyPy much larger speedups than any of the experiments we ever did for speeding up dispatching.

For these reasons I argue that the paper you keep citing vastly exaggerates the effect of bytecode dispatch for Python, and I would guess also for other dynamic languages. I am entering pure speculation here, but I could guess that the reason the Perl interpreter appears so inefficient in the paper is because it has similar characteristics. Being a bit mean, the paper seems to be a bit of a self-fulfilling prophecy: "Efficient interpreters are those that have a large number of indirect branches. Therefore, efficient interpreters will be helped by our techniques. If our techniques don't help, well, then the interpreter just wasn't efficient enough in the first place."

Note that I am not saying that techniques like a threaded interpreter loop cannot lead to a faster Python interpreter. I am pretty sure that they can help. My main point is that they might not help as much as you seem to hope, and that one needs additional things to get really great performance results (and with great I mean basically anything that is more than an order of magnitude better).

[snip]

Cheers, Carl Friedrich

Paolo Giarrusso wrote:
the question is: is it possible for a full python interpreter to be "efficient" as you define it?
Well, my guess is "if Prolog, Scheme and so on can, why can't Python"?
a possible answer is that python is much more complex than prolog; for example, in PyPy we also have an rpython implementation of both prolog and scheme (though I don't know how complete the latter one is). I quickly counted the number of lines for the interpreters, excluding the builtin types/functions, and we have 28188 non-empty lines for python, 5376 for prolog and 1707 for scheme.

I know that the number of lines does not mean anything, but I think it's a good hint about the relative complexities of the languages. I also know that being more complex does not necessarily mean that it's impossible to write an "efficient" interpreter for it, it's an open question.

Thanks for the interesting email, but unfortunately I don't have time to answer right now (xmas is coming :-)), so I'll just drop a few quick notes:
And while you don't look like that, the mention of "tracking the last line executed" seemed quite weird. And even tracking the last bytecode executed looks weird, even if maybe it is not. I'm inspecting CPython's Python/ceval.c, and the overhead for instruction dispatch looks comparable. The only real problem I'm getting right now is committing the last bytecode executed to memory. If I store it into a local, I have no problem at all; if I store it into the interpreter context, it's a store to memory, so it hurts performance a lot - I'm still wondering about the right road to go.
by "tracking the last bytecode executed" I was really referring to the equivalent of f_lasti; are you sure you can store it in a local and still implement sys.settrace()?
Ok, just done it, the speedup given by indirect threading seems to be about 18% (see also above). More proper benchmarks are needed though.
that's interesting, thanks for trying. I wonder whether I should try indirect threading in pypy again sooner or later. Btw, are the sources for your project available somewhere?
And as you say in the other mail, the overhead given by dispatch is quite more than 50% (maybe).
no, it's less. 50% is the total speedup given by geninterp, which removes dispatch overhead but also other things, like storing variables on the stack and turning python level flow control into C-level flow control (so e.g. loops are expressed as C loops).
Am I correct in assuming that "geninterpret"ing _basically_ pastes the opcode handlers together? I guess with your infrastructure, you can even embed easily the opcode parameters inside the handlers, it's just a trivial partial evaluation
that's (part of) what our JIT is doing/will do. But it does much more than that, of course. Merry Christmas to you and all pypyers on the list! ciao, Anto

Hi! This time I'm trying to answer briefly.

Is this the geninterp you're talking about? http://codespeak.net/pypy/dist/pypy/doc/geninterp.html

Is the geninterpreted version RPython code? I'm almost sure it is, except for the """NOT_RPYTHON""" docstring in the geninterpreted source snippet. I guess it's there because the _source_ of it is not RPython code.

On Wed, Dec 24, 2008 at 22:29, Antonio Cuni <anto.cuni@gmail.com> wrote:
Paolo Giarrusso wrote:
I quickly counted the number of lines for the interpreters, excluding the builtin types/functions, and we have 28188 non-empty lines for python, 5376 for prolog and 1707 for scheme.
I know that the number of lines does not mean anything, but I think it's a good hint about the relative complexities of the languages.
Also about the amount of Python-specific optimizations you did :-).
I also know that being more complex does not necessarily mean that it's impossible to write an "efficient" interpreter for it, it's an open question.
The 90-10 rule should apply anyway, but the overhead for obscure features might be a problem. Well, reflection on the call stack can have a big runtime impact, but that's also present in Smalltalk as far as I know, and it can be handled as well. Anyway, if Python developers are not able to implement efficient multithreading in the interpreter because of the excessive performance impact, and they don't decide to drop refcounting, saying "there's room for optimization" looks like a safe bet; the source of the idea is what I was taught in the course, but I'm also noticing it by myself.
Thanks for the interesting email, but unfortunately I don't have time to answer right now (xmas is coming :-)), so I'll just drop a few quick notes:
Yeah, for me as well, plus I'm in the last month of my Erasmus study time :-)
Ok, just done it: the speedup given by indirect threading seems to be about 18% (see also above). More proper benchmarks are needed, though.
that's interesting, thanks for having tried it. I wonder whether I should try indirect threading in pypy again sooner or later.
I would do it together with OProfile benchmarking of indirect branches and of their mispredictions (see the presentation for the OProfile commands on the Core 2 processor).
Btw, are the sources for your project available somewhere?
They will be, sooner or later. There are a few bugs I should fix, and a few other interesting things to do. But if you are interested in benchmarking it even though it's a student project, not feature complete and likely buggy, I might publish it earlier.
And as you say in the other mail, the overhead given by dispatch is maybe even more than 50%.
no, it's less.
Yeah, sorry, I remember you wrote geninterp also does other stuff.
50% is the total speedup given by geninterp, which removes dispatch overhead but also does other things, like storing variables on the stack
I wonder why that's not done by your stock interpreter - the CPython frame object has a pointer to a real stack frame; I'm not sure, but I guess this can increase stack locality, since a 32/64-byte cacheline is much bigger than a typical stack frame and has space for the operand stack as well (and needless to say, we store locals on the stack, like JVMs do). The right benchmark for this, I guess, would be profiling cache misses with OProfile on a recursion test like factorial or Fibonacci.
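As an illustration of what I mean by keeping locals and the operand stack on the C stack, here is a minimal sketch; the names, sizes and calling convention are made up:

    #include <stddef.h>

    typedef struct object object;   /* opaque boxed value, for illustration */

    #define MAX_LOCALS 16
    #define MAX_STACK  16

    /* One Python-level call becomes one C-level call, so locals and the
       operand stack live in the C frame and enjoy ordinary stack locality. */
    object *interpret(const unsigned char *code, object **args, size_t nargs) {
        object *locals[MAX_LOCALS];   /* in this C stack frame */
        object *stack[MAX_STACK];     /* operand stack, also in this C frame */
        size_t sp = 0;

        for (size_t i = 0; i < nargs && i < MAX_LOCALS; i++)
            locals[i] = args[i];

        /* ... dispatch loop; a CALL_FUNCTION opcode with n arguments would
           recurse roughly as:
               stack[sp - n - 1] = interpret(callee_code, &stack[sp - n], n);
           so the callee's locals land just above this frame on the C stack. */
        (void)code; (void)stack; (void)sp;
        return nargs ? locals[0] : NULL;
    }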
and turning Python-level flow control into C-level flow control (so e.g. loops are expressed as C loops).
Looking at the geninterpreted code, it's amazing that the RPython translator can do this. Can it also already specialize the interpreter for each of the object spaces and save the virtual calls?
== About F_LASTI ==
by "tracking the last bytecode executed" I was really referring to the equivalent of f_lasti; are you sure you can store it in a local and still implement sys.settrace()?
Not really, I didn't even start studying its proper semantics, but now I know it's worth a try and some additional complexity, at least in an interpreter with a GC. If one write to memory has such a horrible impact, I'm frightened by the possible impact of refcounting; on the other hand, I wouldn't be surprised if saving the f_lasti write had no impact on CPython. My current approach would be: if I can identify code paths where no code can even look at it (and I guess that most simple opcodes are such paths), I copy f_lasti to a global structure only on the other paths; and if f_lasti is just passed to the code-tracing routine, which is called only from the interpreter loop, I could even turn it into a parameter of that routine (it may be faster with a register calling convention, but anyway IMHO one gets code which is easier to follow). Actually, I even wonder whether I could set it only when tracing is active; but since that would be trivial to do, I guess the problem is that when you return from a call to settrace you discover, without being able to anticipate it, that you now need to know the previous opcode - that's why it's not already done that way. Still, a local can do even for that, or more complicated schemes can work as well (basically, the predecessor is always known at compile time except for jumps, so only jump opcodes really need to compute f_lasti). Regards -- Paolo Giarrusso
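A minimal C sketch of the "only jumps need to compute f_lasti" idea from the message above - the opcode encoding, names and the tracing hook are all hypothetical:

    struct frame;                                   /* interpreter context */
    void trace_hook(struct frame *f, int last_pc);  /* assumed tracing routine */

    /* last_pc stays in a C local and is handed to the tracing routine as a
       parameter; nothing is written to the frame in the common path.  For
       straight-line code the predecessor is known at compile time anyway, so
       only jump opcodes genuinely need to compute it at run time. */
    void run(struct frame *f, const unsigned char *code, int len, int tracing) {
        int pc = 0;
        int last_pc = 0;                  /* C local, typically a register */

        while (pc < len) {
            last_pc = pc;
            switch (code[pc]) {
            case 0x01:                    /* simple opcode */
                pc += 1;
                break;
            case 0x02:                    /* unconditional jump to code[pc+1] */
                pc = code[pc + 1];
                break;
            default:
                return;
            }
            if (tracing)
                trace_hook(f, last_pc);   /* passed only when someone looks */
        }
    }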

Interesting discussion. Just a note: in Jython, f_lasti is only used to manage exit/entry points, specifically for coroutines/generators, so this is not at the level of bytecode granularity. We also set f_lineno, at the level of Python code of course. HotSpot apparently optimizes this access nicely in any event. (There are other problems with supporting call frames, but it seems this is not one of them.) Java also offers a debugging interface which, in conjunction with a C++ agent, allows for more fine-grained access to these internals, potentially with lower overhead. This is something Tobias Ivarsson has been exploring. - Jim
-- Jim Baker jbaker@zyasoft.com