On Sun, Dec 21, 2008 at 23:49, Antonio Cuni <anto.cuni@gmail.com> wrote:
Paolo Giarrusso wrote:
Thanks for the interesting mail. It really goes into detail about the issues I raised, and it suggests exactly what I needed: features needing support inside the interpreter fast path, even if only as unlikely branches. I started the required coding, but it will take a while (also because I have little time, given Christmas). Still, I'm answering with some preliminary results. First, I'd answer this question:
That's your problem - threading helps when you spend most of the time on dispatch, and efficient interpreters get to that point.
the question is: is it possible for a full python interpreter to be "efficient" as you define it?
Well, my guess is: if Prolog, Scheme and so on can, why can't Python? I know this is an interesting question, and my colleague and I tried to get an answer during the past two months. And it wasn't really two months of work for two people; I would say much less time was actually spent. So, for instance, we don't have support for modules or exception handling, and the basic support for objects (through Self/V8-style maps) is not yet complete (but almost).

For a couple of points, though, it isn't possible. Unboxed integers work better with the Python 3.x specs: if you have them and you still want a boxed 32-bit integer to be an int instead of a long, you get some problems. Also, the semantics of Python destructors are too restrictive for a proper GC, and they can create unreclaimable reference cycles (when all objects in the cycle have destructors). Java, unlike C++, has finalizers instead; the difference is that you can finalize an object after finalizing the objects it points to. That's why I prefer not to say "Python finalizers", as the official docs do. However, in both cases I feel that the semantics are wrong. Even finalizers are a bad idea; it's interesting that .NET is even more misguided than Java about them (and that's the opinion of Hans Boehm, not mine). So those semantics fit only reference counting well, and reference counting, in turn, makes support for multithreading impossible (the patches for "free threading", without the global interpreter lock, gave a 2x slowdown, I think).

Somebody on the python-dev ML realized that CPython simply gets both things wrong (reference counting and threading support), but most people responded with "oh, people shouldn't use threading anyway, it's too complicated to get right, they should use multiprocess stuff...". That was one of the most (negatively) astonishing things I saw.
Oh no, the worst was something like "oh, we should allow using more interpreters in the same process, but we'd have to move lots of statics into the interpreter context, and it's a lot of work, so it's not worth it". Since I've participated in a community where bigger patches are committed every week, this looks like plain laziness; it could be because far fewer people are paid to work on Python, or maybe because that is the general community mood. In fact, one of the reasons I started this project _at all_ so early is that the Python community clearly shows _ignorance_ of years of VM research and little attention to VM performance.

== Benchmarks about dispatch overhead ==

And while you don't seem to share that attitude, the mention of "tracking the last line executed" seemed quite weird. Even tracking the last bytecode executed looks weird, although maybe it is not. I'm inspecting CPython's Python/ceval.c, and the overhead for instruction dispatch looks comparable. The only real problem I'm getting right now is committing the last bytecode executed to memory. If I store it in a local, I have no problem at all; if I store it in the interpreter context, it's a store to memory, so it hurts performance a lot. I'm still wondering about the right road to take. Runtime of the while benchmark, with 10x more iterations, user time:

- baseline (now, with tracking of f_lasti inside a local): ~4.6s
- add ->f_lasti tracking _into memory_: ~5.6s
- disable indirect threading: ~5.6s
- both slowdowns together: ~7s
- Python 2.5.1: ~11.2s

Other benchmarks use fatter opcodes, so the impact is still visible, but it is lower relative to the overall time. Even if f_lasti must be updated all the time, could it be stored to memory only on some occasions? Is it needed during the call to settrace(), or only from the next opcode after that, i.e. after the return into the interpreter loop? The second case would be enough.
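As a toy illustration of the trade-off (hypothetical C, not CPython's actual loop; all names are invented), the idea is to keep the last-instruction index in a local, which can live in a register, and commit it to the in-memory frame only on the tracing slow path and when leaving the loop:

```c
#include <assert.h>

/* Hypothetical frame structure; f_lasti mirrors the field of the
 * same name in CPython frame objects. */
typedef struct {
    int f_lasti;   /* last instruction index, committed to memory */
    int tracing;   /* nonzero when settrace-style tracing is active */
} frame_t;

/* Toy "interpreter loop" that just sums opcode values, to show
 * where the commits to memory would happen. */
static int sum_opcodes(frame_t *f, const int *code, int n) {
    int lasti = f->f_lasti;    /* hot copy; can stay in a register */
    int acc = 0;
    for (int i = 0; i < n; i++) {
        lasti = i;
        acc += code[i];
        if (f->tracing)        /* unlikely slow path */
            f->f_lasti = lasti;
    }
    f->f_lasti = lasti;        /* single commit when leaving the loop */
    return acc;
}
```

Whether the final state is precise enough for settrace() is exactly the open question above; the sketch only shows where the memory traffic moves.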
Storing the variable to memory only on those occasions would be simply excellent for performance, maybe even in the stock Python interpreter.

Note: all this is on a 64-bit box, with 16 registers. We're also faster on a 32-bit one, and in some cases the difference w.r.t. Python is bigger on 32-bit (on the GC stress test, for instance, due to smaller pointers to update, and on the recursion test, maybe for the same reason when using the stack, since the stack element size is 64 bits on 64-bit machines), while on the other benchmarks the extra registers (I guess) give the 64-bit interpreter an advantage w.r.t. Python (obviously, the comparison is always with Python on the same machine and architecture). See slide 8 of pres.pdf.
That's interesting but says little about the benefit of threaded interpretation itself, as the speedup could be given by the other optimizations. For example, I suspect that for the benchmark you showed most of the speedup is because of tagged pointers and the better GC.
Is it possible to make your eval loop non-threaded? Measuring the difference with and without indirect threading could give a good hint of how much you gain from it.
OK, I've just done it: the speedup given by indirect threading seems to be about 18% (see also above). More proper benchmarks are needed, though. And as you say in the other mail, the overhead given by dispatch is quite a bit more than 50% (maybe). Am I correct in assuming that "geninterpret"ing _basically_ pastes the opcode handlers together? I guess with your infrastructure you can even easily embed the opcode parameters inside the handlers; it's just a trivial partial evaluation. I tried to apply code copying of machine code to my interpreter, but I would have had to keep parameter fetching separate (getting the output I needed from GCC was not easy; I could make code copying work only for an early prototype). About keeping variables on the stack instead of in the frame: that's even stranger to me, given this argument.
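For readers unfamiliar with the technique: indirect threading replaces the central switch with a per-handler indirect jump through a dispatch table, so each handler jumps straight to the next one. A minimal sketch using GCC's labels-as-values extension (invented opcode names, not our real interpreter; GCC/Clang only):

```c
#include <assert.h>

/* Tiny made-up instruction set, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_HALT };

static long run(const int *code) {
    /* Table of handler addresses; labels-as-values is a GCC extension. */
    static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
    long stack[16], *sp = stack;
    const int *pc = code;

    /* Instead of looping back to a central switch, every handler ends
     * with its own indirect jump to the next opcode's handler. */
    goto *dispatch[*pc++];

op_push:
    *sp++ = *pc++;              /* operand word follows the opcode */
    goto *dispatch[*pc++];
op_add:
    sp--; sp[-1] += sp[0];      /* pop two, push the sum */
    goto *dispatch[*pc++];
op_halt:
    return sp[-1];
}
```

The point is that each opcode gets its own dispatch branch, which branch predictors can learn per-opcode, instead of one shared, hard-to-predict branch at the top of a switch.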
What kind of bytecode do you use? The same as CPython or a custom one? E.g., I found that if we want to handle the EXTENDED_ARG CPython opcode properly, it is necessary to insert a lot of code before jumping to the next opcode.
We handle a subset of Python's opcodes, but what you say means that we'll have to handle EXTENDED_ARG properly, because of our mission. Oh, done, I would say (tests needed, though). I think that "lot of code" can only refer to the requirement of tracking the last bytecode executed, but I'll have to look at your sources and CPython's for that. It seems to have an impact of a few percentage points (like 4-5% slower than before), but I have to do real benchmarking for this stuff. I fear I'll have to calculate confidence intervals to give any meaningful number (I never needed that when comparing us with CPython), because the impact seems to be below statistical noise.

However, just having the code laid out the wrong way can make it much worse. As soon as I wrote it, it almost doubled the runtime of the arithmetic benchmark (while it had a much lower impact on the others). It seems it was because GCC decided that HAS_ARG(opcode) had become unlikely; adding "likely" fixed it. In fact, one of the reasons we are so fast is that we started with a small interpreter and tracked down every performance regression as soon as it happened. At some point we lost half of our performance just because the address of the program_counter local variable escaped, so that variable had to be allocated on the stack and not in a register.

== Opcode set ==

We really wanted to have our own opcode set, but we didn't manage. In particular, I was pleased when I read about your LOAD_METHOD optimization. Also, invoking __init__ is a real mess; Java handles it much more nicely, and I don't see the point of not using JVM-style 2-step class construction:

    new Class
    dup
    <push arguments>
    invokespecial Class.<init>

The existence of bound methods prevents simply pushing the receiver as the first argument on the stack; however, I can't see a similar argument for class construction, even if __init__ can be a bound method. Generating similar bytecode would avoid the need to rotate the stack.
In Python, one would get something like:

    ALLOC_INSTANCE Class   <--- new opcode
    DUP_TOP                (copy the value to return)
    LOAD_METHOD __init__
    <push arguments>
    INVOKE_METHOD
    POP_TOP                (remove the None returned by the constructor)

However, the dispatch cost of the additional DUP_TOP and POP_TOP might be a problem. I guess such bytecode will for sure be more efficient to JIT-compile, but for actual interpretation, benchmarks would be needed.
Moreover, tagging pointers with 1 helps a lot for numerical benchmarks, but it is possible that it causes a slowdown for other kinds of operations. Do you have non-numerical benchmarks? (Though I know it's hard to get a fair comparison, because the Python object model is complex and it's not easy to write a subset of it in a way that is not cheating.)
I agree with that. The choice follows the one made by the V8 people, and it was already tested in the Self system, but I'll answer you with actual tests. We have a couple of recursion tests, with a recursive function and a recursive method (but this one is unfair). And a test with list access: we are ~50% faster even there. For the Python object model, we have the opposite problem, because it's much more code to write and optimize: our INPLACE_ADD is equal to BINARY_ADD, so given a list l, a loop with "l += [1]" is able to kill us completely (since for now we create a new list each time). Indeed, "l = l + [1]" is just as slow with our interp, but gets much slower in Python. Even here we are still faster, but most of the time is spent adding lists, so we are just 15% faster. I guess we'd gain more if INPLACE_ADD were implemented properly, and that's not a big deal.
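For concreteness, here is a minimal sketch (invented names, not our actual code) of the classic tagging scheme being discussed: small integers are stored shifted left by one with the low bit set to 1, while heap pointers are word-aligned, so their low bit is always 0. A nice property is that some arithmetic can be done without fully untagging:

```c
#include <stdint.h>

/* A value is either a tagged small integer (low bit 1) or an
 * aligned object pointer (low bit 0). */
typedef intptr_t value_t;

static inline value_t tag_int(intptr_t n)   { return n * 2 + 1; }
static inline intptr_t untag_int(value_t v) { return v >> 1; }
static inline int is_tagged_int(value_t v)  { return (int)(v & 1); }

/* Add two tagged ints in one instruction's worth of work:
 * (2a+1) + (2b+1) - 1 == 2(a+b) + 1, i.e. the tagged sum. */
static inline value_t add_tagged(value_t a, value_t b) {
    return a + b - 1;
}
```

(The arithmetic right shift on negative values is implementation-defined in strict C, though it behaves as expected on the usual compilers; real implementations also need an overflow check before tagging.)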
Finally, as Carl said, it would be nice to know which kind of subset it is.
I just need to find the time to write it down, sorry for the wait. I'm attaching our final presentation, the report we wrote for the course, and the _benchmarks_ we have. But basically, the subset we implemented is _much_ smaller.
E.g. does it support exceptions, sys.settrace() and sys._getframe()?
Exception handling is not supported, but stack unwinding should be possible without impact on the fast path. We didn't know about the other two. I saw code which I guess handles settrace in CPython, but that's on the slow path; I think I should be able to simulate its impact easily by adding one unlikely conditional jump. Is it required to track the last bytecode even when tracing is not active, as asked above?

About _getframe(), I'm quite curious about how it works and how your support for it works. From the next mail you sent, it seems that you construct the needed frames during function calls (and I saw some documentation about them). Well, back in the Smalltalk age, developers found that constructing the frame object only when reflective access to the frame is needed is the only reasonable solution if performance is important. _If_ the frame object returned is modifiable, I still think one can intercept field modifications and modify the original stack to track the modified object.

In fact, to get the current performance, we don't allocate a locals dictionary unless needed; currently it seems that doing it only when STORE_NAME is invoked gives the correct semantics (and if that turned out to be wrong, computing a NEEDS_LOCAL_DICT flag during compilation would be trivial). Otherwise, the recursion test would be slower than the one in Python. Also, we didn't bother with Unicode, and that's complicated to handle; even Unicode handling in V8 has problems, and its string operations are slower than in TraceMonkey.
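The lazy allocation mentioned above can be sketched like this (hypothetical C, with a dummy stand-in for the dictionary type; not our actual code):

```c
#include <stdlib.h>

/* Stand-in for a real hash table implementation. */
typedef struct { int dummy; } dict_t;

/* A frame starts without a locals dictionary; most frames never
 * need one, so most frames never pay for the allocation. */
typedef struct {
    dict_t *locals;   /* NULL until STORE_NAME-style access happens */
} pyframe_t;

static dict_t *frame_get_locals(pyframe_t *f) {
    if (f->locals == NULL)                  /* slow path, at most once */
        f->locals = calloc(1, sizeof(dict_t));
    return f->locals;
}
```

The fast path (plain LOAD_FAST/STORE_FAST locals) never touches this function, which is why the recursion benchmark doesn't pay for it.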
is your subset large enough to handle e.g. pystone? What is the result?
We didn't have the time to even download pystone.
1) in Python a lot of opcodes are quite complex and time-consuming,
That's wrong for a number of reasons - the most common opcodes are probably loads from the constant pool, and loads and stores to/from the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed the dispatch after the LOAD_FAST opcode.
If you do benchmarks like the one shown above, I agree with you. If you consider real-world applications, unfortunately there is more than LOAD_CONST and LOAD_FAST: GETATTR, SETATTR, CALL, etc. are all much more time-consuming than LOAD_{FAST,CONST}.
Yep, but well, if you want to implement a low-level algorithm in Python, at some point you'll write such code in an inner loop. But yes, we've also been working on attribute fetch: we've been implementing V8 maps just for that (we worked on it for a couple of days, and I'm fixing the remaining bugs; it's not really a lot of code). For GET/SET_ATTR, I will need to do "inline caching" of lookup results (even if the term doesn't really apply to an interpreter). Making it faster than dictionary lookup will be hard, however, since our strings cache their hash code.
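A map-based cache for attribute reads could look roughly like this (a sketch with invented types, not our actual implementation): if the receiver's map matches the map cached at this access site, reuse the cached slot offset; otherwise do the full lookup and refill the cache.

```c
#include <stddef.h>

/* Shape descriptor shared by all objects with the same layout
 * (a V8 "map" / Self "hidden class"). */
typedef struct { int n_slots; } map_t;

typedef struct {
    map_t *map;
    long slots[4];     /* attribute values, at fixed offsets per map */
} object_t;

/* One cache per attribute-access site in the bytecode. */
typedef struct {
    map_t *cached_map;
    int cached_offset;
} inline_cache_t;

/* slow_offset stands in for a real map-based name lookup. */
static long get_attr(object_t *o, inline_cache_t *ic, int slow_offset) {
    if (o->map == ic->cached_map)          /* fast path: shape matches */
        return o->slots[ic->cached_offset];
    ic->cached_map = o->map;               /* slow path: lookup + refill */
    ic->cached_offset = slow_offset;
    return o->slots[slow_offset];
}
```

The win over a dictionary lookup is one pointer compare plus an indexed load on the hit path, instead of hashing and probing.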
2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc.
No, you don't need that, and not even CPython does it. For exception handling, just _when an exception is thrown_,
[cut]
Sorry, I made a typo: it is needed to remember the last *bytecode* executed, not the last line. This is necessary to implement properly sys.settrace().
I never mentioned exception handling, that was your (wrong :-)) guess.
Indeed, and I was quite surprised to see something like that. I'm quite happier now, but see above for comment on its impact.
If the interpreter loop is able to overflow the Icache, that should be fought through __builtin_expect first, to give hint for jump prediction and lay out slowpaths out-of-line.
I think that Armin tried once to use __builtin_expect, but I don't remember the outcome. Armin, what was it?
On suitable benchmarks, it's easy for me to see that a wrong expectation can give horrible results (not always, but often). And sometimes GCC does guess wrong. But once, on a short opcode, fixing code laid out the wrong way didn't make any important difference. I guess that executing one simple opcode maybe doesn't fill a 20-stage pipeline before the next dispatch, which will flush it. So whether you have 10 or 20 instructions doesn't matter that much: 20 cycles are needed anyway. In that case, fixing the layout probably made the same instructions execute faster, but they then had to wait longer. Also, I do guess that dynamic branch prediction can fix a wrong code layout (not perfectly, though: a branch predicted taken costs more than one not taken).

Note: to give you actual numbers, BINARY_ADD uses 16 assembler instructions, and then 11 are needed for dispatching, plus another 7 if the next opcode has a parameter; the slow paths use some more instructions, in the case of EXTENDED_ARG (but they are laid out out-of-line). DUP_TOP is three assembly instructions, POP_TOP is one, and the stack pointer is kept in RBX.

Note 2: by suitable benchmarks, I don't mean unfair ones. It would be easy (and unfair) to give hints for the exact code you have in your benchmark.
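For concreteness, the __builtin_expect hints discussed here are usually wrapped in macros like these (GCC/Clang only; HAVE_ARGUMENT = 90 is CPython's actual convention for argument-taking opcodes, while fetch_oparg is an invented sketch of the argument-fetch step):

```c
/* Branch-prediction/layout hints: the compiler puts the expected
 * branch on the fall-through path and the other one out-of-line. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define HAVE_ARGUMENT 90                 /* CPython's convention */
#define HAS_ARG(op)   ((op) >= HAVE_ARGUMENT)

/* Fetch a 16-bit little-endian oparg, advancing *pc past it.
 * Most executed opcodes take an argument, hence likely(). */
static int fetch_oparg(int opcode, const unsigned char **pc) {
    int oparg = 0;
    if (likely(HAS_ARG(opcode))) {
        oparg = (*pc)[0] | ((*pc)[1] << 8);
        *pc += 2;
    }
    return oparg;
}
```

This is exactly the kind of spot where GCC guessing HAS_ARG as unlikely, as described earlier, moves the hot argument-fetch code out of line.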
Well, my plan is first to try, at some point, to implant threading into the Python interpreter and benchmark the difference. It shouldn't take long, but it currently has low priority.
That would be cool, tell us when you have done it :-).
For sure I will let you know, just don't hold your breath on that :-).

Regards
-- 
Paolo Giarrusso