JIT'ed function performance degrades
I am starting to learn how to use the JIT, and I'm confused about why my function gets slower over time: twice as slow after running for a few minutes. Using a virtualizable did speed up my code, but it still has the degrading-performance problem. I have yesterday's SVN and am using 64-bit with Boehm. I understand Boehm is slower, but overall my JIT'ed function is many times slower than un-JIT'ed; is this expected behavior from Boehm? Code is here: http://pastebin.com/9VGJHpNa
On Thu, Aug 19, 2010 at 03:25, Hart's Antler <bhartsho@yahoo.com> wrote:
I am starting to learn how to use the JIT, and I'm confused about why my function gets slower over time: twice as slow after running for a few minutes. Using a virtualizable did speed up my code, but it still has the degrading-performance problem. I have yesterday's SVN and am using 64-bit with Boehm. I understand Boehm is slower, but overall my JIT'ed function is many times slower than un-JIT'ed; is this expected behavior from Boehm?

Code is here: http://pastebin.com/9VGJHpNa
I think this has nothing to do with Boehm. Is it swapping? If yes, that explains the slowdown. Is memory usage growing over time? I expect yes, and it's a misbehavior which could be explained by my analysis below. Is it JITting code? I think no, or not to an advantage, but that's a more complicated guess. BTW, when debugging such things, _always_ ask and answer these questions yourself.

Moreover, I'm not sure you need to use the JIT yourself:

- Your code is RPython, so you could just as well translate it without JIT annotations, and it will be compiled to C code.
- Otherwise, you could write it as an app-level function, i.e. in normal Python, and pass it to a translated PyPy-JIT interpreter.

Did you try and benchmark the code? Can I ask why you did not write it as an app-level function, i.e. as normal Python code, to use PyPy's JIT directly, without needing a detailed understanding of the JIT? It would be interesting to see a comparison (and have it on the web, after some code review).

In particular, I'm not sure that, as currently written, you're getting any speedup, and I seriously wonder whether the JIT could give an additional speedup over RPython here (the regexp interpreter is a completely different case, since it compiles a regexp, but why do you compile an array?). I think just raw CPython can be 340x slower than C (I assume NumPy uses C), and since your code is RPython, there must be something basically wrong.

I think you have too many green variables in your code: "At runtime, for a given value of the green variables, one piece of machine code will be generated. This piece of machine code can therefore assume that the value of the green variable is constant." [1]

So, every time you change the value of a green variable, the JIT has to recompile the function. Note that actually, I think, for each new value of the variable, a given number of iterations first has to occur (1,000? 10,000? I'm not sure), and only then will the JIT spend time creating a trace and compiling it. The length of the involved arrays is maybe around that threshold, maybe smaller, so you get "all pain, and no gain".
From your code:

```python
complex_dft_jitdriver = JitDriver(
    greens = 'index length accum array'.split(),
    reds = 'k a b J'.split(),
    virtualizables = 'a'.split()
    #can_inline=True
)
```
The only acceptable green variables there are IMHO array and length, because in the calling code the others change for each invocation, I think. I also think that only length should be green (and that that could give a speedup), and that marking array as green gives negligible or no speedup. Marking length as green allows specializing the function on the size of the array, something one would probably not do in C, but that one could do in C++. Whether it is worth it depends on the specific code and the optimizations available; I think the speedup here should be small.

Best regards

[1] http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.htm...

-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
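A minimal sketch of the arrangement Paolo is suggesting, reusing the names from the pastie; the exact red list is a guess from the quoted driver, and whether `a` can stay virtualizable in this form is untested here:

```python
# Hypothetical revision of the pastie's driver: `length` is the only
# green, so the JIT compiles at most one trace per array size.
# Everything that varies per call (index, accum, array, ...) moves
# to the reds.
complex_dft_jitdriver = JitDriver(
    greens = ['length'],
    reds = ['index', 'accum', 'array', 'k', 'a', 'b', 'J'],
    virtualizables = ['a'],
)
```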
Hi

On Thu, Aug 19, 2010 at 8:48 AM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
On Thu, Aug 19, 2010 at 03:25, Hart's Antler <bhartsho@yahoo.com> wrote:
I am starting to learn how to use the JIT, and I'm confused about why my function gets slower over time: twice as slow after running for a few minutes. Using a virtualizable did speed up my code, but it still has the degrading-performance problem. I have yesterday's SVN and am using 64-bit with Boehm. I understand Boehm is slower, but overall my JIT'ed function is many times slower than un-JIT'ed; is this expected behavior from Boehm?

Code is here: http://pastebin.com/9VGJHpNa
I think this has nothing to do with Boehm.
I don't think so either. <snip>
Moreover, I'm not sure you need to use the JIT yourself:

- Your code is RPython, so you could just as well translate it without JIT annotations, and it will be compiled to C code.
- Otherwise, you could write it as an app-level function, i.e. in normal Python, and pass it to a translated PyPy-JIT interpreter.

Did you try and benchmark the code? Can I ask why you did not write it as an app-level function, i.e. as normal Python code, to use PyPy's JIT directly, without needing a detailed understanding of the JIT? It would be interesting to see a comparison (and have it on the web, after some code review).
The JIT essentially speeds things up by constant folding based on the bytecode. The bytecode should be the only green variable here, and all others (that you don't want to specialize over) should be red and not promoted. In your case it's very likely you compile a new loop very often (overspecialization).
In particular, I'm not sure that, as currently written, you're getting any speedup, and I seriously wonder whether the JIT could give an additional speedup over RPython here (the regexp interpreter is a completely different case, since it compiles a regexp, but why do you compile an array?).
That's silly: our Python interpreter is an RPython program. Anything that has a meaningfully defined "bytecode" or a "compile-time constant" can be sped up by the JIT. For example, a templating language.
I think just raw CPython can be 340x slower than C (I assume NumPy uses C)
You should check more and make fewer assumptions.
So, every time you change the value of a green variable, the JIT has to recompile the function. Note that actually, I think, for each new value of the variable, a given number of iterations first has to occur (1,000? 10,000? I'm not sure), and only then will the JIT spend time creating a trace and compiling it. The length of the involved arrays is maybe around that threshold, maybe smaller, so you get "all pain, and no gain".
To be precise, for each combination of green variables there have to be 1000 (by default) iterations. If there is no such thing, you'll never compile code and simply spend time on bookkeeping.

Cheers,
fijal
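A back-of-the-envelope illustration of Maciej's point. The numbers are assumptions, not measurements: 1024 green combinations and 1024 iterations each are taken from Paolo's analysis of this code below, and 1000 is the default threshold Maciej mentions:

```python
# Rough cost model of overspecialization: every distinct combination
# of green values must cross the threshold before its trace compiles.
combinations = 1024           # distinct green-variable tuples (assumed)
iters_per_combination = 1024  # loop iterations per tuple (assumed)
threshold = 1000              # default trace threshold

warmup = combinations * threshold  # 1,024,000 iterations of pure bookkeeping
jitted = combinations * (iters_per_combination - threshold)  # only 24,576 left
print(warmup, jitted)         # almost all pain, and no gain
```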
Hi Maciej, I think you totally misunderstood me, possibly because I was not clear; see below. In short, I was wondering whether the approach of the original code made any sense, and my guess was "mostly not", exactly because there is little constant folding possible in the code as it is written.

[Hart, I don't think that any O(N^2) implementation of the DFT (which is what is in the code), i.e. two nested for loops, should be written to explicitly take advantage of the JIT. I don't know about the FFT algorithm, but a few vague ideas say "yes", because constant folding the length could _maybe_ allow constant folding the permutations applied to the data in the Cooley–Tukey FFT algorithm.]

On Thu, Aug 19, 2010 at 12:11, Maciej Fijalkowski <fijall@gmail.com> wrote:
Hi <snip>
Moreover, I'm not sure you need to use the JIT yourself:

- Your code is RPython, so you could just as well translate it without JIT annotations, and it will be compiled to C code.
- Otherwise, you could write it as an app-level function, i.e. in normal Python, and pass it to a translated PyPy-JIT interpreter.

Did you try and benchmark the code? Can I ask why you did not write it as an app-level function, i.e. as normal Python code, to use PyPy's JIT directly, without needing a detailed understanding of the JIT? It would be interesting to see a comparison (and have it on the web, after some code review).
The JIT essentially speeds things up by constant folding based on the bytecode. The bytecode should be the only green variable here, and all others (that you don't want to specialize over) should be red and not promoted. In your case it's very likely you compile a new loop very often (overspecialization).
I see no bytecode in the example; it's a DFT implementation. For each combination of green variables, there are 1024 iterations, and there are 1024 such combinations, so overspecialization is almost guaranteed.

My next question, inspired by the specific code, is: is JITted code ever thrown away if too much is generated? Even for valid use cases, most JITs can generate too much code, and they then need to choose what to keep and what to throw away.
In particular, I'm not sure that, as currently written, you're getting any speedup, and I seriously wonder whether the JIT could give an additional speedup over RPython here (the regexp interpreter is a completely different case, since it compiles a regexp, but why do you compile an array?).
That's silly: our Python interpreter is an RPython program. Anything that has a meaningfully defined "bytecode" or a "compile-time constant" can be sped up by the JIT. For example, a templating language.
You misunderstood me; I totally agree with you, and my understanding is that in the given program (which I read almost fully) constant folding makes little sense.

Since that program is written with RPython + JIT but has green variables which are not at all "compile-time constants", "I wonder seriously" was meant as "I wonder seriously whether what you are trying makes any sense". As I argued, the only constant folding possible is on the array length. And again, I wonder whether it's worth it; my guess tends towards "no", but a benchmark is needed (there will probably be some improvement).

I was just a bit vaguer because I have only studied the docs on PyPy (and papers about tracing compilation). But your answer confirms that my original analysis is correct, and that I should maybe write more clearly.
I think just raw CPython can be 340x slower than C (I assume NumPy uses C)
You should check more and make fewer assumptions.
I did some checks, on PyPy's blog actually, not definitive though, and I stand by what I meant (see below). Without reading the pastie in full, however, my comments are out of context. I guess your tone is fine, since you thought I wrote nonsense. But in general, I have yet to see a guideline forbidding "IIRC" and similar ways of discussing (the above was an _educated_ guess), especially when the writer remembers correctly (as in this case). Having said that, I'm always happy to see counterexamples and learn something, if they exist. In this case, for what I actually meant (and wrote, IMHO), a counterexample would be an RPython or a JITted program >= 340x slower than C.
For the speed ratio: the code pastie says that the RPython JITted code is 340x slower than the NumPy code, and I was writing that that's unreasonable; in this case, it happens because of overspecialization caused by misuse of the JIT.

For the speed ratios among CPython, C, and RPython, I was comparing to http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.htm.... What I meant is that JITted code can't be so much slower than C.

For NumPy, I had read this: http://morepypy.blogspot.com/2009/07/pypy-numeric-experiments.html, and it mostly implies that NumPy is written in C (it actually says "NumPy's C version", but I had missed it). And for the specific microbenchmark discussed there, the performance gap between NumPy and CPython is ~100x.

Best regards
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
On Thu, Aug 19, 2010 at 1:34 PM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
Hi Maciej, I think you totally misunderstood me, possibly because I was not clear; see below. In short, I was wondering whether the approach of the original code made any sense, and my guess was "mostly not", exactly because there is little constant folding possible in the code as it is written.
That's always possible :)
[Hart, I don't think that any O(N^2) implementation of the DFT (which is what is in the code), i.e. two nested for loops, should be written to explicitly take advantage of the JIT. I don't know about the FFT algorithm, but a few vague ideas say "yes", because constant folding the length could _maybe_ allow constant folding the permutations applied to the data in the Cooley–Tukey FFT algorithm.]
On Thu, Aug 19, 2010 at 12:11, Maciej Fijalkowski <fijall@gmail.com> wrote:
Hi <snip>
Moreover, I'm not sure you need to use the JIT yourself:

- Your code is RPython, so you could just as well translate it without JIT annotations, and it will be compiled to C code.
- Otherwise, you could write it as an app-level function, i.e. in normal Python, and pass it to a translated PyPy-JIT interpreter.

Did you try and benchmark the code? Can I ask why you did not write it as an app-level function, i.e. as normal Python code, to use PyPy's JIT directly, without needing a detailed understanding of the JIT? It would be interesting to see a comparison (and have it on the web, after some code review).
The JIT essentially speeds things up by constant folding based on the bytecode. The bytecode should be the only green variable here, and all others (that you don't want to specialize over) should be red and not promoted. In your case it's very likely you compile a new loop very often (overspecialization).
I see no bytecode in the example; it's a DFT implementation. For each combination of green variables, there are 1024 iterations, and there are 1024 such combinations, so overspecialization is almost guaranteed.
Agreed.
My next question, inspired by the specific code, is: is JITted code ever thrown away if too much is generated? Even for valid use cases, most JITs can generate too much code, and they then need to choose what to keep and what to throw away.
No; as of now, never. In general, in the case of Python it would have to be a heuristic anyway (since code objects are mostly immortal and you can't decide whether a certain combination of assumptions will occur again in the future or not). We have some ideas about which code will never run any more; besides that, we need to implement some heuristics for when to throw away code.
In particular, I'm not sure that, as currently written, you're getting any speedup, and I seriously wonder whether the JIT could give an additional speedup over RPython here (the regexp interpreter is a completely different case, since it compiles a regexp, but why do you compile an array?).
That's silly: our Python interpreter is an RPython program. Anything that has a meaningfully defined "bytecode" or a "compile-time constant" can be sped up by the JIT. For example, a templating language.
You misunderstood me; I totally agree with you, and my understanding is that in the given program (which I read almost fully) constant folding makes little sense.
Great :) I might have misunderstood you.
Since that program is written with RPython + JIT but has green variables which are not at all "compile-time constants", "I wonder seriously" was meant as "I wonder seriously whether what you are trying makes any sense". As I argued, the only constant folding possible is on the array length. And again, I wonder whether it's worth it; my guess tends towards "no", but a benchmark is needed (there will probably be some improvement).
I guess the answer is "hell no", simply because if you don't constant fold, our assembler would not be nearly as good as gcc's (if nothing else).
I was just a bit vaguer because I have only studied the docs on PyPy (and papers about tracing compilation). But your answer confirms that my original analysis is correct, and that I should maybe write more clearly.
I think just raw CPython can be 340x slower than C (I assume NumPy uses C)
You should check more and make fewer assumptions.
I did some checks, on PyPy's blog actually, not definitive though, and I stand by what I meant (see below). Without reading the pastie in full, however, my comments are out of context. I guess your tone is fine, since you thought I wrote nonsense. But in general, I have yet to see a guideline forbidding "IIRC" and similar ways of discussing (the above was an _educated_ guess), especially when the writer remembers correctly (as in this case). Having said that, I'm always happy to see counterexamples and learn something, if they exist. In this case, for what I actually meant (and wrote, IMHO), a counterexample would be an RPython or a JITted program >= 340x slower than C.
My comment was merely about "numpy is written in C".
For the speed ratio: the code pastie says that the RPython JITted code is 340x slower than the NumPy code, and I was writing that that's unreasonable; in this case, it happens because of overspecialization caused by misuse of the JIT.
Yes.
For the speed ratios among CPython, C, and RPython, I was comparing to http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.htm.... What I meant is that JITted code can't be so much slower than C.
For NumPy, I had read this: http://morepypy.blogspot.com/2009/07/pypy-numeric-experiments.html, and it mostly implies that NumPy is written in C (it actually says "NumPy's C version", but I had missed it). And for the specific microbenchmark discussed there, the performance gap between NumPy and CPython is ~100x.
Yes, there is a slight difference :-) NumPy is written mostly in C (at least the glue code), but a lot of the algorithms call back to some other stuff (depending on what you have installed), which as far as I'm concerned might be whatever (most likely Fortran or SSE assembler at some level).
Best regards
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
Hi Paolo,

Thanks for your in-depth response. I tried your suggestions and noticed a big speed improvement with no more degrading performance; I didn't realize having more greens is bad. However, it still runs 4x slower than plain old compiled RPython. I checked whether the JIT was really running, and you're right: it's not actually using any JIT'ed code. It only traces and then aborts, though now I cannot figure out why it aborts, after trying several things.

I didn't write this as an app-level function because I wanted to understand how the JIT works on a deeper level and with RPython. I had seen the blog post before by Carl Friedrich Bolz about JIT'ing, where he was able to speed things up 22x over plain RPython translated to C, so that got me curious about the JIT. Now I understand that that was an exceptional case, but in what other cases might RPython+JIT be useful? And it's good to see here what speedup, if any, there will be in the worst-case scenario.

Sorry about all the confusion about NumPy being 340x faster; I should have added in that note that I compared NumPy's fast Fourier transform to an RPython direct Fourier transform, and the direct transform is known to be hundreds of times slower. (NumPy lacks a DFT to compare to.)

Updated code with only the length as green: http://pastebin.com/DnJikXze

The jitted function now checks jit.we_are_jitted(), and prints 'unjitted' if there is no jitting. "abort: trace too long" seems to happen on every trace, so we_are_jitted() is never true, and the 4x overhead compared to compiled RPython is then understandable.

trace_limit is set to its maximum, so why is it aborting? Here are my settings:

```python
jitdriver.set_param('threshold', 4)
jitdriver.set_param('trace_eagerness', 4)
jitdriver.set_param('trace_limit', sys.maxint)
jitdriver.set_param('debug', 3)
```

```
Tracing:        80      1.019871
Backend:         0      0.000000
Running asm:     0
Blackhole:      80
TOTAL:                 16.785704
ops:                    1456160
recorded ops:           1200000
calls:                    99080
guards:                  430120
opt ops:                      0
opt guards:                   0
forcings:                     0
abort: trace too long:       80
abort: compiling:             0
abort: vable escape:          0
nvirtuals:                    0
nvholes:                      0
nvreused:                     0
```

--- On Thu, 8/19/10, Maciej Fijalkowski <fijall@gmail.com> wrote: <snip>
On 20 August 2010 16:04, Hart's Antler <bhartsho@yahoo.com> wrote:
Hi Paolo,
Thanks for your in-depth response. I tried your suggestions and noticed a big speed improvement with no more degrading performance; I didn't realize having more greens is bad. However, it still runs 4x slower than plain old compiled RPython. I checked whether the JIT was really running, and you're right: it's not actually using any JIT'ed code. It only traces and then aborts, though now I cannot figure out why it aborts, after trying several things.
I didn't write this as an app-level function because I wanted to understand how the JIT works on a deeper level and with RPython. I had seen the blog post before by Carl Friedrich Bolz about JIT'ing, where he was able to speed things up 22x over plain RPython translated to C, so that got me curious about the JIT. Now I understand that that was an exceptional case, but in what other cases might RPython+JIT be useful? And it's good to see here what speedup, if any, there will be in the worst-case scenario.
Sorry about all the confusion about NumPy being 340x faster; I should have added in that note that I compared NumPy's fast Fourier transform to an RPython direct Fourier transform, and the direct transform is known to be hundreds of times slower. (NumPy lacks a DFT to compare to.)
Updated code with only the length as green: http://pastebin.com/DnJikXze
The jitted function now checks jit.we_are_jitted(), and prints 'unjitted' if there is no jitting. "abort: trace too long" seems to happen on every trace, so we_are_jitted() is never true, and the 4x overhead compared to compiled RPython is then understandable.
trace_limit is set to its maximum, so why is it aborting? Here are my settings:

```python
jitdriver.set_param('threshold', 4)
jitdriver.set_param('trace_eagerness', 4)
jitdriver.set_param('trace_limit', sys.maxint)
jitdriver.set_param('debug', 3)
```
```
Tracing:        80      1.019871
Backend:         0      0.000000
Running asm:     0
Blackhole:      80
TOTAL:                 16.785704
ops:                    1456160
recorded ops:           1200000
calls:                    99080
guards:                  430120
opt ops:                      0
opt guards:                   0
forcings:                     0
abort: trace too long:       80
abort: compiling:             0
abort: vable escape:          0
nvirtuals:                    0
nvholes:                      0
nvreused:                     0
```
This application probably isn't a very good use for the JIT, because it has very little control flow. It may unroll the loop, but you're probably not gaining anything there. As long as the methods get inlined (as there is no polymorphic dispatch here that I can see), the JIT can't improve on this much. What optimisations do you expect it to make?

-- 
William Leslie
Hi Hart,

On Thu, Aug 19, 2010 at 11:04:48PM -0700, Hart's Antler wrote:
I had seen the blog post before by Carl Friedrich Bolz about JIT'ing, where he was able to speed things up 22x over plain RPython translated to C, so that got me curious about the JIT.
You cannot expect any program to get 22x faster with RPython+JIT than it is with just RPython. That would be like saying that any C program can get 22x faster if we apply some special JIT to it. For a general C program, such a statement makes no sense: no JIT can help.

The PyPy JIT can help *only* if the RPython program in question is some kind of interpreter, with a loose definition of "interpreter". That's why we can apply the PyPy JIT to the Python interpreter written in RPython, or to some other examples, like Carl Friedrich's blog post about the regular-expression "interpreter".

A bientot,
Armin.
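For concreteness, here is a toy sketch (not from the thread) of the kind of RPython program Armin means: a tiny bytecode interpreter whose program position and program text are green. The opcode set is invented for illustration, and the `pypy.rlib.jit` import path is the 2010-era one assumed from this thread:

```python
from pypy.rlib.jit import JitDriver

# `pc` and `bytecode` are green: for a fixed program position the JIT
# can constant-fold the dispatch below into straight-line code.
jitdriver = JitDriver(greens=['pc', 'bytecode'], reds=['acc'])

def interpret(bytecode, acc):
    pc = 0
    while True:
        jitdriver.jit_merge_point(pc=pc, bytecode=bytecode, acc=acc)
        if pc >= len(bytecode):
            return acc
        op = bytecode[pc]
        if op == '-':                # decrement the accumulator
            acc -= 1
            pc += 1
        elif op == ']' and acc > 0:  # backward jump: the loop the JIT traces
            pc = 0
            jitdriver.can_enter_jit(pc=pc, bytecode=bytecode, acc=acc)
        else:
            pc += 1

# interpret("-]", 1000) counts acc down to 0; after enough backward
# jumps, the JIT compiles the "-]" loop into machine code.
```

Note that `can_enter_jit` here is immediately followed by `jit_merge_point` at the top of the `while True` loop, with no loop condition in between, matching the advice in Armin's next message.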
Hi Hart,

On Wed, Aug 18, 2010 at 06:25:02PM -0700, Hart's Antler wrote:
I am starting to learn how to use the JIT, and I'm confused about why my function gets slower over time: twice as slow after running for a few minutes. Using a virtualizable did speed up my code, but it still has the degrading-performance problem. I have yesterday's SVN and am using 64-bit with Boehm. I understand Boehm is slower, but overall my JIT'ed function is many times slower than un-JIT'ed; is this expected behavior from Boehm?
It seems that there are still issues with the 64-bit JIT. It could be something along the lines of "the guards are not correctly overwritten", or likely something more subtle in that direction, causing more and more assembler to be produced. We have observed "infinite"-looking memory usage for long-running programs, too.

Note that in the example you posted, you are making the common mistake of putting some code (the looping condition) between can_enter_jit and jit_merge_point. We should really do something about checking that people don't do that. It mostly works, except in some cases where it doesn't :-( The issue is, more precisely:

```python
while x < y:
    my_jit_driver.jit_merge_point(...)
    # ...loop body...
    my_jit_driver.can_enter_jit(...)
```

In this case, the "x < y" is evaluated between can_enter_jit and jit_merge_point, and that's the mistake. You should rewrite your example as:

```python
while x < y:
    my_jit_driver.can_enter_jit(...)
    my_jit_driver.jit_merge_point(...)
    # ...loop body...
```

A bientot,
Armin.
participants (5)

- Armin Rigo
- Hart's Antler
- Maciej Fijalkowski
- Paolo Giarrusso
- William Leslie