Hi Paolo, thanks for your in-depth response. I tried your suggestions and noticed a big speed improvement with no more degrading performance; I didn't realize that having more greens is bad. However, it still runs 4x slower than plain old compiled RPython. I checked whether the JIT was really running, and you're right, it's not actually using any JIT'ed code: it only traces and then aborts, though now, after trying several things, I cannot figure out why it aborts.

I didn't write this as an app-level function because I wanted to understand how the JIT works on a deeper level, and with RPython. I had seen Carl Friedrich Bolz's blog post about JIT'ing before, where he was able to speed things up 22x over plain RPython translated to C, and that got me curious about the JIT. Now I understand that that was an exceptional case, but in what other cases might RPython+JIT be useful? And it's good to see here what speedup, if any, there will be in the worst-case scenario.

Sorry about all the confusion about NumPy being 340x faster; I should have added in that note that I compared NumPy's fast Fourier transform to an RPython direct Fourier transform, and the direct transform is known to be hundreds of times slower. (NumPy lacks a DFT to compare to.)

Updated code with only the length as green: http://pastebin.com/DnJikXze

The jitted function now checks jit.we_are_jitted() and prints 'unjitted' if there is no jitting. "abort: trace too long" seems to happen on every trace, so we_are_jitted() is never true, and the 4x overhead compared to compiled RPython is then understandable. trace_limit is set to its maximum, so why is it aborting? Here are my settings:

jitdriver.set_param('threshold', 4)
jitdriver.set_param('trace_eagerness', 4)
jitdriver.set_param('trace_limit', sys.maxint)
jitdriver.set_param('debug', 3)

Tracing:                80      1.019871
Backend:                0       0.000000
Running asm:            0
Blackhole:              80
TOTAL:                          16.785704
ops:                            1456160
recorded ops:                   1200000
  calls:                        99080
guards:                         430120
opt ops:                        0
opt guards:                     0
forcings:                       0
abort: trace too long:          80
abort: compiling:               0
abort: vable escape:            0
nvirtuals:                      0
nvholes:                        0
nvreused:                       0

--- On Thu, 8/19/10, Maciej Fijalkowski <fijall@gmail.com> wrote:
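(A minimal sketch of the setup described above, assuming a direct DFT with only the length as a green variable: this is not the pastebin code, the names and loop shape are made up for illustration, the we_are_jitted() check is omitted, and the import path is the pypy.rlib.jit of that era; later RPython uses rpython.rlib.jit.)

import math
import sys
from pypy.rlib.jit import JitDriver   # rpython.rlib.jit in later releases

# Only the input length is green; every other variable live across the
# merge point has to be listed as red.
jitdriver = JitDriver(greens=['length'],
                      reds=['k', 'n', 're', 'im', 'data', 'out_re', 'out_im'])

def dft(data):
    length = len(data)
    out_re = [0.0 for i in range(length)]
    out_im = [0.0 for i in range(length)]
    k = 0
    while k < length:
        re = 0.0
        im = 0.0
        n = 0
        while n < length:
            jitdriver.jit_merge_point(length=length, k=k, n=n, re=re, im=im,
                                      data=data, out_re=out_re, out_im=out_im)
            angle = -2.0 * math.pi * k * n / length
            re += data[n] * math.cos(angle)
            im += data[n] * math.sin(angle)
            n += 1
        out_re[k] = re
        out_im[k] = im
        k += 1
    return out_re, out_im

def entry_point(argv):
    # The parameters quoted above, set before the hot loop runs.
    jitdriver.set_param('threshold', 4)
    jitdriver.set_param('trace_eagerness', 4)
    jitdriver.set_param('trace_limit', sys.maxint)
    jitdriver.set_param('debug', 3)
    data = [float(i % 7) for i in range(1024)]
    dft(data)
    return 0

(Whether the merge point sits in the inner loop, as in this sketch, or only in the outer one changes how much a single trace has to cover; a merge point only in the outer loop forces the tracer to unroll every inner iteration into one trace, which is one plausible source of the "abort: trace too long" counts quoted above.)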
Hi Maciej, I think you totally misunderstood me, possibly because I was not clear, see below. In short, I was wondering whether the approach of the original code made any sense, and my guess was "mostly not", exactly because there is little constant folding possible in the code, as it is written.

From: Maciej Fijalkowski <fijall@gmail.com>
Subject: Re: [pypy-dev] JIT'ed function performance degrades
To: "Paolo Giarrusso" <p.giarrusso@gmail.com>
Cc: "Hart's Antler" <bhartsho@yahoo.com>, pypy-dev@codespeak.net
Date: Thursday, 19 August, 2010, 4:55 AM

On Thu, Aug 19, 2010 at 1:34 PM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
That's always possible :)
[Hart, I don't think that any O(N^2) implementation of the DFT (what is in the code), i.e. two nested for loops, should be written to explicitly take advantage of the JIT. I don't know about the FFT algorithm, but a few vague ideas say "yes", because constant folding the length could _maybe_ allow constant folding the permutations applied to data in the Cooley–Tukey FFT algorithm.]
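(To make the bracketed FFT remark concrete: in an iterative radix-2 Cooley–Tukey FFT, the input reordering is a pure function of the length, so with the length green it is at least a candidate for constant folding, unlike the per-element butterflies. The helper below is a generic plain-Python illustration, not code from the thread.)

def bit_reverse_permutation(length):
    # The input reordering of an iterative radix-2 FFT: it depends only on the
    # (power-of-two) length, not on the data being transformed.
    bits = 0
    n = length
    while n > 1:
        n >>= 1
        bits += 1
    perm = []
    for i in range(length):
        rev = 0
        x = i
        for _ in range(bits):
            rev = (rev << 1) | (x & 1)
            x >>= 1
        perm.append(rev)
    return perm

# bit_reverse_permutation(8) -> [0, 4, 2, 6, 1, 5, 3, 7]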
On Thu, Aug 19, 2010 at 12:11, Maciej Fijalkowski <fijall@gmail.com> wrote:
Hi <snip>
Moreover, I'm not sure you need to use the JIT yourself.
- Your code is RPython, so you could as well just translate it without JIT annotations, and it will be compiled to C code.
- Otherwise, you could write that as an app-level function, i.e. in normal Python, and pass it to a translated PyPy-JIT interpreter.
Did you try to benchmark the code? Can I ask you why you did not write that as an app-level function, i.e. as normal Python code, to use PyPy's JIT directly, without needing a detailed understanding of the JIT? It would be interesting to see a comparison (and have it on the web, after some code review).
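(As a concrete illustration of the app-level route: the same two nested loops written as ordinary Python, with no JitDriver and no hints, run directly under a translated pypy-c. This is a sketch under the assumption that the benchmark is a direct DFT over a list of floats, not the actual pastebin code.)

import math

def dft(data):
    # The same O(N^2) direct transform as plain app-level Python:
    # no JitDriver and no hints; a translated pypy-c traces it as-is.
    length = len(data)
    out = []
    for k in range(length):
        re = 0.0
        im = 0.0
        for n in range(length):
            angle = -2.0 * math.pi * k * n / length
            re += data[n] * math.cos(angle)
            im += data[n] * math.sin(angle)
        out.append((re, im))
    return out

(Timing this under CPython, under a translated pypy-c with the JIT, and against the RPython-translated version would give the comparison asked for above.)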
The JIT can essentially speed things up by constant folding based on the bytecode. The bytecode should be the only green variable here, and all the others (that you don't want to specialize over) should be red and not promoted. In your case it's very likely that you compile a new loop very often (overspecialization).
I see no bytecode in the example - it's a DFT implementation. For each combination of green variables, there are 1024 iterations, and there are 1024 such combinations, so overspecialization is almost guaranteed.
Agreed.
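(To illustrate what "bytecode should be the only green variable" means: in the usual toy-interpreter shape below, a generic sketch rather than PyPy's real driver, the greens identify a position in a program, so the number of compiled traces is bounded by the program's size, whereas greens drawn from the data, as in the DFT, multiply with every new value combination.)

from pypy.rlib.jit import JitDriver   # rpython.rlib.jit in later releases

# Greens say "where we are in the program"; reds carry the changing state.
jitdriver = JitDriver(greens=['pc', 'bytecode'], reds=['acc'])

def interpret(bytecode, acc):
    # bytecode is a string of one-character opcodes:
    #   'i' increments, 'd' decrements, 'j' jumps back to the start while acc > 0
    pc = 0
    while pc < len(bytecode):
        jitdriver.jit_merge_point(pc=pc, bytecode=bytecode, acc=acc)
        op = bytecode[pc]
        if op == 'i':
            acc += 1
        elif op == 'd':
            acc -= 1
        elif op == 'j' and acc > 0:
            pc = 0
            continue
        pc += 1
    return acc

# interpret("dj", 1000) executes two opcodes a thousand times, yet the JIT needs
# at most one trace per (pc, bytecode) pair, not one per value of acc.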
My next question, inspired by the specific code, is: is JITted code ever thrown away, if too much is generated? Even for valid use cases, most JITs can generate too much code, and they then need to choose what to keep and what to throw away.
No, as of now, never. In general, in the case of Python, it would have to be a heuristic anyway (since code objects are mostly immortal and you can't decide whether a certain combination of assumptions will occur in the future or not). We have some ideas about which code will never run any more; besides that, we need to implement some heuristics for when to throw away code.
Especially, I'm not sure that as currently written you're getting any speedup, and I seriously wonder whether the JIT could give an additional speedup over RPython here (the regexp interpreter is a completely different case, since it compiles a regexp, but why do you compile an array?).

That's silly, our Python interpreter is an RPython program. Anything that can have a meaningfully defined "bytecode" or a "compile time constant" can be sped up by the JIT. For example a templating language.
You misunderstood me, I totally agree with you, and my understanding is that in the given program (which I read almost fully) constant folding makes little sense.
Great :) I might have misunderstood you.
Since that program is written with RPython + JIT, but it has green variables which are not at all "compile time constants", "I wonder seriously" was meant as "I wonder seriously whether what you are trying makes any sense". As I argued, the only constant folding possible is for the array length. And again, I wonder whether it's worth it; my guess tends towards "no", but a benchmark is needed (there will probably be some improvement).
I guess the answer is "hell no", simply because if you don't constant fold, our assembler would not be nearly as good as gcc's (if nothing else).
I was just a bit vaguer because I have only studied docs on PyPy (and papers about tracing compilation). But your answer confirms that my original analysis is correct, and that maybe I should write more clearly.

I think just raw CPython can be 340x slower than C (I assume NumPy uses C).

You should check more and have fewer assumptions.

I did some checks, on PyPy's blog actually, not definitive though, and I stand by what I meant (see below). Without reading the pastie in full, however, my comments are out of context. I guess your tone is fine, since you thought I wrote nonsense. But in general, I have yet to see a guideline forbidding "IIRC" and similar ways of discussing (the above was an _educated_ guess), especially when the writer remembers correctly (as in this case). Having said that, I'm always happy to see counterexamples and learn something, if they exist. In this case, for what I actually meant (and wrote, IMHO), a counterexample would be an RPython or a JITted program >= 340x slower than C.
My comment was merely about "numpy is written in C".
For the speed ratio, the code pastie writes that
RPython JITted code
is 340x slower than NumPy code, and I was writing that it's unreasonable; in this case, it happens because of overspecialization caused by misuse of the JIT.
Yes.
For speed ratios among CPython, C, RPython, I was
comparing to
http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.htm.... What I meant is that JITted code can't be so much slower than C.
For NumPy, I had read this: http://morepypy.blogspot.com/2009/07/pypy-numeric-experiments.html, and it mostly implies that NumPy is written in C (it actually says "NumPy's C version", but I missed it). And for the specific discussed microbenchmark, the performance gap between NumPy and CPython is ~100x.
Yes, there is a slight difference :-) NumPy is written mostly in C (at least the glue code), but a lot of algorithms call back to some other stuff (depending on what you have installed), which as far as I'm concerned might be whatever (most likely Fortran or SSE assembler at some level).
Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/