
Hi Manuel,

We've been suitably impressed by the results on the new llvm backend during the sprint (well, or suitably un-impressed by both gcc and clang's failure to reconstruct the SSA meaning of the C code).

The current issue seems to be debugging. It would be nice if gdb presented at least source ".ll" code rather than just the assembler instructions --- actually, it would be more than nice: debugging at the level of assembler is a no-no.

I was thinking: would it make sense to emit from the translation toolchain some files that are not in ".ll" format, but that are more or less a straightforward text file representation of the flow graphs and constants, and then have a separate tool that converts these files (seen as source) into .ll files?

There are some advantages of doing it this way, even if it looks like Yet Another intermediate step. The first advantage is that it would give some inspectable files, which we could also request during testing for cases where the pygame flow graph inspector doesn't really work (e.g. too many graphs). Another advantage is that we might refactor the C backend to also be a separate tool that inputs these flowgraph text files, minimizing the amount of duplicate work between the C and the LLVM backends. And of course, the point is that the flowgraph-to-".ll" conversion would insert file line numbers, so that we can debug it in gdb seeing the flowgraph source lines. (I can even think about hacks to do the same even if we go via C...)

Of course the drawback is that it's some non-trivial refactoring. Does it make sense?

A bientôt, Armin.
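For readers of the archive, here is a minimal sketch of what the "via C" hack could look like, using the standard C `#line` directive so that gdb reports positions in the flowgraph text file instead of in the generated C. Everything about the input is an assumption made for illustration: the one-operation-per-line dump format, the file name `graph.txt`, and the helper names `translate_op` and `emit_c_with_line_markers` are hypothetical and not part of PyPy. Only three binary integer operations are handled.

```python
import re

# Hypothetical dump format: one operation per line, e.g. "v2 = int_add(v0, v1)".
OP_RE = re.compile(r"(\w+) = (\w+)\((\w+), (\w+)\)")

# Toy subset of binary integer operations and the C operator each maps to.
C_OPERATORS = {"int_add": "+", "int_sub": "-", "int_mul": "*"}


def translate_op(op):
    """Turn one dumped operation into a C statement (toy subset only)."""
    match = OP_RE.fullmatch(op)
    if match is None or match.group(2) not in C_OPERATORS:
        raise ValueError("unsupported operation: %r" % op)
    result, name, left, right = match.groups()
    return "%s = %s %s %s;" % (result, left, C_OPERATORS[name], right)


def emit_c_with_line_markers(flowgraph_text, source_name):
    """Emit C where every statement is preceded by a #line directive
    pointing back at the line of the flowgraph dump it came from."""
    out = []
    for lineno, raw in enumerate(flowgraph_text.splitlines(), start=1):
        op = raw.strip()
        if not op:
            continue  # blank lines carry no operation
        out.append('#line %d "%s"' % (lineno, source_name))
        out.append(translate_op(op))
    return "\n".join(out)


if __name__ == "__main__":
    dump = "v2 = int_add(v0, v1)\n\nv3 = int_mul(v2, v2)\n"
    print(emit_c_with_line_markers(dump, "graph.txt"))
```

The same bookkeeping (remember the dump line each operation came from) is what a flowgraph-to-".ll" converter would need in order to attach line information; the LLVM side is not sketched here.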

Hi again, On Sun, Sep 8, 2013 at 9:42 AM, Armin Rigo <arigo@tunes.org> wrote:
I have investigated a bit more and it's quite unclear that this would be the source of the difference. It seems that the "-flto" option of gcc, enabling link-time optimization, actually gives very good improvements over the same compilation without this option --- some 11-14%, more so than, say, the typical 5% reported with CPython. If I had to guess, I'd say it is because of the particularly disorganized kind of C code produced by RPython.

About the llvm backend, one detail hints that it might be the reason for the speed improvement: the fact that the current llvm backend produces most of the source code in a single file. This may be what gives llvm extra room for improvements. This is precisely the same room for improvement that "-flto" also gives gcc, considering that we generate many C files with never-"static" functions.

I tried to compile a no-jit version of PyPy from the llvm-translation-backend branch, for comparison, but this fails right now with "NotImplementedError: v585190 = debug_offset()". It successfully compiles targetrpystonedalone (in -O2 mode), though. I get the following results (with the argument "100000000"):

    plain gcc 4.7.3:  1.95 seconds
    llvm 3.3:         1.75 seconds
    gcc with -flto:   1.66 seconds

If we get similar results on the whole PyPy, then I fear the llvm backend is going back to where it already went to several times: "not useful enough". We can simply add the -flto flag to the generated Makefiles.

Manuel, do you feel like trying to compare? I'm modifying the Makefile manually as follows:

    CFLAGS = ...... -flto -fno-fat-lto-objects
    LDFLAGS = ..... -flto=8 -O3

A bientôt, Armin.
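To put the three rpystone timings above on a common scale, the arithmetic can be written out as a short script. It uses only the numbers quoted in the message; the function name `reduction_vs_baseline` is made up for this illustration, and a single benchmark run says nothing about measurement noise.

```python
# Timings quoted in the message above (targetrpystonedalone, argument 100000000).
timings = {
    "plain gcc 4.7.3": 1.95,
    "llvm 3.3": 1.75,
    "gcc with -flto": 1.66,
}


def reduction_vs_baseline(timings, baseline):
    """Return the percentage of run time saved relative to the baseline entry."""
    base = timings[baseline]
    return {name: 100.0 * (base - t) / base for name, t in timings.items()}


if __name__ == "__main__":
    for name, pct in reduction_vs_baseline(timings, "plain gcc 4.7.3").items():
        print("%-16s %5.1f%% less time" % (name, pct))
```

On these figures the llvm build takes roughly 10% less time than plain gcc, and gcc with -flto roughly 15% less.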

LLVM also has link-time optimization. Is it on by default in LLVM, or do we need to benchmark with it enabled explicitly?

Alex

On Sun, Sep 8, 2013 at 8:17 AM, Armin Rigo <arigo@tunes.org> wrote:

Hi Alex, On Sun, Sep 8, 2013 at 5:33 PM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
LLVM also has link-time optimization. Is it on by default in LLVM, or do we need to benchmark with it enabled explicitly?
The point I made in my mail was that the llvm backend is written in a way that makes link-time optimizations unnecessary. We could also not rely on "-flto" and instead write a single big .c file with the word "static" added everywhere. A bientôt, Armin.

Hi, I am missing some background information to follow what is being discussed here, so... What is the PyPy speed difference after using gcc versus llvm for the compilation of the PyPy-c backend? Would generating .ll instead of .c files really give any benefit? More interesting would still be using llvm as a PyPy-jit-backend. Is there anything new in the llvm world that would make this feasible? There used to be various issues with our previous attempts of using llvm, as we know all too clearly.

Eric

On 8 Sep 2013 at 17:42, Armin Rigo <arigo@tunes.org> wrote:

Hi Eric, On Sun, Sep 8, 2013 at 7:00 PM, Eric van Riet Paap <ericvrp@gmail.com> wrote:
What is the PyPy speed difference after using gcc versus llvm for the compilation of the PyPy-c backend?
Currently, it seems that using the LLVM IR static translation backend of PyPy gives higher performance. We're still trying to figure out why. I'm quite unsure that it's solely because LLVM is better [citation needed]. In particular it's strange because, at the same time, generating .c files and compiling with clang is worse than compiling with GCC. That's why I currently think the performance difference can be fully attributed to details in the two backends.

If anything, it seems that someone motivated could extract some critical information from comparing the optimized llvm code produced by clang and by Manuel's .ll backend. Depending on what he finds, he can then fix our C backend to reduce the difference --- and then the C files, compiled by GCC, might be correspondingly faster as well. Alternatively, we should also try to play with the GCC options pointed to by David.

On a higher-level note, LLVM still has nothing concrete enough to give us for the topics of (1) root stack scanning and (2) tracing JIT. These two areas might get traction if someone is really motivated to go into LLVM development land. So far there has been no progress that I know of for several years.

This e-mail was written with my long-term experience of 4 or 5 failed attempts at using LLVM :-) If anyone is offended by the negativity of it, feel free to prove me wrong with some backing (I know that LLVM has progressed a lot). Manuel came up with an unexpected performance difference between clang and direct generation of equivalent LLVM IR. That's concrete enough, but until someone can really explain it, I fear that we won't really progress.

A bientôt, Armin.

On Sun, Sep 8, 2013 at 5:42 PM, Armin Rigo <arigo@tunes.org> wrote:
One C file sounds bad, but we can add -flto and add the word "static" a bit everywhere too (I don't think we care for non-exported symbols at all). To be honest, a separate intermediate file is a very good idea (tm), for various reasons, like it would be trivial to parallelize the C-generation-from-something step. If we can make the low-level graphs a file format, we can even kinda-parallelize other steps, like JIT or GC.

Cheers, fijal

On Sun, Sep 8, 2013 at 11:17 AM, Armin Rigo <arigo@tunes.org> wrote:
The type of machine-generated code produced by PyPy is difficult for compilers to optimize (lots of seemingly unstructured gotos, state machines, unusual basic block heuristics) when presented in a high-level language like C. The distribution of the source code across a large number of source files also complicates the optimization process. GCC and LLVM link-time optimization can overcome some of these problems by allowing the compiler to "see" more of the program and optimize across the source files. Directly generating LLVM IR accomplishes a similar benefit. With some of the recent changes to GCC, one could also directly generate GCC IR.

LLVM makes it very convenient to directly input the IR and take advantage of optimization opportunities allowed by such an input method, but the performance benefit is not likely due to other differences in optimization pipelines and code generation capabilities.

In addition to the GCC -flto option, you should consider if -fwhole-program also is appropriate (I believe that it is).

GCC has additional optimizations that can help with the style of code generated by programs like PyPy. PyPy does not generate code with computed gotos, but its aggressive use of gotos is different from normal user-written code and can probably benefit from non-default compiler optimization heuristics. There is no obvious recommendation, but experiments with enabling / disabling some forms of GCSE (-fgcse, -fgcse-lm, -fgcse-sm, -fgcse-las, -fgcse-after-reload) as well as some of the parameters (crossjumping, goto-duplication, inlining limits) might benefit PyPy.

One can achieve performance gains with either compiler through adjustments to the generated code and the compiler optimization heuristics.

Thanks, David
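As a starting point for the experiments suggested above, the on/off combinations of the five GCSE options can be enumerated mechanically. This is only a sketch: the helper names `negate` and `flag_combinations` are made up here, the `-fno-...` spelling relies on GCC's usual convention for negating `-f` options, and actually rebuilding and timing PyPy for each combination is left out.

```python
from itertools import product

# GCSE-related GCC options named in the message above.
GCSE_FLAGS = ["-fgcse", "-fgcse-lm", "-fgcse-sm", "-fgcse-las", "-fgcse-after-reload"]


def negate(flag):
    """Return the -fno-... form of a -f... option."""
    return "-fno-" + flag[len("-f"):]


def flag_combinations(flags):
    """Yield every on/off combination of the given flags as a list of options."""
    for choices in product([True, False], repeat=len(flags)):
        yield [f if on else negate(f) for f, on in zip(flags, choices)]


if __name__ == "__main__":
    for combo in flag_combinations(GCSE_FLAGS):
        print(" ".join(combo))
```

Five flags give 32 combinations, which is already a lot of full translations; in practice one would probably toggle one flag at a time against a fixed baseline first.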

participants (5)
- Alex Gaynor
- Armin Rigo
- David Edelsohn
- Eric van Riet Paap
- Maciej Fijalkowski