An idea about automatic parallelization in PyPy/RPython

Hi everyone,

I'm a master's student in Japan and I want to do some research on PyPy/RPython. I have read some papers about PyPy and I have some ideas about it. I contacted Mr. Bolz and he advised me to send my question here.

I wonder if it is possible to automatically parallelize the traces generated by the JIT: check whether a hot loop is a parallel loop, and if so, run the trace in parallel on a multi-core CPU or a GPU to make it faster. I think this may be suitable because:

1. The trace-based JIT targets loops, which map directly onto parallel computation.
2. There is no control flow in a trace, which suits the fragment programs on a GPU.
3. We may use the @elidable hints in the interpreter code, since elidable functions are insensitive to execution order and can therefore be executed in parallel.

What do you think about it?

Best Regards,
Huang Ruochen
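[Editorial note: a minimal sketch of the @elidable hint mentioned in point 3, in RPython style. The decorator itself comes from rpython.rlib.jit; the Str class and its hash caching are made up for illustration.]

    from rpython.rlib.jit import elidable

    class Str(object):
        def __init__(self, chars):
            self.chars = chars
            self._hash = 0

        @elidable
        def gethash(self):
            # Elidable: the result depends only on the receiver, so the
            # JIT is free to drop, duplicate or reorder calls; that same
            # order-insensitivity is what makes such calls candidates
            # for parallel execution.
            if self._hash == 0:
                h = 0
                for c in self.chars:
                    h = (h * 31 + ord(c)) & 0xffffffff
                self._hash = h
            return self._hash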

Hi 黄若尘,

This is generally a hard problem on which projects like GCC or LLVM haven't gotten very far. The state of the art is slightly more advanced with PyPy's JIT, but not by much. You can do it for simple loops, but the applications are limited outside pure numerics (e.g. numpy), and doing SSE vectorization on such loops first seems like both a good starting point and a small enough project for a master's thesis.

Cheers,
fijal

On Tue, Nov 18, 2014 at 3:46 AM, 黄若尘 <hrc706@gmail.com> wrote:

Hi Fijalkowski,

Thank you very much for your reply. Yes, you are right, it's too hard for me to implement automatic parallelization for the whole of PyPy's tracing JIT. Maybe I can first do some work with a very simple interpreter (for example, the example interpreter from the PyPy documentation) and try to change some behavior of the RPython JIT.

By the way, could you tell me how I can get the traces and process them before they are compiled to native code? I want to try converting some of the traces to OpenCL kernel code and running them on other devices such as a GPU.

Best Regards,
Huang Ruochen

You get traces by running PYPYLOG=jit-log-opt,jit-backend:<filename> pypy .... There is a tool called jitviewer for viewing those traces. Note that OpenCL kernels are essentially written in a dialect of C; the kernel itself would not contain any Python.

On Fri, Nov 21, 2014 at 3:17 AM, 黄若尘 <hrc706@gmail.com> wrote:
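[Editorial note: concretely, with a placeholder script and log file name; the exact jitviewer invocation varies between versions, so treat the second line as an assumption and check the jitviewer README.]

    PYPYLOG=jit-log-opt,jit-backend:trace.log pypy myscript.py
    jitviewer.py trace.log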

Yes, I actually knew about this way to get traces. What I mean is that I want to handle those traces at RUNTIME. I want to insert some code into RPython's JIT that detects traces which can be executed in parallel; if a trace can, it is COMPILED into OpenCL code (then into native code, and run), and if not, it is compiled to normal native code as RPython does now. So the first important step is probably to intercept the normal compilation and analyze the trace. I'm sorry for my bad English and poor phrasing.
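[Editorial note: a self-contained toy model of the dispatch being described. The trace encoding and every helper below are invented for illustration and look nothing like RPython's real ResOperation objects.]

    # Toy trace for the loop body c[i] = a[i] + b[i],
    # with 'i' the induction variable.
    TRACE = [('load',  ('a', 'i')),
             ('load',  ('b', 'i')),
             ('add',   ()),
             ('store', ('c', 'i'))]

    def is_parallelizable(trace):
        # Conservative check: every store must target the current
        # induction index, and nothing with unknown side effects may
        # appear; anything else could carry a cross-iteration dependency.
        for opname, args in trace:
            if opname == 'store' and args[1] != 'i':
                return False
            if opname in ('call', 'guard_exception'):
                return False
        return True

    def compile_loop(trace):
        if is_parallelizable(trace):
            return 'opencl'    # would emit and compile a kernel here
        return 'native'        # the path RPython already takes

    assert compile_loop(TRACE) == 'opencl'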

Hi,

I would suggest a different approach, more similar to Armin's idea of parallelization. You could just optimistically assume that the loop is parallelizable: execute a few steps at once (each in its own memory sandbox) and check for conflicts afterwards. This also plays nicely with STM. The general solution would look like this:

1. Split the loop into individual step invocations.
2. Reserve a tiny memory block for each loop step.
3. Optional: compile the loop step into OpenCL.
4. Execute every loop step in parallel, saving the changes made by each invocation to its individual memory block.
5. Check whether the changes conflict.
6. If not, merge them and commit to the global memory.
7. If they do, fall back to serial loop execution.

I think all the building blocks are here, in particular the recent GIL-removal work and Armin's STM research. It sounds exciting.

Cheers,
haael

From: "黄若尘" <hrc706@gmail.com>
To: "Maciej Fijalkowski" <fijall@gmail.com>
Sent: 7:49, Friday, 2014-11-21
Subject: Re: [pypy-dev] An idea about automatic parallelization in PyPy/RPython
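[Editorial note: haael's steps 2 and 4-7 can be modeled in a few lines of plain Python. Everything here is a toy: the sandboxes are dicts, the loop step is hard-coded, execution is not actually threaded, and a real version would also track reads, not just writes.]

    def speculative_loop(step, indices, mem):
        # Steps 2 and 4: run each iteration against a private write
        # buffer, reading from the unmodified global memory.
        buffers = [dict() for _ in indices]
        for i, buf in zip(indices, buffers):
            step(i, mem, buf)
        # Step 5: write-write conflict check (a real implementation
        # would also detect read-after-write conflicts).
        written = set()
        for buf in buffers:
            if written & set(buf):
                # Step 7: conflict, so discard the buffers and redo
                # everything serially against global memory.
                for i in indices:
                    step(i, mem, mem)
                return
            written |= set(buf)
        # Step 6: no conflicts, merge and commit.
        for buf in buffers:
            mem.update(buf)

    def step(i, read_mem, write_mem):
        # The loop body c[i] = a[i] + b[i].
        write_mem['c', i] = read_mem['a', i] + read_mem['b', i]

    mem = {('a', i): i for i in range(4)}
    mem.update({('b', i): 10 * i for i in range(4)})
    speculative_loop(step, range(4), mem)
    assert mem[('c', 2)] == 22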

Hi Haael, hi 黄若尘, On 21 November 2014 10:55, <haael@interia.pl> wrote:
I would suggest a different approach, more similar to Armin's idea of parallelization.
You could just optimistically assume that the loop is parallelizable. Just execute few steps at once (each in its own memory sandbox) and check for conflicts later. This also plays nice with STM.
I thought about that too, but the granularity is very wrong for STM: the overhead of running tiny transactions would completely dwarf any potential speed gains. For tiny transactions, HTM might be more suitable, though I have no idea if HTM will ever start appearing on GPUs. Moreover, you still face the general hard problems of automatic parallelization, such as communicating the progress made between threads; unless this is carefully done on a case-by-case basis by a human, it often adds (again) considerable overhead.

To 黄若尘: here's a quick answer to your question. It's not very clean, but I would patch rpython/jit/backend/x86/regalloc.py, in prepare_loop(), just after it calls _prepare(). At that point it has a list of rewritten operations ready to be turned into assembler. You'd need to check there whether the loop contains only operations you support, and if so, produce some different code (possibly for the GPU). Then either abort the job by raising some exception, or, if it makes sense, change the 'operations' list so that it becomes just a few assembler instructions that start and stop the GPU code.

My own two cents about this project: it's relatively easy to support a few special cases, but it quickly becomes very, very hard to support more general code. You are likely to end up with a system that compiles to the GPU only a few very specific templates and nothing else. The end result is obscure for the user, because he won't get to use the GPU unless he writes loops that follow some very strict rules exactly. I can certainly see why an end user might prefer a DSL instead: he knows he wants to use the GPU at specific places, and he is ready to use a separate, very restricted "language" to express what he wants, as long as it is guaranteed to use the GPU. (The needs in this case are very different from those of the general PyPy JIT, which tries to accelerate any Python code.)

A bientôt,

Armin.
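[Editorial note: an illustration of the check Armin describes, accepting a loop for GPU compilation only when every rewritten operation is in a supported set and aborting otherwise. The operation names and everything else below are illustrative stand-ins, not the real backend API.]

    SUPPORTED_OPS = {'int_add', 'float_add', 'float_mul',
                     'getarrayitem_raw', 'setarrayitem_raw'}

    class NotGPUCompatible(Exception):
        """Raised to abort GPU compilation and fall back to assembler."""

    def check_gpu_compatible(opnames):
        # Bail out on the first operation we don't know how to
        # translate into a kernel.
        for name in opnames:
            if name not in SUPPORTED_OPS:
                raise NotGPUCompatible(name)

    # This loop body qualifies...
    check_gpu_compatible(['getarrayitem_raw', 'float_add',
                          'setarrayitem_raw'])
    # ...but one containing a call does not.
    try:
        check_gpu_compatible(['call', 'float_add'])
    except NotGPUCompatible:
        pass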

participants (4):

- Armin Rigo
- haael@interia.pl
- Maciej Fijalkowski
- 黄若尘