Re: [pypy-dev] An idea about automatic parallelization in PyPy/RPython
Hi 黄若尘, Armin,
I thought about that too, but the granularity is very wrong for STM: the overhead of running tiny transactions will completely dwarf any potential speed gains.
Then maybe we should go with a slightly different version of STM, specific to assembler loops. A loop in assembly language operates on a linear memory model and usually modifies only a small number of memory cells; in fact, most changes made by one loop iteration get overwritten in the next. Assuming that a single assembler instruction may modify only one memory cell, the number of cells changed is bounded by the number of loop iterations.
We could replace (para-virtualize) every instruction that writes to a memory cell with two stack pushes: one saving the memory address and one saving the value being written. Every memory read would likewise be replaced with a search of that stack before falling back to reading the actual memory contents. This is a big penalty, of course, but it is worth checking whether it pays off. It amounts to a simple, assembler-specific flavour of STM.
Then we could apply loop scheduling and run the modified code on all cores. Afterwards we would check whether all the memory modifications agree, i.e. whether any two cores tried to write different values to the same memory address. If they did not, we could commit the transaction and exit the loop.
The kind of loops that would benefit most from such an optimization would be memset, memcpy and all map-like constructs: dst = map(fun, src) ---> for (int i = 0; i < len(src); i++) dst[i] = fun(src[i]);
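To make the idea concrete, here is a minimal sketch (in plain Python, all names invented for illustration) of the per-iteration write-log scheme: every write goes into a private log instead of memory, reads consult the log first, and the logs are checked for conflicting writes before committing:

```python
# Sketch of the write-log scheme described above. "memory" is modelled
# as a dict; a real implementation would shadow actual memory cells.

def run_speculative(memory, iterations, body):
    """Run each iteration against its own write log, then merge.

    body(read, write, i) is the loop body, using the supplied
    read/write callbacks instead of touching memory directly.
    """
    logs = []
    for i in iterations:
        log = {}                                  # address -> value written
        def read(addr, log=log):
            return log.get(addr, memory[addr])    # log first, then memory
        def write(addr, value, log=log):
            log[addr] = value
        body(read, write, i)
        logs.append(log)

    # Conflict check: two iterations writing different values to the
    # same address means the loop was not parallelizable after all.
    merged = {}
    for log in logs:
        for addr, value in log.items():
            if addr in merged and merged[addr] != value:
                raise RuntimeError("write conflict at %r" % (addr,))
            merged[addr] = value

    memory.update(merged)                         # commit the transaction

# The map-like loop from the text: dst[i] = fun(src[i])
mem = {("src", i): v for i, v in enumerate([1, 2, 3, 4])}
mem.update({("dst", i): 0 for i in range(4)})
run_speculative(mem, range(4),
                lambda read, write, i: write(("dst", i),
                                             read(("src", i)) * 10))
```

The iterations here run sequentially only for clarity; the point of the log is that each one could run on its own core, since no iteration observes another's writes until commit time.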
can we just make a hybrid system that first screens out the loops that are not suitable for parallelization and then runs the others with STM?
Following the general "try and fail" philosophy of Python, I would suggest the following:
Just run the unmodified loop on one core and use the other cores to optimize and execute the modified version. If the optimization turns out to be unsuitable, or if the serial execution finishes first, just abort the optimized run. If the loop turns out to be parallelizable, return its results instead.
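A sketch of this "try and fail" scheme, with illustrative names only (none of this is a PyPy API): run the unmodified serial loop and a speculative parallel version concurrently, prefer the speculative result, and fall back to the serial one if the speculative run aborts:

```python
# Race a serial loop against a speculative parallel version of it.
# A real implementation would also cancel the losing run early.
import concurrent.futures

def serial_version(data):
    # The unmodified loop, always correct.
    return [x * x for x in data]

def speculative_version(data):
    # Stand-in for the optimized run; it may raise to signal an abort.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return list(pool.map(lambda x: x * x, data))

def race(data):
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        serial = pool.submit(serial_version, data)
        spec = pool.submit(speculative_version, data)
        try:
            result = spec.result()     # prefer the parallel result...
        except Exception:
            result = serial.result()   # ...but fall back on abort
        return result

print(race([1, 2, 3]))  # → [1, 4, 9]
```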
Thanks
haael
From: "黄若塵"
Hi Haael, Rigo,
On 2014/11/21 at 19:21, Armin Rigo wrote:
Hi Haael, hi 黄若尘,
On 21 November 2014 10:55, wrote:
I would suggest a different approach, more similar to Armin's idea of parallelization.
You could just optimistically assume that the loop is parallelizable. Just execute a few steps at once (each in its own memory sandbox) and check for conflicts later. This also plays nicely with STM.
I thought about that too, but the granularity is very wrong for STM: the overhead of running tiny transactions will completely dwarf any potential speed gains. If we're talking about tiny transactions then maybe HTM would be more suitable. I have no idea if HTM will ever start appearing on GPU, though. Moreover, you still have the general hard problems of automatic parallelization, like communicating between threads the progress made; unless it is carefully done on a case-by-case basis by a human, this often adds (again) considerable overheads.
Well, recently I have read some papers about TLS (thread-level speculation), and have also realized the heavy performance penalty of STM. What I am considering is: is it possible to simplify an STM for the traces generated by RPython by exploiting some of their features (for example, there is no control flow, only guards, and there are some jit.elidable functions in the interpreter)? Or can we just make a hybrid system that first screens out the loops that are not suitable for parallelization and then runs the others with STM?
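A screening pass in the spirit of the hybrid idea could look for obvious loop-carried dependencies before paying any STM overhead. The trace format below (op, dest, args) is invented for the sketch; real RPython traces are much richer:

```python
# Reject a trace whose iterations are obviously dependent: if a value
# is read before it is written within one iteration, yet is written
# somewhere in the trace, it must flow in from the previous iteration.

def looks_parallelizable(trace):
    dests = {dest for _, dest, _ in trace}
    written = set()
    for op, dest, args in trace:
        for a in args:
            if a not in written and a in dests:
                # Read-before-write of a value the trace produces:
                # a loop-carried dependency, e.g. an accumulator.
                return False
        written.add(dest)
    return True

# dst[i] = src[i] * 2 : independent iterations, fine for speculation
map_like = [("mul", "tmp", ("src_i",)), ("store", "dst_i", ("tmp",))]
# acc = acc + src[i] : a reduction, carried across iterations
reduction = [("add", "acc", ("acc", "src_i"))]

print(looks_parallelizable(map_like), looks_parallelizable(reduction))
```

Reductions could still be parallelized with a dedicated strategy, but the point of the screen is only to keep clearly unsuitable loops away from the STM path.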
To 黄若尘: here's a quick answer to your question. It's not very clean, but I would patch rpython/jit/backend/x86/regalloc.py, prepare_loop(), just after it calls _prepare(). It gets a list of rewritten operations ready to be turned into assembler. I guess you'd need to check at this point if the loop contains only operations you support, and if so, produce some different code (possibly GPU). Then either abort the job here by raising some exception, or if it makes sense, change the 'operations' list so that it becomes just a few assembler instructions that will start and stop the GPU code.
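The check Armin describes might be sketched like this, without the real PyPy machinery: scan the operation list that _prepare() produced and bail out unless every operation is one we know how to offload. The supported-op set and the NotSupported exception are placeholders for illustration, not actual PyPy classes:

```python
# Screen a rewritten operation list before attempting GPU code
# generation; raise to abort and fall back to the normal x86 path.

SUPPORTED_OPS = {"int_add", "int_mul", "getarrayitem_gc",
                 "setarrayitem_gc", "guard_true", "jump", "label"}

class NotSupported(Exception):
    """Raised to abort the offload attempt."""

def screen_operations(operations):
    """operations: objects with a .getopname() method, mimicking the
    list prepare_loop() gets back from _prepare()."""
    for op in operations:
        if op.getopname() not in SUPPORTED_OPS:
            raise NotSupported(op.getopname())
    return operations  # here one would emit GPU code instead

class FakeOp:
    def __init__(self, name):
        self._name = name
    def getopname(self):
        return self._name

ops = [FakeOp("label"), FakeOp("int_add"), FakeOp("jump")]
print(len(screen_operations(ops)))  # all supported → 3

try:
    screen_operations([FakeOp("call_assembler")])
except NotSupported as e:
    print("fallback to x86:", e)
```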
My own two cents about this project, however, is that it's relatively easy to support a few special cases, but it quickly becomes very, very hard to support more general code. You are likely to end up with a system that only compiles to GPU some very specific templates and nothing else. The end result for a user is obscure, because he won't get to use the GPU unless he writes loops that follow exactly some very strict rules. I certainly see why the end user might prefer to use a DSL instead: i.e. he knows he wants to use the GPU at specific places, and he is ready to use a separate very restricted "language" to express what he wants to do, as long as it is guaranteed to use the GPU. (The needs in this case are very different from the general PyPy JIT, which tries to accelerate any Python code.)
A bientôt,
Armin.