I thought about that too, but the granularity is very wrong for STM:
the overhead of running tiny transactions will completely dwarf any
potential speed gains.
Then maybe we should go with a slightly different flavor of STM, tailored specifically to assembler loops.
A loop in assembly language operates on a linear memory model and usually modifies only a small number of memory cells. In fact, most changes made by one loop iteration are overwritten by the next. Assuming a single assembler instruction modifies at most one memory cell, the number of cells changed is bounded by the number of loop iterations.
We could replace (para-virtualize) every instruction that writes to a memory cell with two stack pushes: the memory address and the value being written. Every memory read would likewise be replaced with a search of that stack before falling back to reading the actual memory contents. This is a big penalty, of course, but it is worth checking whether it pays off.
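A minimal sketch of this write-log idea in Python (the class and method names are purely illustrative, not taken from any existing PyPy code): writes push (address, value) pairs onto a log, and reads search the log newest-first before touching real memory.

```python
class SandboxedMemory:
    """Per-core memory sandbox: stores go to a write log, loads
    check the log before falling back to the real memory."""

    def __init__(self, memory):
        self.memory = memory   # the real (shared) memory; never written here
        self.write_log = []    # stack of (address, value) pairs

    def write(self, addr, value):
        # para-virtualized store: push instead of touching real memory
        self.write_log.append((addr, value))

    def read(self, addr):
        # search the log from newest to oldest before falling back
        for a, v in reversed(self.write_log):
            if a == addr:
                return v
        return self.memory[addr]
```

Searching the log on every read is the "big penalty" mentioned above; a real implementation would presumably use a hash map rather than a linear scan.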
This is a simple, assembler-specific flavor of STM.
Then we could employ loop scheduling and run the modified code on all cores. Afterwards we would check whether all the memory modifications agree, that is, whether no two cores tried to write different values to the same memory address. If there is no conflict, we commit the transaction and exit the loop.
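The commit step above could be sketched like this (an illustrative sketch, assuming each core hands back its write log as a list of (address, value) pairs in program order):

```python
def try_commit(memory, logs):
    """Merge per-core write logs into real memory.

    Returns False (aborting the transaction) if two cores wrote
    different final values to the same address."""
    # Collapse each core's log first: within one core, only the last
    # write to an address survives, as later iterations overwrite it.
    finals = [dict(log) for log in logs]
    merged = {}
    for final in finals:
        for addr, value in final.items():
            if addr in merged and merged[addr] != value:
                return False   # write-write conflict between cores: abort
            merged[addr] = value
    for addr, value in merged.items():
        memory[addr] = value   # no conflicts: commit everything
    return True
```

Note that this only detects write-write conflicts; a full STM would also have to validate the reads each core performed against the committed state.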
The kind of loops that would benefit most from such optimization would be memset, memcpy and all map-like constructs:
dst = map(fun, src)
--->
for(int i = 0; i < len(src); i++)
dst[i] = fun(src[i]);
Can we just make a hybrid system that first screens out loops that are not
suitable for parallelization and then runs the others with STM?
Following the general "try and fail" philosophy of Python, I would suggest the following:
Just run the unmodified loop on one core and use the other cores to optimize and execute the modified version. If the optimization turns out to be unsuitable, or the serial execution finishes first, just abort the optimized run. If the loop turns out to be parallelizable, return its results instead.
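This "race both versions" idea can be sketched with plain threads (a toy sketch with illustrative names; a real implementation would run native code and kill the losing side rather than let it finish):

```python
import queue
import threading

def race(serial_fn, optimized_fn):
    """Run the unmodified (serial) version and the optimized version in
    parallel; the first successful result wins.  The optimized version
    signals 'unsuitable, aborted' by raising an exception."""
    results = queue.Queue()
    for fn in (serial_fn, optimized_fn):
        def run(fn=fn):
            try:
                results.put(fn())
            except Exception:
                pass  # the optimized run aborted; the serial run still wins
        threading.Thread(target=run, daemon=True).start()
    return results.get()  # block until whichever finishes first
```

Since the serial version always succeeds, the race can never deadlock, and a losing optimized run simply has its result discarded.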
Thanks
haael
From: "黄若塵" <hrc706@gmail.com>
To: "Armin Rigo" <arigo@tunes.org>; haael@interia.pl
Sent: Wednesday 2014-11-26, 9:07
Subject: Re: [pypy-dev] An idea about automatic parallelization in PyPy/RPython
Hi Haael, Rigo,
On 2014/11/21 at 19:21, Armin Rigo wrote:
Hi Haael, hi 黄若尘,
On 21 November 2014 10:55, wrote:
I would suggest a different approach, more similar to Armin's idea of parallelization.
You could just optimistically assume that the loop is parallelizable. Just execute few steps at once (each in its own memory sandbox) and check for conflicts later. This also plays nice with STM.
I thought about that too, but the granularity is very wrong for STM:
the overhead of running tiny transactions will completely dwarf any
potential speed gains. If we're talking about tiny transactions then
maybe HTM would be more suitable. I have no idea if HTM will ever
start appearing on GPU, though. Moreover, you still have the general
hard problems of automatic parallelization, like communicating between
threads the progress made; unless it is carefully done on a
case-by-case basis by a human, this often adds (again) considerable
overheads.
Well, recently I have read some papers about TLS, and also realized the heavy
performance penalty of STM. What I am considering is whether it is possible to
simplify an STM for the trace generated by RPython using some of its features (for
example, there is no control flow, only guards; there are some jit.elidable functions
in the interpreter), or, can we just make a hybrid system that first screens out
loops that are not suitable for parallelization and then runs the others with STM?
To 黄若尘: here's a quick answer to your question. It's not very clean,
but I would patch rpython/jit/backend/x86/regalloc.py, prepare_loop(),
just after it calls _prepare(). It gets a list of rewritten
operations ready to be turned into assembler. I guess you'd need to
check at this point if the loop contains only operations you support,
and if so, produce some different code (possibly GPU). Then either
abort the job here by raising some exception, or if it makes sense,
change the 'operations' list so that it becomes just a few assembler
instructions that will start and stop the GPU code.
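The "check if the loop contains only operations you support" step could look roughly like the following whitelist filter (a hedged illustration: the operation names and the `loop_is_offloadable` helper are plausible examples, not taken from the actual regalloc.py code; the real set would come from the JIT's resoperations):

```python
# Hypothetical screening pass: accept a loop for GPU offloading only if
# every rewritten operation appears on a known-supported whitelist.
SUPPORTED_OPS = {
    'label', 'jump', 'guard_true', 'guard_false',
    'int_add', 'int_mul', 'float_add', 'float_mul',
    'getarrayitem_raw', 'setarrayitem_raw',
}

def loop_is_offloadable(opnames):
    # opnames: the name of each operation in the rewritten list,
    # e.g. what op.getopname() would return inside the JIT backend
    return all(name in SUPPORTED_OPS for name in opnames)
```

Any loop failing the check would take the "abort the job here by raising some exception" path and be compiled normally.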
My own two cents about this project, however, is that it's relatively
easy to support a few special cases, but it quickly becomes very, very
hard to support more general code. You are likely to end up with a
system that only compiles to GPU some very specific templates and
nothing else. The end result for a user is obscure, because he won't
get to use the GPU unless he writes loops that follow exactly some
very strict rules. I certainly see why the end user might prefer to
use a DSL instead: i.e. he knows he wants to use the GPU at specific
places, and he is ready to use a separate very restricted "language"
to express what he wants to do, as long as it is guaranteed to use the
GPU. (The needs in this case are very different from the general PyPy
JIT, which tries to accelerate any Python code.)
A bientôt,
Armin.