[pypy-dev] An idea about automatic parallelization in PyPy/RPython

Armin Rigo arigo at tunes.org
Fri Nov 21 11:21:25 CET 2014


Hi Haael, hi 黄若尘,

On 21 November 2014 10:55,  <haael at interia.pl> wrote:
> I would suggest a different approach, more similar to Armin's idea of parallelization.
>
> You could just optimistically assume that the loop is parallelizable. Just execute a few steps at once (each in its own memory sandbox) and check for conflicts later. This also plays nicely with STM.
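For concreteness, here is a minimal sketch of the scheme haael describes, in plain Python with entirely hypothetical names (`run_optimistically`, `Sandbox`) -- nothing here is PyPy or STM code. Each iteration runs speculatively against a snapshot, recording its reads and buffering its writes; at commit time, any iteration whose read set overlaps an earlier iteration's writes is simply re-run against the real state:

```python
# Sketch of optimistic loop parallelization with per-iteration sandboxes.
# All names are illustrative; this is not PyPy/STM code.

class Sandbox:
    """Records reads and buffers writes instead of mutating the state."""
    def __init__(self, base):
        self.base = dict(base)
        self.reads = set()
        self.writes = {}

    def get(self, key):
        self.reads.add(key)
        return self.writes.get(key, self.base.get(key, 0))

    def set(self, key, value):
        self.writes[key] = value

def run_optimistically(body, n_iters, state):
    """Speculatively run every iteration against a snapshot of `state`,
    then commit them in order, re-running any iteration whose read set
    overlaps a previously committed iteration's writes."""
    snapshot = dict(state)
    pending = []
    for i in range(n_iters):
        sandbox = Sandbox(snapshot)
        body(i, sandbox)
        pending.append(sandbox)

    committed_writes = set()
    for i, sandbox in enumerate(pending):
        if sandbox.reads & committed_writes:
            # Conflict: an earlier iteration wrote something we read
            # speculatively.  Fall back to re-running on the real state.
            sandbox = Sandbox(state)
            body(i, sandbox)
        state.update(sandbox.writes)
        committed_writes |= set(sandbox.writes)
    return state
```

An independent loop (e.g. writing `a[i] = i*i`) commits all iterations from the speculative runs, while a loop accumulating into one variable conflicts on every iteration and degrades to serial re-execution -- which is exactly the granularity worry below: the bookkeeping per iteration can easily cost more than the loop body itself.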

I thought about that too, but the granularity is very wrong for STM:
the overhead of running tiny transactions will completely dwarf any
potential speed gains.  If we're talking about tiny transactions then
maybe HTM would be more suitable.  I have no idea if HTM will ever
start appearing on GPUs, though.  Moreover, you still have the general
hard problems of automatic parallelization, like communicating
progress between threads; unless this is done carefully on a
case-by-case basis by a human, it often adds (again) considerable
overheads.

To 黄若尘: here's a quick answer to your question.  It's not very clean,
but I would patch rpython/jit/backend/x86/regalloc.py, prepare_loop(),
just after it calls _prepare().  It gets a list of rewritten
operations ready to be turned into assembler.  I guess you'd need to
check at this point if the loop contains only operations you support,
and if so, produce some different code (possibly GPU).  Then either
abort the job here by raising some exception, or if it makes sense,
change the 'operations' list so that it becomes just a few assembler
instructions that will start and stop the GPU code.
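The shape of that check might look like the sketch below. The hook point (regalloc.py, prepare_loop(), after _prepare()) is as stated above, but everything else here -- the whitelist contents, the `Op` representation, the function names -- is made up for illustration:

```python
# Illustrative sketch, NOT actual PyPy code: after the backend has the
# rewritten operation list, decide whether every operation is one the
# GPU path supports; otherwise abort and let the normal x86 path run.
from collections import namedtuple

Op = namedtuple('Op', ['name', 'args'])

# Hypothetical whitelist of operations a GPU backend might handle.
SUPPORTED_OPS = {'int_add', 'int_mul', 'float_add', 'float_mul',
                 'getarrayitem_gc', 'setarrayitem_gc'}

class UnsupportedLoop(Exception):
    """Raised to abort GPU compilation and fall back to normal codegen."""

def try_offload(operations):
    """Return a replacement operation list that would launch GPU code,
    or raise UnsupportedLoop so the regular backend takes over."""
    for op in operations:
        if op.name not in SUPPORTED_OPS:
            raise UnsupportedLoop(op.name)
    # A real backend would emit a GPU kernel here and return the few
    # operations that launch it and wait for its completion.
    return [('call_gpu_kernel', operations)]
```

The point of the early abort is that the fallback path stays untouched: loops that don't fit the template compile exactly as before.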

My own two cents about this project, however, is that it's relatively
easy to support a few special cases, but it quickly becomes very, very
hard to support more general code.  You are likely to end up with a
system that only compiles to GPU some very specific templates and
nothing else.  The end result for a user is obscure: he won't get to
use the GPU unless he writes loops that follow some very strict rules
exactly.  I certainly see why the end user might prefer to
use a DSL instead: i.e. he knows he wants to use the GPU at specific
places, and he is ready to use a separate very restricted "language"
to express what he wants to do, as long as it is guaranteed to use the
GPU.  (The needs in this case are very different from the general PyPy
JIT, which tries to accelerate any Python code.)


A bientôt,

Armin.
