[pypy-dev] gpgpu and pypy

Elmo elmo.mantynen at iki.fi
Wed Sep 1 11:01:34 CEST 2010

This seems similar to what MyHDL does. It's a Python framework to be 
used as an HDL (hardware description language) for describing gate-array 
configurations (it outputs Verilog and VHDL for FPGAs or ASICs). It takes 
an approach similar to RPython's: compilation works on objects created 
by running the Python code, with restrictions applying mostly to the 
code inside the generators used to encode the intended behavior 
(which is intrinsically parallel).
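
A rough pure-Python sketch of that generator style (the names here are 
illustrative; the real myhdl library supplies decorators and a proper 
event-driven simulator instead):

```python
# Illustrative only: a hardware "process" as a generator that is
# resumed once per simulation step, as in MyHDL's style.

def and_gate(a, b, out):
    # The behaviour depends only on the current signal values,
    # which is what makes it intrinsically parallel.
    while True:
        out[0] = a[0] & b[0]
        yield  # suspend until the simulator resumes this process

def simulate(processes, steps):
    # A trivial stand-in for a real simulator: advance every
    # process once per step.
    for _ in range(steps):
        for proc in processes:
            next(proc)

a, b, out = [1], [1], [0]
simulate([and_gate(a, b, out)], steps=1)
# out[0] is now 1
```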

This is on a somewhat different level, since you could use MyHDL to 
describe (and implement) a GPU itself, but I thought it would be 
interesting :)


On 08/21/2010 10:06 AM, Hakan Ardo wrote:
> Hi,
> here is another effort allowing you to write GPU kernels using
> python, targeted at gpgpu. The programmer has to explicitly state the
> parallelism and there are restrictions on what kind of constructs are
> allowed in the kernels, but it's pretty cool:
>    http://www.cs.lth.se/home/Calle_Lejdfors/pygpu/
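
The explicit-kernel style described there might look roughly like this 
in plain Python (illustrative only; this is not PyGPU's actual API): the 
programmer names the parallel dimension, and the kernel body is 
restricted to simple arithmetic on its inputs.

```python
# Illustrative sketch of a GPU-kernel programming style in Python.

def saxpy_kernel(i, a, x, y, out):
    # One logical "thread" per index i; no shared mutable state
    # beyond the output slot this thread owns.
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    # A sequential stand-in: a GPU would run these iterations
    # in parallel across its functional units.
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
out = [0.0] * 3
launch(saxpy_kernel, 3, 2.0, x, y, out)
# out == [6.0, 9.0, 12.0]
```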
> On Sat, Aug 21, 2010 at 12:46 AM, Nick Bray<ncbray at gmail.com>  wrote:
>> I can't speak for GPGPU, but I have compiled a subset of Python onto
>> the GPU for real-time rendering.  The subset is a little broader than
>> RPython in some ways (for example, attributes are semantically
>> identical to Python) and a little narrower in some ways (many forms of
>> recursion are disallowed.)  The big idea is that it allows you to
>> create a real-time rendering system with a single code base, and
>> transparently share functions and data structures between the CPU and
>> GPU.
>> http://www.ncbray.com/pystream.html
>> http://www.ncbray.com/ncbray-dissertation.pdf
>> It's at least ~100,000x faster than interpreting Python on the CPU.
>> "At least" because the measurements neglect doing things on the CPU
>> like texture sampling.  This speedup is pretty obscene, but if you
>> break it down it isn't too unbelievable... 100x for interpreted ->
>> compiled, 10x for abstraction overhead of using floats instead of
>> doubles, 100x for using the GPU and using it for a task it was built
>> for.
>> Parallelism issues are sidestepped by explicitly identifying the
>> parallel sections (one function processes every vertex, one function
>> processes every fragment), requiring the parallel sections have no
>> global side effects, and that certain I/O conventions are followed.
>> Sorry, no big answers here - it's essentially Pythonic stream
>> programming.
>> The biggest issue with getting Python onto the GPU is memory.  I was
>> actually targeting GLSL, not CUDA (CUDA can't access the full
>> rendering pipeline), so pointers were not available.  To work around
>> this, the
>> code is optimized to an extreme degree to remove as many memory
>> operations as possible.  The remaining memory operations are emulated
>> by splitting the heap into regions, indirecting through arrays, and
>> copying constant data wherever possible.  From what I've seen this is
>> where PyPy would have the most trouble: its analysis algorithms are
>> good enough for inferring types and allowing compilation /
>> translation... they aren't designed to enable aggressive optimization
>> of memory operations (there's not a huge reason to do this if you're
>> translating RPython into C... the C compiler will do it for you).  In
>> general, GPU programming doesn't work well with memory access (too
>> many functional units, too little bandwidth).  Most of the "C-like"
>> GPU languages are designed so they can easily boil down into code
>> operating out of registers.  Python, on the other hand, is addicted to
>> heap memory.  Even if you target CUDA, eliminating memory operations
>> will be a huge win.
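
The region-splitting idea can be sketched in a few lines (purely 
illustrative; the names and layout are not from PyStream): without 
pointers, each object type gets its own array ("region"), and a 
"reference" is just an integer index into that array.

```python
# Illustrative: emulating heap references on a target with no
# pointers by indirecting through per-type arrays.

class Heap:
    def __init__(self):
        self.points = []  # region holding all Point-like objects

    def alloc_point(self, x, y):
        self.points.append((x, y))
        return len(self.points) - 1  # the "pointer" is an index

    def point_x(self, ref):
        return self.points[ref][0]

heap = Heap()
p = heap.alloc_point(3.0, 4.0)
# heap.point_x(p) == 3.0
```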
>> I'll freely admit there are some ugly things going on, such as the lack
>> of recursion, reliance on exhaustive inlining, requiring GPU code
>> follow a specific form, and not working well with container objects in
>> certain situations (it needs to bound the size of the heap).  In the
>> end, however, it's a talking dog... the grammar may not be perfect,
>> but the dog talks!  If anyone has questions, either private or on the
>> list, I'd be happy to answer them.  I have not done enough to
>> advertise my project, and this seems like a good place to start.
>> - Nick Bray
>> 2010/8/20 Paolo Giarrusso<p.giarrusso at gmail.com>:
>>> 2010/8/20 Jorge Timón<timon.elviejo at gmail.com>:
>>>> Hi, I'm just curious about the feasibility of running python code
>>>> in a GPU by extending pypy.
>>> Disclaimer: I am not a PyPy developer, even if I've been following the
>>> project with interest. Nor am I an expert of GPU - I provide links to
>>> the literature I've read.
>>> Yet, I believe that such an attempt is unlikely to be interesting.
>>> Quoting Wikipedia's synthesis:
>>> "Unlike CPUs however, GPUs have a parallel throughput architecture
>>> that emphasizes executing many concurrent threads slowly, rather than
>>> executing a single thread very fast."
>>> And significant optimizations are needed anyway to get performance for
>>> GPU code (and if you don't need the last bit of performance, why
>>> bother with a GPU?), so I think that the need to use a C-like language
>>> is the smallest problem.
>>>> I don't have the time (and probably not the knowledge either) to
>>>> develop that pypy extension, but I just want to know if it's possible.
>>>> I'm interested in languages like OpenCL and nvidia's CUDA because I
>>>> think the future of supercomputing is going to be GPGPU.
>>> I would like to point out that while for some cases it might be right,
>>> the importance of GPGPU is probably often exaggerated:
>>> Researchers in the field are mostly aware of the fact that GPGPU is
>>> the way to go only for a very restricted category of code. For that
>>> code, fine.
>>> Thus, instead of running Python code on a GPU, designing from scratch
>>> an easy way to program a GPU efficiently for those tasks is better,
>>> and projects for that already exist (e.g. the ones you cite).
>>> Additionally, it would probably take a different kind of JIT to
>>> exploit GPUs. No branch prediction, very small non-coherent caches, no
>>> efficient synchronization primitives, as I read from this paper... I'm
>>> no expert, but I guess you'd need to re-architect the needed
>>> optimizations from scratch.
>>> And it took 20-30 years to get from the first, slow Lisp (1958) to,
>>> say, Self (1991), a landmark in performant high-level languages,
>>> derived from Smalltalk. Most of that would have to be redone.
>>> So, I guess that the effort to compile Python code for a GPU is not
>>> worth it. There might be further reasons due to the kind of code a JIT
>>> generates, since a GPU has no branch predictor, no caches, and so on,
>>> but I'm no GPU expert and I would have to check again.
>>> Finally, for general purpose code, exploiting the big expected number
>>> of CPUs on our desktop systems is already a challenge.
>>>> There's people working in
>>>> bringing GPGPU to python:
>>>> http://mathema.tician.de/software/pyopencl
>>>> http://mathema.tician.de/software/pycuda
>>>> Would it be possible to run python code in parallel without the need
>>>> (for the developer) of actively parallelizing the code?
>>> I would say that Python is not yet the language to use to write
>>> efficient parallel code, because of the Global Interpreter Lock
>>> (Google for "Python GIL"). The two implementations having no GIL are
>>> IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL,
>>> and the current focus is not on removing it.
>>> Scientific computing uses external libraries (like NumPy) - for the
>>> supported algorithms, one could introduce parallelism at that level.
>>> If that's enough for your application, good.
>>> If you want to write a parallel algorithm in Python, we're not
>>> there yet.
>>>> I'm not talking about code of hard concurrency, but of code with
>>>> parallelism (let's say matrix multiplication).
>>> Automatic parallelization is hard, see:
>>> http://en.wikipedia.org/wiki/Automatic_parallelization
>>> Lots of scientists have tried, lots of money has been invested, but
>>> it's still hard.
>>> The only practical approaches still require the programmer to
>>> introduce parallelism, but in ways much simpler than using
>>> multithreading directly. Google OpenMP and Cilk.
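
The practical pattern described here, where the programmer marks the 
parallel loop explicitly (as in OpenMP's "parallel for"), can be 
sketched with Python's standard-library thread pool as a stand-in:

```python
# Illustrative: the programmer states the parallelism explicitly;
# the runtime schedules it. (Note that in CPython the GIL limits
# speedup for pure-Python CPU-bound work like this.)
from concurrent.futures import ThreadPoolExecutor

def work(i):
    return i * i

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(8)))  # explicit parallel map
# results == [0, 1, 4, 9, 16, 25, 36, 49]
```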
>>>> Would a JIT compilation be capable of detecting parallelism?
>>> Summing up what is above, probably not.
>>> Moreover, matrix multiplication may not be as easy as one might think.
>>> I do not know how to write it for a GPU, but in the end I reference
>>> some suggestions from that paper (where it is one of the benchmarks).
>>> But here, I explain why writing it for a CPU is complicated. You can
>>> multiply two matrices with a triply nested loop, but such an algorithm
>>> has poor performance for big matrices because of bad cache locality.
>>> GPUs, according to the above-mentioned paper, provide no caches and
>>> hide latency in other ways.
>>> See here for the two main alternative ideas which allow solving this
>>> problem of writing an efficient matrix multiplication algorithm:
>>> http://en.wikipedia.org/wiki/Cache_blocking
>>> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
>>> Then, you need to parallelize the resulting code yourself, which might
>>> or might not be easy (depending on the interactions between the
>>> parallel blocks that are found there).
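
The cache-blocked loop structure those links describe can be sketched in 
pure Python (a minimal illustration of the tiling idea, not a tuned 
implementation):

```python
# Illustrative cache-blocked ("tiled") matrix multiply: iterate over
# bs x bs blocks so the tiles of A, B and C being touched stay
# resident in cache (or, on a GPU, in registers / shared memory).

def matmul_blocked(A, B, n, bs):
    """Multiply two n x n matrices using bs x bs blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # One block of the classic triply nested loop.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a_ik * B[k][j]
    return C

C = matmul_blocked([[1.0, 2.0], [3.0, 4.0]],
                   [[5.0, 6.0], [7.0, 8.0]], n=2, bs=1)
# C == [[19.0, 22.0], [43.0, 50.0]]
```

The outer three loops walk over tiles; independent output tiles are what 
makes the subsequent parallelization comparatively easy.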
>>> In that paper, where matrix multiplication is referred to as SGEMM (the
>>> BLAS routine implementing it), they suggest using a cache-blocked
>>> version of matrix multiplication for both CPUs and GPUs, and argue
>>> that parallelization is then easy.
>>> Cheers,
>>> --
>>> Paolo Giarrusso - Ph.D. Student
>>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>>> _______________________________________________
>>> pypy-dev at codespeak.net
>>> http://codespeak.net/mailman/listinfo/pypy-dev
>> _______________________________________________
>> pypy-dev at codespeak.net
>> http://codespeak.net/mailman/listinfo/pypy-dev
