[pypy-dev] gpgpu and pypy
fijall at gmail.com
Fri Aug 20 22:51:42 CEST 2010
2010/8/20 Paolo Giarrusso <p.giarrusso at gmail.com>:
> 2010/8/20 Jorge Timón <timon.elviejo at gmail.com>:
>> Hi, I'm just curious about the feasibility of running python code on a GPU
>> by extending pypy.
> Disclaimer: I am not a PyPy developer, even if I've been following the
> project with interest. Nor am I an expert of GPU - I provide links to
> the literature I've read.
> Yet, I believe that such an attempt is unlikely to be interesting.
> Quoting Wikipedia's synthesis:
> "Unlike CPUs however, GPUs have a parallel throughput architecture
> that emphasizes executing many concurrent threads slowly, rather than
> executing a single thread very fast."
> And significant optimizations are needed anyway to get performance for
> GPU code (and if you don't need the last bit of performance, why
> bother with a GPU?), so I think that the need to use a C-like language
> is the smallest problem.
>> I don't have the time (and probably not the knowledge either) to develop that
>> pypy extension, but I just want to know if it's possible.
>> I'm interested in languages like openCL and nvidia's CUDA because I think
>> the future of supercomputing is going to be GPGPU.
Python is a very different language from CUDA or OpenCL, hence it's
not completely trivial to map Python's semantics to something that
will make sense for a GPU.
> I would like to point out that while for some cases it might be right,
> the importance of GPGPU is probably often exaggerated:
> Researchers in the field are mostly aware of the fact that GPGPU is
> the way to go only for a very restricted category of code. For that
> code, fine.
> Thus, instead of running Python code on a GPU, it is better to design
> from scratch an easy way to program a GPU efficiently for those
> tasks, and projects for that already exist (e.g. what you cite).
> Additionally, it would probably take a different kind of JIT to
> exploit GPUs. No branch prediction, very small non-coherent caches, no
> efficient synchronization primitives, as I read from this paper... I'm
> no expert, but I guess you'd need to re-architect the needed
> optimizations from scratch.
> And it took 20-30 years to get from the first, slow Lisp (1958) to,
> say, Self (1991), a landmark in performant high-level languages,
> derived from Smalltalk. Most of that would have to be redone.
> So, I guess that the effort to compile Python code for a GPU is not
> worth it. There might be further reasons due to the kind of code a JIT
> generates, since a GPU has no branch predictor, no caches, and so on,
> but I'm no GPU expert and I would have to check again.
> Finally, for general purpose code, exploiting the big expected number
> of CPUs on our desktop systems is already a challenge.
>> There are people working on
>> bringing GPGPU to Python:
>> Would it be possible to run python code in parallel without the need (for
>> the developer) of actively parallelizing the code?
> I would say that Python is not yet the language to use to write
> efficient parallel code, because of the Global Interpreter Lock
> (Google for "Python GIL"). The two implementations having no GIL are
> IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL,
> and the current focus is not on removing it.
> Scientific computing uses external libraries (like NumPy) - for the
> supported algorithms, one could introduce parallelism at that level.
> If that's enough for your application, good.
> If you want to write a parallel algorithm in Python, we're not there yet.
>> I'm not talking about code with hard concurrency problems, but about
>> code with intrinsic parallelism (let's say matrix multiplication).
> Automatic parallelization is hard, see:
> Lots of scientists have tried, lots of money has been invested, but
> it's still hard.
> The only practical approaches still require the programmer to
> introduce parallelism, but in ways much simpler than using
> multithreading directly. Google for OpenMP and Cilk.
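To make the "simpler than threading directly" point concrete, here is a minimal Python sketch (my own hypothetical example, not from the thread or from OpenMP itself) of the same idea: the programmer only marks which loop iterations are independent, and a library runs them in parallel.

```python
# Programmer-introduced parallelism in the OpenMP/Cilk spirit: the code
# keeps the shape of an ordinary map, and the executor handles the
# threading details. Uses processes to sidestep CPython's GIL.
from concurrent.futures import ProcessPoolExecutor

def work(row):
    # Some independent per-item computation (no shared state).
    return sum(x * x for x in row)

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Sequential version: an ordinary comprehension.
sequential = [work(r) for r in rows]

if __name__ == "__main__":
    # Parallel version: same code shape, but iterations are distributed
    # across worker processes by the executor.
    with ProcessPoolExecutor() as ex:
        parallel = list(ex.map(work, rows))
    assert parallel == sequential
```

The key property is that the parallel and sequential versions are the same code modulo one line, which is exactly what makes this style practical compared to hand-written multithreading.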
>> Would a JIT compilation be capable of detecting parallelism?
> Summing up what is above, probably not.
> Moreover, matrix multiplication may not be as easy as one might think.
> I do not know how to write it for a GPU, but in the end I reference
> some suggestions from that paper (where it is one of the benchmarks).
> But here, I explain why writing it for a CPU is complicated. You can
> multiply two matrices with a triply nested loop, but such an algorithm
> has poor performance for big matrices because of bad cache locality.
> GPUs, according to the above-mentioned paper, provide no caches and
> hide latency in other ways.
> See here for the two main alternative ideas which allow solving this
> problem of writing an efficient matrix multiplication algorithm:
> Then, you need to parallelize the resulting code yourself, which might
> or might not be easy (depending on the interactions between the
> parallel blocks that are found there).
> In that paper, where matrix multiplication is called SGEMM (the
> BLAS routine implementing it), they suggest using a cache-blocked
> version of matrix multiplication for both CPUs and GPUs, and argue
> that parallelization is then easy.
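A minimal sketch of the cache-blocking idea mentioned above (my own illustration, not code from the paper): instead of streaming over whole rows and columns, work on small B-by-B tiles that fit in cache, so each loaded element is reused many times before being evicted.

```python
def matmul_naive(A, B):
    # Triply nested loop: correct, but B is traversed column-wise,
    # which gives bad cache locality for big matrices.
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, bs=2):
    # Same arithmetic, reordered into bs-by-bs tiles: the three inner
    # loops touch only small blocks that can stay resident in cache.
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_naive(A, B) == matmul_blocked(A, B)
```

Pure Python is of course far too slow for real use; the point is only the loop structure, which is the same reordering a cache-blocked SGEMM applies, and the tile loops (ii, jj, kk) are the level at which parallelization then becomes easy, since tiles of C are independent.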
What's interesting about using a GPU with a JIT is optimizing numpy's
vectorized operations, to speed up things like big_array_a +
big_array_b using SSE and the GPU. However, I don't think anyone plans
to work on it in the near future, and if you don't have time this
stays a topic of interest only :)
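To illustrate the optimization opportunity (a pure-Python sketch of my own; the function names are hypothetical): evaluating a chain like big_array_a + big_array_b + big_array_c one operator at a time allocates a full temporary array for each "+", while a vectorizing JIT could fuse the chain into a single loop with no temporaries, which is also the form that maps onto SSE or a GPU kernel.

```python
def add_unfused(a, b, c):
    # How an interpreter evaluates a + b + c: a full temporary list is
    # materialized for the first "+", then traversed again.
    tmp = [x + y for x, y in zip(a, b)]
    return [x + y for x, y in zip(tmp, c)]

def add_fused(a, b, c):
    # What a vectorizing JIT could emit instead: one pass over the
    # data, no intermediate array.
    return [x + y + z for x, y, z in zip(a, b, c)]

a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
assert add_unfused(a, b, c) == add_fused(a, b, c) == [12, 15, 18]
```

The fused single loop is the form that can be handed to a SIMD unit or a GPU, which is why this kind of expression rewriting would be the natural first target for a numpy-aware JIT.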
> Paolo Giarrusso - Ph.D. Student