
Hi, I'm just curious about the feasibility of running Python code on a GPU by extending PyPy. I don't have the time (and probably not the knowledge either) to develop such a PyPy extension, but I just want to know whether it's possible. I'm interested in languages like OpenCL and NVIDIA's CUDA because I think the future of supercomputing is going to be GPGPU. There are people working on bringing GPGPU to Python: http://mathema.tician.de/software/pyopencl http://mathema.tician.de/software/pycuda Would it be possible to run Python code in parallel without the developer having to actively parallelize it? I'm not talking about code with hard concurrency, but about code with intrinsic parallelism (say, matrix multiplication). Would a JIT compiler be capable of detecting that parallelism? Would that be interesting, or is it a job we must leave to humans for now? What do you think? I'm not sure I've explained myself well, since English is not my first language. Cheers, Jorge Timón
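P.S. To make that concrete, here is a toy sketch (plain Python, nothing GPU-specific) of the kind of intrinsically parallel code I mean: every cell of the result can be computed independently of the others, so in principle each one could be handled by its own GPU thread.

# Naive matrix multiplication: each C[i][j] is independent work, which is
# what makes the loop nest "intrinsically parallel".
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]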

2010/8/20 Jorge Timón <timon.elviejo@gmail.com>:
Disclaimer: I am not a PyPy developer, even if I've been following the project with interest, nor am I a GPU expert - I just provide links to the literature I've read. Still, I believe that such an attempt is unlikely to be interesting. Quoting Wikipedia's summary: "Unlike CPUs however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very fast." And significant optimizations are needed anyway to get performance out of GPU code (and if you don't need the last bit of performance, why bother with a GPU?), so I think the need to use a C-like language is the smallest problem.
I would like to point out that, while it may be true for some cases, the importance of GPGPU is probably often exaggerated: http://portal.acm.org/citation.cfm?id=1816021
Researchers in the field are mostly aware that GPGPU is the way to go only for a very restricted category of code. For that code, fine. So instead of running Python code on a GPU, designing from scratch an easy way to program a GPU efficiently for those tasks is better, and projects for that already exist (e.g. the ones you cite).
Additionally, it would probably take a different kind of JIT to exploit GPUs: no branch prediction, very small non-coherent caches, no efficient synchronization primitives, as I read in that paper. I'm no expert, but I guess you'd need to re-architect the needed optimizations from scratch. It took 20-30 years to get from the first, slow Lisp (1958) to, say, Self (1991), a landmark in performant high-level languages derived from Smalltalk; much of that work would have to be redone. There might be further obstacles in the kind of code a JIT generates for such hardware, but I'm no GPU expert and I would have to check again. So I guess the effort to compile Python code for a GPU is not worth it. Finally, for general-purpose code, exploiting the big expected number of CPU cores on our desktop systems is already a challenge.
I would say that Python is not yet the language to use to write efficient parallel code, because of the Global Interpreter Lock (Google for "Python GIL"). The two implementations having no GIL are IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, and the current focus is not on removing it. Scientific computing uses external libraries (like NumPy) - for the supported algorithms, one could introduce parallelism at that level. If that's enough for your application, good. If you want to write a parallel algorithm in Python, we're not there yet.
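For instance, a minimal sketch of parallelism at the library level (assuming a NumPy build linked against an optimized, multithreaded BLAS - that detail varies by installation): the Python code is serial, but the actual work happens in native code outside the interpreter, so the GIL is not the bottleneck.

import numpy as np

A = np.random.rand(2000, 2000)
B = np.random.rand(2000, 2000)
# np.dot dispatches to a BLAS routine; depending on the BLAS build this
# already runs on multiple cores, with the GIL released during the call.
C = np.dot(A, B)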
I'm not talking about code with hard concurrency, but about code with intrinsic parallelism (say, matrix multiplication).
Automatic parallelization is hard, see: http://en.wikipedia.org/wiki/Automatic_parallelization Lots of scientists have tried, lots of money has been invested, but it's still hard. The only practical approaches still require the programmer to introduce parallelism, but in ways much simpler than using multithreading directly. Google OpenMP and Cilk.
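As a rough illustration of "the programmer introduces the parallelism, but in a simple way", here is a sketch using the standard multiprocessing module - the moral equivalent of an OpenMP parallel-for over the output rows, and exactly the step no JIT does for you automatically:

from multiprocessing import Pool

def one_row(args):
    # compute one row of the product; rows are independent, so they can be
    # farmed out to worker processes (which also sidesteps the GIL)
    a_row, B = args
    return [sum(a * b for a, b in zip(a_row, col)) for col in zip(*B)]

def parallel_matmul(A, B, workers=4):
    pool = Pool(workers)
    try:
        return pool.map(one_row, [(row, B) for row in A])
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]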
Would a JIT compiler be capable of detecting parallelism? Summing up the above: probably not.
Moreover, matrix multiplication may not be as easy as one might think. I do not know how to write it for a GPU, but at the end I reference some suggestions from that paper (where it is one of the benchmarks). Here, I explain why writing it for a CPU is complicated. You can multiply two matrices with a triply nested loop, but such an algorithm has poor performance for big matrices because of bad cache locality. GPUs, according to the above-mentioned paper, provide no caches and hide latency in other ways. See here for the two main alternative ideas for writing an efficient matrix multiplication algorithm: http://en.wikipedia.org/wiki/Cache_blocking http://en.wikipedia.org/wiki/Cache-oblivious_algorithm Then you need to parallelize the resulting code yourself, which might or might not be easy (depending on the interactions between the parallel blocks). In that paper, where matrix multiplication is referred to as SGEMM (the BLAS routine implementing it), they suggest using a cache-blocked version of matrix multiplication for both CPUs and GPUs, and argue that parallelization is then easy. Cheers, -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
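P.S. A minimal sketch of the cache-blocking (tiling) idea, in pure Python just to show the loop structure - a real SGEMM also tunes the block size, loop order and vectorization:

def blocked_matmul(A, B, bs=64):
    # work on bs x bs tiles so the operands of the inner loops stay in cache
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            C[i][j] += a * B[k][j]
    return C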

Jython single-threaded performance has little to do with a lack of the GIL. Probably the only direct manifestation is the overhead of allocating __dict__ (or dict) objects: Python attributes have volatile memory semantics in Jython, which is ensured by backing the dict with a ConcurrentHashMap, and that can be expensive to allocate. There are workarounds.
2010/8/20 Paolo Giarrusso <p.giarrusso@gmail.com>

2010/8/20 Jim Baker <jbaker@zyasoft.com>:
Jython single-threaded performance has little to do with a lack of the GIL.
Never implied that - I do believe that a GIL-less fast Python is possible. I just meant we don't have one yet.
I've only found the Unladen Swallow proposal for a memory model: http://code.google.com/p/unladen-swallow/wiki/MemoryModel (and python-safethread, which I don't like). As a Java programmer using Jython, I wouldn't expect to have any volatile fields at all, but I would expect to be able to act on different fields independently - the race conditions we have to protect against are the ones on structural modification (unless the table uses open addressing). _This_ can be implemented through ConcurrentHashMap (which also makes all fields volatile), but an implementation not guaranteeing volatile semantics (if that's possible) would have been equally valid. I am interested because I want to experiment with alternatives. Of course, you can offer stronger semantics, but then you should also advertise that fields are volatile, so that I know I don't need a lock to pass a reference.
, which is ensured by backing the dict with a ConcurrentHashMap, and that can be expensive to allocate. There are workarounds.
I'm also curious about such workarounds - are they currently implemented, or still speculative?
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/

The Unladen Swallow doc, which was derived from a PEP that Jeff proposed, seems to be a fair descriptive outline of Python memory models in general, and Jython's in particular. Obviously the underlying implementation in the JVM provides happens-before consistency; everything else derives from there. The CHM provides additional consistency constraints that should imply sequential consistency for a (vast) subset of Python programs.
However, I can readily construct a program that violates sequential consistency: maybe it uses slots (stored in a Java array), or the array module (which also just wraps Java arrays), or accesses local variables in a frame from another thread (same storage, same problem). Likewise, I can also create Python programs that access Java classes (since this is Jython!), and they too will only see happens-before consistency.
Naturally, the workarounds I mentioned for improving performance in object allocation all rely on not using CHM and its (modestly) expensive semantics. So this would mean using a Java class in some way: possibly a HashMap (especially one that's been exposed through our type-exposure mechanism to avoid reflection overhead), or directly a Java class of some kind (again, exposing is best, much like our builtin types like PyInteger), possibly with all fields marked volatile.
Hope this helps! If you are interested in studying this problem in more depth for Jython, or for other implementations, and the implications of our hybrid model, it would certainly be most welcome. Unfortunately, it's not something that Jython development itself will be working on (standard time constraints apply here). - Jim
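P.S. For concreteness, a made-up sketch of the slots case (hypothetical Point class, not from any real code): under CPython the GIL happens to make this work, while under Jython the slots live in a plain Java array, so only happens-before applies and nothing stronger is promised.

import threading

class Point(object):
    __slots__ = ('x', 'ready')    # attributes live in fixed slots, not in a CHM-backed dict
    def __init__(self):
        self.x = 0
        self.ready = False

p = Point()

def writer():
    p.x = 42
    p.ready = True                # no lock, no volatile: plain field writes

def reader():
    while not p.ready:            # may not observe the write promptly...
        pass
    print(p.x)                    # ...and is not guaranteed to see x == 42 either

t = threading.Thread(target=reader)
t.start()
writer()
t.join()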

2010/8/20 Jim Baker <jbaker@zyasoft.com>:
Your mention of slots is very cool! You made me recall that once you get shadow classes (hidden classes) in Python, you not only get inline caching, you also get the _same_ object layout as with slots, because adding a member causes a hidden-class transition, getting rid of any per-object dictionary _after compilation_. A few exceptions:
* an immutable dictionary mapping field names to offsets is still used during JIT compilation and when inline caching fails;
* a fallback case is needed for when __dict__ is used explicitly, I guess - though not necessarily a dictionary: one could also make __dict__ usage just cause class transitions;
* beyond a certain member count, i.e. if __dict__ is used as a general-purpose dictionary, one might want to switch back to a dictionary representation. This only applies if that is done in Pythonic code (I guess not) - I remember this case from V8, for JavaScript, where the expected usage is different.
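To make the layout idea concrete, here is a toy model of hidden-class transitions (purely illustrative - neither PyPy's nor V8's actual implementation):

class Shape(object):
    # a "hidden class": an immutable mapping from field names to offsets,
    # plus the transitions taken when a new field is added
    def __init__(self, names=()):
        self.names = names
        self.index = dict((n, i) for i, n in enumerate(names))
        self.transitions = {}

    def with_field(self, name):
        if name not in self.transitions:
            self.transitions[name] = Shape(self.names + (name,))
        return self.transitions[name]

EMPTY_SHAPE = Shape()

class Obj(object):
    def __init__(self):
        self.shape = EMPTY_SHAPE
        self.storage = []                  # flat storage, like a slots layout

    def set_field(self, name, value):
        if name in self.shape.index:
            self.storage[self.shape.index[name]] = value
        else:
            self.shape = self.shape.with_field(name)   # hidden-class transition
            self.storage.append(value)

    def get_field(self, name):
        return self.storage[self.shape.index[name]]    # offset lookup, no per-object dict

a, b = Obj(), Obj()
a.set_field('x', 1)
b.set_field('x', 2)
assert a.shape is b.shape   # same insertion order, same hidden class: a cached offset works for both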
Such constraints apply to me too - but I hope to find the time to work on that.
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/

2010/8/20 Paolo Giarrusso <p.giarrusso@gmail.com>:
Python is a very different language from CUDA or OpenCL, hence it's not at all straightforward to map Python's semantics to something that will make sense on a GPU.
What's interesting about combining a GPU and a JIT is optimizing NumPy's vectorized operations, to speed up things like big_array_a + big_array_b using SSE or the GPU. However, I don't think anyone plans to work on it in the near future, so if you don't have time this stays a topic of interest only :)
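For instance, a minimal sketch (plain NumPy as it exists today, nothing JIT-related yet) of the kind of expression such a backend could target:

import numpy as np

big_array_a = np.random.rand(10 ** 6)
big_array_b = np.random.rand(10 ** 6)

# Today each operation allocates a temporary array and makes a separate pass
# over memory; a JIT backend could in principle fuse the whole expression
# into a single loop and emit SSE (or GPU) code for it.
result = 2.0 * big_array_a + big_array_b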

On 8/20/2010 1:51 PM, Maciej Fijalkowski wrote:
> Python is a very different language from CUDA or OpenCL, hence it's not at all straightforward to map Python's semantics to something that will make sense on a GPU.
Try googling: copperhead cuda
Also look at: http://code.google.com/p/copperhead/wiki/Installing

On 8/20/2010 2:18 PM, Maciej Fijalkowski wrote:
> On Fri, Aug 20, 2010 at 11:05 PM, Jeff Anderson-Lee <jonah@eecs.berkeley.edu> wrote:
>> Try googling: copperhead cuda
>> Also look at: http://code.google.com/p/copperhead/wiki/Installing
> What's the point of posting here a project which has not released any code?

1) He is packaging it up for release this month:
> Comment by bryan.catanzaro <http://code.google.com/u/bryan.catanzaro/>, Aug 05, 2010
> Before the end of August. I'm working on packaging it up right now. =)
2) Bryan's got a good head on his shoulders and has been working on this problem for some time. Rather than (or at least before) starting off in a completely new direction, it's worth looking at something that has been in the works for a while now and is attaining some maturity.
3) You are welcome to ignore it, but some folks might be interested, and at least they now know it is there and where to look for more information and forthcoming code.

I can't speak for GPGPU, but I have compiled a subset of Python onto the GPU for real-time rendering. The subset is a little broader than RPython in some ways (for example, attributes are semantically identical to Python) and a little narrower in others (many forms of recursion are disallowed). The big idea is that it allows you to create a real-time rendering system with a single code base, and transparently share functions and data structures between the CPU and GPU. http://www.ncbray.com/pystream.html http://www.ncbray.com/ncbray-dissertation.pdf
It's at least ~100,000x faster than interpreting Python on the CPU. "At least" because the measurements neglect doing things on the CPU like texture sampling. This speedup is pretty obscene, but if you break it down it isn't too unbelievable: 100x for interpreted -> compiled, 10x for the abstraction overhead of using floats instead of doubles, 100x for using the GPU for a task it was built for.
Parallelism issues are sidestepped by explicitly identifying the parallel sections (one function processes every vertex, one function processes every fragment), requiring that the parallel sections have no global side effects, and requiring that certain I/O conventions are followed. Sorry, no big answers here - it's essentially Pythonic stream programming.
The biggest issue with getting Python onto the GPU is memory. I was actually targeting GLSL, not CUDA (CUDA can't access the full rendering pipeline), so pointers were not available. To work around this, the code is optimized to an extreme degree to remove as many memory operations as possible. The remaining memory operations are emulated by splitting the heap into regions, indirecting through arrays, and copying constant data wherever possible. From what I've seen this is where PyPy would have the most trouble: its analysis algorithms are good enough for inferring types and allowing compilation / translation, but they aren't designed to enable aggressive optimization of memory operations (there's not a huge reason to do this if you're translating RPython into C - the C compiler will do it for you). In general, GPU programming doesn't work well with heavy memory access (too many functional units, too little bandwidth). Most of the "C-like" GPU languages are designed so they can easily boil down into code operating out of registers. Python, on the other hand, is addicted to heap memory. Even if you target CUDA, eliminating memory operations will be a huge win.
I'll freely admit there are some ugly things going on, such as the lack of recursion, the reliance on exhaustive inlining, requiring that GPU code follow a specific form, and not working well with container objects in certain situations (it needs to bound the size of the heap). In the end, however, it's a talking dog: the grammar may not be perfect, but the dog talks! If anyone has questions, either private or on the list, I'd be happy to answer them. I have not done enough to advertise my project, and this seems like a good place to start. - Nick Bray
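P.S. In case it helps, here is a made-up sketch of what I mean by Pythonic stream programming (not PyStream's actual API): the kernel is a pure per-element function, and the runtime - here a plain list comprehension, on the GPU one thread per fragment - applies it to every element.

import math

LIGHT_DIR = (0.3, 0.8, 0.5)     # constant data: copied to the GPU in the real system

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def shade_fragment(normal):
    # runs once per fragment; reads only its argument and module-level
    # constants and has no global side effects, as described above
    n = normalize(normal)
    l = normalize(LIGHT_DIR)
    intensity = max(0.0, sum(x * y for x, y in zip(n, l)))
    return (intensity, intensity, intensity)

fragments = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.6, 0.6, 0.52)]
colors = [shade_fragment(f) for f in fragments]   # on the GPU: one thread each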

Hi, here is another effort allowing you to write GPU kernels using Python, targeted at GPGPU. The programmer has to explicitly state the parallelism and there are restrictions on what kinds of constructs are allowed in the kernels, but it's pretty cool: http://www.cs.lth.se/home/Calle_Lejdfors/pygpu/ On Sat, Aug 21, 2010 at 12:46 AM, Nick Bray <ncbray@gmail.com> wrote:
-- Håkan Ardö

This seems similar to what MyHDL does. It's a Python framework to be used as an HDL (hardware description language) for describing gate-array configurations (it outputs Verilog and VHDL for FPGAs or ASICs). It uses an approach similar to RPython's, in that compilation works on objects created by running the Python code, with restrictions applying mostly to the code inside the generators that encode the intended behavior (which is intrinsically parallel). It's on a bit of a different level, since you could use MyHDL to describe (and implement) a GPU, but I thought it would be interesting :) Elmo
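P.S. From memory, the MyHDL style looks roughly like this (a minimal sketch, details may be off - check the MyHDL docs): the hardware is described by ordinary Python functions returning generators, and conversion then works on the instantiated objects, much as described above.

from myhdl import Signal, intbv, always_comb, toVerilog

def adder(a, b, s):
    # combinational logic: s follows a + b; this generator is what gets
    # analyzed and converted, not the Python source text as such
    @always_comb
    def logic():
        s.next = a + b
    return logic

a = Signal(intbv(0)[8:])    # 8-bit inputs
b = Signal(intbv(0)[8:])
s = Signal(intbv(0)[9:])    # 9-bit sum to hold the carry
toVerilog(adder, a, b, s)   # emits adder.v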

participants (9):
- Carl Friedrich Bolz
- Elmo
- Håkan Ardö
- Jeff Anderson-Lee
- Jim Baker
- Jorge Timón
- Maciej Fijalkowski
- Nick Bray
- Paolo Giarrusso