[Cython] OpenCL support

Wed Feb 8 15:46:23 CET 2012

On 02/05/2012 10:57 PM, mark florisson wrote:
> Hey,
>
> I created a CEP for opencl support: http://wiki.cython.org/enhancements/opencl
> What do you think?

To start with my own conclusion on this: my feeling is that it is too
little gain, at least for a GPU solution. There's already Theano for
trivial SIMD stuff and PyOpenCL for the getting-hands-dirty stuff. (Of
course, this CEP would be more convenient to use than Theano if one is 
already using Cython.)

But that's just my feeling, and I'm not the one potentially signing up
to do the work, so whether it is "worth it" is really not my decision;
the weighing is done with your weights, not mine. Given an
implementation, I definitely support the inclusion in Cython of this
kind of feature (FWIW).

First, CPU:

OpenCL is probably a very good way of portably making use of SSE/AVX
etc. But I would think that the real payoff is in *not* using OpenCL
vector types, just many threads, so that the OpenCL driver does the
dirty work of mapping each thread to a slot in the CPU's vector
registers. That is, the gain in using OpenCL is to emit scalar code and
leave the vectorization to OpenCL. If one does the hard part and maps
variables to vectors and memory accesses to shuffles, one might as well
go the whole length and emit SSE/AVX rather than OpenCL, to avoid the
startup overhead.
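
As a toy illustration of the "scalar code, many threads" idea (plain
Python standing in for OpenCL; `saxpy_work_item`, `run_ndrange` and the
lane width of 4 are made-up names and values for the example, not any
real driver API): the kernel body only ever sees one element, and the
grouping of work items into vector lanes happens outside it.

```python
# Toy model only: plain Python standing in for an OpenCL CPU driver.
# Nothing here is real OpenCL API.

def saxpy_work_item(gid, a, x, y, out):
    """Scalar kernel body: one work item handles one element."""
    out[gid] = a * x[gid] + y[gid]


def run_ndrange(kernel, global_size, *args, lanes=4):
    """Run `kernel` once per work item, walking the range `lanes` at a time.

    A vectorizing driver could execute each group of `lanes` consecutive
    work items as one SIMD operation; here the group is just an inner loop.
    Returns the number of lane groups that were needed.
    """
    groups = 0
    for base in range(0, global_size, lanes):
        groups += 1
        for gid in range(base, min(base + lanes, global_size)):
            kernel(gid, *args)
    return groups


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0] * 6
out = [0.0] * 6
groups = run_ndrange(saxpy_work_item, len(x), 2.0, x, y, out)
print(out, groups)
```

Whether a given real driver actually vectorizes such a kernel is exactly
the open question in the next paragraph.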

I don't really know how good the Intel and AMD CPU drivers are w.r.t. 
this -- I have seen the Intel driver emit "vectorizing" and "could not 
vectorize", but didn't explore the circumstances.

Then, on to GPU:

It is not a general-purpose solution; you still need to bring in
pyopencl for lots of cases, and so the question is how many cases it
fits and whether that is enough to grow a userbase around it. And,
importantly, how much performance is sacrificed for the resulting
user-friendliness. A 50% performance hit is usually OK, 95% maybe not.
And a 95% hit is not unimaginable if the memory movement is done in a
bad way for some code?
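
To make those two percentages concrete, here is a small back-of-envelope
calculation. It assumes a "hit" of h means the code only reaches a
fraction (1 - h) of the throughput it could have had, so the runtime
grows by 1 / (1 - h); `slowdown_factor` is just a name for the example.

```python
def slowdown_factor(hit):
    """Runtime multiplier for a fractional performance hit (0 <= hit < 1)."""
    return 1.0 / (1.0 - hit)


for hit in (0.50, 0.95):
    print(f"{hit:.0%} hit -> {slowdown_factor(hit):.0f}x the runtime")
# prints:
# 50% hit -> 2x the runtime
# 95% hit -> 20x the runtime
```

Under that reading, the gap between "usually OK" and "maybe not" is the
gap between a 2x and a 20x slowdown.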

I think the fundamental problem is one of programming paradigms. 
Fortran, C++, Cython are all sequential in nature; even with OpenMP it 
is like you have a modest bit of parallelism tacked on to speed up a 
sequential-looking program. With "massively parallel" solutions such as
CUDA and OpenCL, and also MPI in fact, the fundamental assumption is
that you have thousands or hundreds of thousands of threads. And that just
changes how you need to think about writing code, which would tend to 
show up at a syntax level. So, at least if you want good performance, 
you need to change your way of thinking enough that a new syntax 
(loosely cooperating threads rather than parallel-for-loop or SIMD 
instruction) is actually an advantage, as it keeps you reminded of how 
the hardware works.
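
A toy contrast of the two ways of thinking, simulated in plain Python
with no GPU involved (the function names and the block size of 8 are
illustrative assumptions): the same sum written once as a sequential
loop, and once as blocks of "threads" that each own one slot of a shared
scratch array and combine their slots step by step.

```python
# Toy simulation in plain Python; all names are illustrative.

def sum_sequential(data):
    """The sequential view: one loop, one accumulator."""
    total = 0.0
    for v in data:
        total += v
    return total


def sum_block_cooperative(data, block_size=8):
    """The many-threads view: per-block tree reduction in shared scratch.

    block_size must be a power of two. Each pass of the `while` loop
    stands in for the code that runs between two barriers on a GPU.
    """
    partials = []
    for start in range(0, len(data), block_size):
        # Each "thread" tid loads its own element (zero past the end).
        shared = [data[start + tid] if start + tid < len(data) else 0.0
                  for tid in range(block_size)]
        stride = block_size // 2
        while stride > 0:
            # Threads with tid < stride each add one partner slot;
            # a real kernel would need a barrier after this step.
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            stride //= 2
        partials.append(shared[0])  # "thread 0" writes the block result
    return sum_sequential(partials)


data = [float(i) for i in range(20)]
print(sum_sequential(data), sum_block_cooperative(data))
# prints: 190.0 190.0
```

The second version is longer, but its structure (per-thread slots,
shared scratch, barrier points) is the part a parallel-for syntax would
hide.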

So I think the most important thing to do (if you bother) is: Gather a 
set of real-world(-ish) CUDA or OpenCL programs, port them to Cython +
this CEP (without a working Cython implementation for it), and see how 
that goes. That's really the only way to evaluate it.

Some experiences from the single piece of GPU code I've written:

  - For starters I had to give up OpenCL and use CUDA to use all 48 KB
of the shared memory available on NVIDIA compute-capability-2.0
hardware (perhaps I just didn't find the OpenCL option for that). And
increasing from 16 to 48 KB allowed a fundamentally faster and
qualitatively different algorithm to be used. But OpenCL vs. CUDA is
kind of beside the point here...

  - When mucking about with various "obvious" ports of sequential code
to GPU code, I got performance in the range of 5 to 20 GFLOP/s (out of
490 GFLOP/s or so theoretical; NVIDIA Tesla M2050). When really
understanding the hardware, and making good use of the 48 KB of
thread-shared memory, I achieved 209 GFLOP/s, without really doing any
micro-optimization. I don't think the CEP includes any features for
inter-thread communication, so that's off the table.
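
For a sense of proportion, the GFLOP/s figures above can be expressed as
a share of the quoted theoretical peak. This is only arithmetic on the
numbers in the text; `percent_of_peak` is a helper name made up for the
example.

```python
PEAK_GFLOPS = 490.0  # "490 GFLOP/s or so theoretical" (Tesla M2050)


def percent_of_peak(gflops, peak=PEAK_GFLOPS):
    """Achieved throughput as a percentage of the theoretical peak."""
    return 100.0 * gflops / peak


for label, gflops in [("obvious port, low end", 5.0),
                      ("obvious port, high end", 20.0),
                      ("shared-memory version", 209.0)]:
    print(f"{label}: {percent_of_peak(gflops):.1f}% of peak")
# prints:
# obvious port, low end: 1.0% of peak
# obvious port, high end: 4.1% of peak
# shared-memory version: 42.7% of peak
```

So the "obvious" ports sit at roughly 1 to 4% of peak, and the
shared-memory version is about 10x to 40x faster than them.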

(My code is here:

https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in

Though it's badly documented and rush-for-deadline-quality; I plan to 
polish it up and publish it when I get time in autumn).

I guess I mention this as the kind of computation your CEP definitely 
does NOT cover. That's probably OK, but one should figure out 
specifically how many use cases it does cover (in particular with no
control over thread blocks and intra-block communication). Is the CEP an
80%-solution, or a 10%-solution?

Dag Sverre