[Cython] OpenCL support

Thu Feb 9 00:15:43 CET 2012

On 02/08/2012 11:11 PM, mark florisson wrote:
> On 8 February 2012 14:46, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 02/05/2012 10:57 PM, mark florisson wrote:
>>>
>>> Hey,
>>>
>>> I created a CEP for opencl support:
>>> http://wiki.cython.org/enhancements/opencl
>>> What do you think?
>>
>>
>> To start with my own conclusion on this, my feel is that it is too little
>> gain, at least for a GPU solution. There's already Theano for trivial
>> SIMD-stuff and PyOpenCL for the getting-hands-dirty stuff. (Of course, this
>> CEP would be more convenient to use than Theano if one is already using
>> Cython.)
>
> Yes, vector operations and elemental or reduction functions operator
> on vectors (which is what we can use Theano for, right?) don't quite
> merit the use of OpenCL. However, the upside is that OpenCL allows
> easier vectorization and multi-threading. We can appease to
> auto-vectorizing compilers, but e.g. using OpenMP for multithreading
> will still segfault the program if used outside the main thread with
> gcc's implementation. I believe intel allows you to use it in any
> thread. (Of course, keeping a thread pool around and managing it
> manually isn't too hard, but...)
>
>> But that's just my feeling, and I'm not the one potentially signing up to do
>> the work, so whether it is "worth it" is really not my decision, the
>> weighing is done with your weights, not mine. Given an implementation, I
>> definitely support the inclusion in Cython for these kind of features
>> (FWIW).
>>
>> First, CPU:
>>
>> OpenCL is probably a very good way of portably making use of SSE/AVX etc.
>> But to really get a payoff then I would think that the real value would be
>> in *not* using OpenCL vector types, just many threads, so that the OpenCL
>> driver does the dirty work of mapping each thread to each slot in the CPU
>> registers? I'd think the gain in using OpenCL is to emit scalar code and
>> leave the dirty work to OpenCL. If one does the hard part and mapped
>> variables to vectors and memory accesses to shuffles, one might as well go
>> the whole length and emit SSE/AVX rather than OpenCL to avoid the startup
>> overhead.
>>
>> I don't really know how good the Intel and AMD CPU drivers are w.r.t. this
>> -- I have seen the Intel driver emit "vectorizing" and "could not
>> vectorize", but didn't explore the circumstances.
>>
>
> I initially thought the same thing, single kernel invocations should
> be trivially auto-vectorizable one would think. At least with Apple
> OpenCL I am getting better performance with vector types though on the
> CPU (up to 35%). I would personally consider emitting vector data
> types bonus points.
>
> But I don't quite agree that emitting SSE or AVX directly would be
> almost as easy in that case. You'd still have to detect at runtime
> which instruction set is supported and generate SSE, SSE2, (SSE4?) and
> AVX. And that's not even all of them :) The OpenCL drivers just hide
> that pain. With handwritten code you might be coding for a specific
> architecture and might be fine with only SSE2, but as a compiler we
> can't really make that same decision.

You make good points.

>
>> Then, on to GPU:
>>
>> It is not a generic-purpose solution, you still need to bring in pyopencl
>> for lots of cases, and so the question is how many cases it fits with and if
>> it is enough to grow a userbase around it. And, importantly, how much
>> performance is sacrificed for the resulting user-friendlyness. 50%
>> performance hit is usually OK, 95% maybe not. And a 95% hit is not
>> unimaginable if the memory movement is done in a bad way for some code?
>
> Yes, I don't expect this to change a lot suddenly. In the long term I
> think the implementation could be sufficiently good to support at
> least most codes. And the user still has full control over data
> movement, if wanted (the pinning thing, which isn't mentioned in the
> CEP).
>
>> I think the fundamental problem is one of programming paradigms. Fortran,
>> C++, Cython are all sequential in nature; even with OpenMP it is like you
>> have a modest bit of parallelism tacked on to speed up a sequential-looking
>> program. With "massively parallel" solutions such as CUDA and OpenCL, and
>> also MPI in fact, the fundamental assumption that you have thousands or
>> hundreds of thousands of threads. And that just changes how you need to
>> think about writing code, which would tend to show up at a syntax level. So,
>> at least if you want good performance, you need to change your way of
>> thinking enough that a new syntax (loosely cooperating threads rather than
>> parallel-for-loop or SIMD instruction) is actually an advantage, as it keeps
>> you reminded of how the hardware works.
>>
>> So I think the most important thing to do (if you bother) is: Gather a set
>> of real worl(-ish) CUDA or OpenCL programs, port them to Cython + this CEP
>> (without a working Cython implementation for it), and see how that goes.
>> That's really the only way to evaluate it.
>
> I've been wanting to do that for a long time now, also to evaluate the
> capabilities of cython.parallel as it stands now. It's a really good
> idea, I'll try to port some codes, and not just the trivial ones like
> Jacobi's method :).
>
>> Some experiences from the single instance GPU code I've written:
>>
>>   - For starters I had to give up OpenCL and use CUDA to use all the 48 KB
>> available shared memory on Nvidia compute-capability-2.0 (perhaps I just
>> didn't find the OpenCL option for that). And increasing from 16 to 48 KB
>> allowed a fundamentally faster and qualitatively different algorithm to be
>> used. But OpenCL vs. CUDA is kind of beside the point here....
>>
>>   - When mucking about with various "obvious" ports of sequential code to GPU
>> code, I got performance in the range of 5 to 20 GFLOP/s (out of 490 GFLOP/s
>> or so theoretical; NVidia Tesla M2050). When really understanding the
>> hardware, and making good use of the 48 KB of thread-shared memory, I
>> achieved 209 GFLOP/s, without really doing any microoptimization. I don't
>> think the CEP includes any features for intra-thread communication, so
>> that's off the table.
>
> The CEP doesn't mention barriers (discussed earlier), but they should
> be supported, and __local memory (that's "shared memory" in CUDA terms
> right?) could be utilized using a more explicit scheme (or implicitly
> if the compiler is smart). The only issue with barriers is that with
> OpenCL you have multiple levels of synchronization, but barriers only
> work within the work group / thread block, whereas with openmp it
> works simply for all your threads. I think a global barrier would have
> to mean kernel termination + start of a new one, which could be hard
> to support depending on where it is placed in the code...
>
>> (My code is here:
>>
>> https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in
>>
>> Though it's badly documented and rush-for-deadline-quality; I plan to polish
>> it up and publish it when I get time in autumn).
>>
>> I guess I mention this as the kind of computation your CEP definitely does
>> NOT cover. That's probably OK, but one should figure out specifically how
>> many usecases it does cover (in particular with no control over thread
>> blocks and intra-block communication). Is the CEP a 80%-solution, or a
>> 10%-solution?
>
> I haven't looked too carefully at the code, but a large portion is
> dedicated to a reduction right? What I don't see is how your reduction
> spans across multiple work-groups / thread blocks? Because
> __syncthreads should only sync stuff within a single block. The CEP

There's no need to reduce across thread blocks because (conveniently 
enough) there's 8000 independent computations to be performed with 
different parameters. I simply used one thread block for each problem.

It's basically a matrix-vector product where the matrix must be 
generated on the fly columnwise (one entry can be generated from the 
preceding two in the same column), but the summation is row-wise.

And turns out that getting inter-thread sum-reduction to work well was 
harder than I expected; a 32-by-32 matrix (needed since warps are 32 
threads) is too big to fit in memory, but tree-reduction makes a lot of 
the threads in a warp do nothing. So I ended up with a hybrid approach; 
there's visual demo from page 49 onwards here:

http://folk.uio.no/dagss/talk-gpusht.pdf

Getting back to Cython, I'll admit that this form of inter-thread 
reduction is quite generic, and that my specific problem could be solved 
by basically coding a set of inter-thread reduction algorithms suitable 
for different hardware into Cython.

> didn't mention reductions, but they should be supported (I'm thinking
> multi-stage or sequential within the workgroup (whatever works
> better), followed by another kernel invocation if the result is
> needed).

Multiple kernel invocations for global barriers appear to be pretty 
standard, and it's why OpenCL support queueing tasks with dependencies etc.

> As mentioned earlier in a different thread (on parallelism I think),
> reduction arrays (i.e. memoryviews or C arrays) as well as generally
> private arrays should be supported. An issue with that is that you
> can't really dedicate an array to each work item / thread (too much
> memory would be consumed).
>
> Again, declarations within blocks would solve many problems:
>
>      cdef float[n] shared_by_work_group
>      with parallel():
>          cdef float[n] local_to_work_group
>          for i in prange(...):
>              cdef float[n] local_to_work_item
>
> For arrays, the reductions could be somewhat more explicit, where
> there is an explicit 'my_memoryview += my_local_scratch_data'. That
> should probably only be allowed for memory local to the work group.
>
> Anyway, I'll try porting some numerical codes to this scheme over the
> coming weeks and see what is missing and how it can be solved. I still
> believe it can all be made to work quite properly, without adjusting
> the language to fit the hardware model. The prange (and OpenMP) model
> look like sequential code, but they tell the compiler a lot, namely
> that each iteration is independent and could therefore be scheduled as
> a separate thread.

Again, good points.

Dag