[Cython] OpenCL support

mark florisson markflorisson88 at gmail.com
Thu Feb 9 13:52:07 CET 2012


On 8 February 2012 23:28, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 02/09/2012 12:15 AM, Dag Sverre Seljebotn wrote:
>>
>> On 02/08/2012 11:11 PM, mark florisson wrote:
>>>
>>> On 8 February 2012 14:46, Dag Sverre Seljebotn
>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>
>>>> On 02/05/2012 10:57 PM, mark florisson wrote:
>>>>>
>>>>>
>>>>> Hey,
>>>>>
>>>>> I created a CEP for opencl support:
>>>>> http://wiki.cython.org/enhancements/opencl
>>>>> What do you think?
>>>>
>>>>
>>>>
>>>> To start with my own conclusion on this, my feeling is that it is
>>>> too little gain, at least for a GPU solution. There's already Theano
>>>> for trivial SIMD stuff and PyOpenCL for the getting-hands-dirty
>>>> stuff. (Of course, this CEP would be more convenient to use than
>>>> Theano if one is already using Cython.)
>>>
>>>
>>> Yes, vector operations and elemental or reduction functions operating
>>> on vectors (which is what we can use Theano for, right?) don't quite
>>> merit the use of OpenCL. However, the upside is that OpenCL allows
>>> easier vectorization and multi-threading. We can cater to
>>> auto-vectorizing compilers, but e.g. using OpenMP for multithreading
>>> will still segfault the program if used outside the main thread with
>>> gcc's implementation. I believe Intel's implementation allows you to
>>> use it in any thread. (Of course, keeping a thread pool around and
>>> managing it manually isn't too hard, but...)
>>>
>>>> But that's just my feeling, and I'm not the one potentially signing
>>>> up to do the work, so whether it is "worth it" is really not my
>>>> decision; the weighing is done with your weights, not mine. Given an
>>>> implementation, I definitely support the inclusion of these kinds of
>>>> features in Cython (FWIW).
>>>>
>>>> First, CPU:
>>>>
>>>> OpenCL is probably a very good way of portably making use of SSE/AVX
>>>> etc. But to really get a payoff, I would think the real value lies
>>>> in *not* using OpenCL vector types, just many threads, so that the
>>>> OpenCL driver does the dirty work of mapping each thread to each
>>>> slot in the CPU registers. That is, the gain in using OpenCL is that
>>>> you emit scalar code and leave the dirty work to OpenCL. If one does
>>>> the hard part anyway and maps variables to vectors and memory
>>>> accesses to shuffles, one might as well go the whole way and emit
>>>> SSE/AVX rather than OpenCL, to avoid the startup overhead.
>>>>
>>>> I don't really know how good the Intel and AMD CPU drivers are w.r.t.
>>>> this
>>>> -- I have seen the Intel driver emit "vectorizing" and "could not
>>>> vectorize", but didn't explore the circumstances.
>>>>
>>>
>>> I initially thought the same thing: one would expect single kernel
>>> invocations to be trivially auto-vectorizable. At least with Apple's
>>> OpenCL, though, I am getting better performance with vector types on
>>> the CPU (up to 35%). I would personally consider emitting vector data
>>> types bonus points.
>>>
>>> But I don't quite agree that emitting SSE or AVX directly would be
>>> almost as easy in that case. You'd still have to detect at runtime
>>> which instruction set is supported and generate SSE, SSE2, (SSE4?) and
>>> AVX. And that's not even all of them :) The OpenCL drivers just hide
>>> that pain. With handwritten code you might be coding for a specific
>>> architecture and might be fine with only SSE2, but as a compiler we
>>> can't really make that same decision.
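
For what it's worth, the kind of difference I'm talking about is between
kernels of roughly this shape (an untested sketch, names made up; the
scalar version leaves vectorization entirely to the driver, the float4
version is what "emitting vector data types" would mean on our side):

/* Scalar: one work-item per element; the CPU driver may pack
   neighbouring work-items into SSE/AVX lanes on its own. */
__kernel void saxpy_scalar(__global const float *x,
                           __global float *y,
                           const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

/* Explicit float4: each work-item handles four elements; assumes the
   array length is a multiple of 4. */
__kernel void saxpy_float4(__global const float4 *x,
                           __global float4 *y,
                           const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
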
>>
>>
>> You make good points.
>>
>>>
>>>> Then, on to GPU:
>>>>
>>>> It is not a general-purpose solution; you still need to bring in
>>>> pyopencl for lots of cases, and so the question is how many cases it
>>>> fits and whether that is enough to grow a userbase around it. And,
>>>> importantly, how much performance is sacrificed for the resulting
>>>> user-friendliness. A 50% performance hit is usually OK, 95% maybe
>>>> not. And a 95% hit is not unimaginable if the memory movement is
>>>> done in a bad way for some code?
>>>
>>>
>>> Yes, I don't expect this to change a lot suddenly. In the long term I
>>> think the implementation could be sufficiently good to support at
>>> least most codes. And the user still has full control over data
>>> movement, if wanted (the pinning thing, which isn't mentioned in the
>>> CEP).
>>>
>>>> I think the fundamental problem is one of programming paradigms.
>>>> Fortran, C++ and Cython are all sequential in nature; even with
>>>> OpenMP it is like you have a modest bit of parallelism tacked on to
>>>> speed up a sequential-looking program. With "massively parallel"
>>>> solutions such as CUDA and OpenCL, and also MPI in fact, the
>>>> fundamental assumption is that you have thousands or hundreds of
>>>> thousands of threads. And that just changes how you need to think
>>>> about writing code, which would tend to show up at the syntax level.
>>>> So, at least if you want good performance, you need to change your
>>>> way of thinking enough that a new syntax (loosely cooperating
>>>> threads rather than a parallel for loop or SIMD instruction) is
>>>> actually an advantage, as it keeps you reminded of how the hardware
>>>> works.
>>>>
>>>> So I think the most important thing to do (if you bother) is: gather
>>>> a set of real-world(-ish) CUDA or OpenCL programs, port them to
>>>> Cython + this CEP (without a working Cython implementation for it),
>>>> and see how that goes. That's really the only way to evaluate it.
>>>
>>>
>>> I've been wanting to do that for a long time now, also to evaluate
>>> the capabilities of cython.parallel as it stands now. It's a really
>>> good idea; I'll try to port some codes, and not just the trivial ones
>>> like Jacobi's method :).
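
(For concreteness, by "trivial" I mean things like a single Jacobi
sweep, which maps directly onto one work-item per grid point -- a rough
hand-written OpenCL sketch, not CEP output:)

/* One Jacobi relaxation sweep on an nx-by-ny grid, one work-item per
   interior point.  u_new and u_old are separate buffers, so there is
   no dependence between work-items within a sweep. */
__kernel void jacobi_sweep(__global const float *u_old,
                           __global float *u_new,
                           const int nx, const int ny)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        u_new[i * ny + j] = 0.25f * (u_old[(i - 1) * ny + j] +
                                     u_old[(i + 1) * ny + j] +
                                     u_old[i * ny + j - 1] +
                                     u_old[i * ny + j + 1]);
    }
}
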
>>>
>>>> Some experiences from the single instance GPU code I've written:
>>>>
>>>> - For starters I had to give up OpenCL and use CUDA to use all the 48 KB
>>>> available shared memory on Nvidia compute-capability-2.0 (perhaps I just
>>>> didn't find the OpenCL option for that). And increasing from 16 to 48 KB
>>>> allowed a fundamentally faster and qualitatively different algorithm
>>>> to be
>>>> used. But OpenCL vs. CUDA is kind of beside the point here....
>>>>
>>>> - When mucking about with various "obvious" ports of sequential
>>>> code to GPU code, I got performance in the range of 5 to 20 GFLOP/s
>>>> (out of 490 GFLOP/s or so theoretical; NVidia Tesla M2050). When
>>>> really understanding the hardware, and making good use of the 48 KB
>>>> of thread-shared memory, I achieved 209 GFLOP/s, without really
>>>> doing any micro-optimization. I don't think the CEP includes any
>>>> features for inter-thread communication, so that's off the table.
>>>
>>>
>>> The CEP doesn't mention barriers (discussed earlier), but they should
>>> be supported, and __local memory (that's "shared memory" in CUDA
>>> terms, right?) could be utilized using a more explicit scheme (or
>>> implicitly if the compiler is smart). The only issue with barriers is
>>> that with OpenCL you have multiple levels of synchronization, but
>>> barriers only work within the work group / thread block, whereas with
>>> OpenMP a barrier simply applies to all your threads. I think a global
>>> barrier would have to mean kernel termination + start of a new one,
>>> which could be hard to support depending on where it is placed in the
>>> code...
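
To make the __local / barrier part concrete, the pattern I'd want to be
able to express (or have the compiler generate) is roughly this -- a
hand-written sketch, not CEP syntax; it assumes the global size equals
the array length and the host passes a tile of local-size + 2 floats:

/* Stage a tile of the input (plus a halo cell at each end) into
   __local memory so neighbour loads hit __local instead of __global.
   The barrier synchronizes only this work-group; there is no global
   barrier short of ending the kernel. */
__kernel void smooth3(__global const float *in,
                      __global float *out,
                      __local float *tile)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    int n   = get_global_size(0);

    tile[lid + 1] = in[gid];
    if (lid == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : in[gid];
    if (lid == lsz - 1)
        tile[lsz + 1] = (gid < n - 1) ? in[gid + 1] : in[gid];

    barrier(CLK_LOCAL_MEM_FENCE);   /* work-group level sync only */

    out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
}
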
>>>
>>>> (My code is here:
>>>>
>>>>
>>>> https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in
>>>>
>>>>
>>>> Though it's badly documented and rush-for-deadline-quality; I plan to
>>>> polish
>>>> it up and publish it when I get time in autumn).
>>>>
>>>> I guess I mention this as the kind of computation your CEP
>>>> definitely does NOT cover. That's probably OK, but one should figure
>>>> out specifically how many use cases it does cover (in particular
>>>> with no control over thread blocks and intra-block communication).
>>>> Is the CEP an 80% solution, or a 10% solution?
>>>
>>>
>>> I haven't looked too carefully at the code, but a large portion is
>>> dedicated to a reduction, right? What I don't see is how your
>>> reduction spans multiple work-groups / thread blocks, since
>>> __syncthreads should only sync stuff within a single block. The CEP
>
>
> Most of the time there's actually no explicit synchronization; the code
> relies on all threads of a warp being on the same instruction in the
> scheduler. __syncthreads is then only used at the end of the reduction,
> when all within-warp additions have been done. Calling __syncthreads at
> each step of the algorithm would have totally killed performance.
>
> Dag Sverre
>

Ah, clever. I don't think there's any way to figure out the warp size
with OpenCL, but maybe if the user specifies it in some way, similar
optimizations can be made.
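
For example, if the user did tell us the warp size (say 32 on current
NVidia hardware), the kind of thing we could try to generate for a
work-group sum reduction is the usual barrier-free last-warp trick.
A sketch -- with the big caveat that OpenCL makes no portable promise
about lockstep execution within a warp, so this is only an illustration:

#define WARP 32   /* would come from the user, not from OpenCL */

/* Per-work-group sum reduction; writes one partial sum per group.
   Assumes the work-group size is a power of two and >= 2*WARP. */
__kernel void reduce_sum(__global const float *in,
                         __global float *out,
                         __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Barriered tree reduction down to 2*WARP partial sums. */
    for (size_t s = lsz / 2; s > WARP; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Barrier-free tail: relies on the last warp running in lockstep. */
    if (lid < WARP) {
        volatile __local float *v = scratch;
        for (size_t s = WARP; s >= 1; s >>= 1)
            v[lid] += v[lid + s];
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}
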

>>
>> There's no need to reduce across thread blocks because (conveniently
>> enough) there's 8000 independent computations to be performed with
>> different parameters. I simply used one thread block for each problem.
>>
>> It's basically a matrix-vector product where the matrix must be
>> generated on the fly columnwise (one entry can be generated from the
>> preceding two in the same column), but the summation is row-wise.
>>
>> And it turns out that getting inter-thread sum-reduction to work well
>> was harder than I expected; a 32-by-32 matrix (needed since warps are
>> 32 threads) is too big to fit in memory, but tree-reduction makes a
>> lot of the threads in a warp do nothing. So I ended up with a hybrid
>> approach; there's a visual demo from page 49 onwards here:
>>
>> http://folk.uio.no/dagss/talk-gpusht.pdf
>>
>> Getting back to Cython, I'll admit that this form of inter-thread
>> reduction is quite generic, and that my specific problem could be solved
>> by basically coding a set of inter-thread reduction algorithms suitable
>> for different hardware into Cython.
>>
>>> didn't mention reductions, but they should be supported (I'm thinking
>>> multi-stage or sequential within the workgroup (whatever works
>>> better), followed by another kernel invocation if the result is
>>> needed).
>>
>>
>> Multiple kernel invocations for global barriers appear to be pretty
>> standard, and it's why OpenCL supports queueing tasks with
>> dependencies etc.
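
Right -- and in host API terms that global barrier is just the second
enqueue depending on the first. A sketch with the C API (PyOpenCL
exposes the same thing through events); queue, the kernels k_partial /
k_combine and n_groups are assumed to be set up already:

/* Pass 1: every work-group writes one partial sum.
   Pass 2: a single work-group combines the partials.
   The event dependency is the "global barrier" between the passes
   (with an in-order queue it would be implicit anyway). */
cl_event partials_done;
size_t gsize1 = n_groups * 256, lsize1 = 256;
size_t gsize2 = 256, lsize2 = 256;

clEnqueueNDRangeKernel(queue, k_partial, 1, NULL, &gsize1, &lsize1,
                       0, NULL, &partials_done);
clEnqueueNDRangeKernel(queue, k_combine, 1, NULL, &gsize2, &lsize2,
                       1, &partials_done, NULL);
clReleaseEvent(partials_done);
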
>>
>>> As mentioned earlier in a different thread (on parallelism I think),
>>> reduction arrays (i.e. memoryviews or C arrays) as well as generally
>>> private arrays should be supported. An issue with that is that you
>>> can't really dedicate an array to each work item / thread (too much
>>> memory would be consumed).
>>>
>>> Again, declarations within blocks would solve many problems:
>>>
>>> cdef float[n] shared_by_work_group
>>> with parallel():
>>>     cdef float[n] local_to_work_group
>>>     for i in prange(...):
>>>         cdef float[n] local_to_work_item
>>>
>>> For arrays, the reductions could be somewhat more explicit, where
>>> there is an explicit 'my_memoryview += my_local_scratch_data'. That
>>> should probably only be allowed for memory local to the work group.
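
Roughly, I'd expect those three levels to map onto OpenCL storage
classes like this -- a hand-written sketch of the target, not proposed
compiler output:

/* One possible mapping of the three levels:
     outermost array  -> __global buffer, visible to all work-groups
     parallel() level -> __local array, one copy per work-group
     prange body      -> __private array, one copy per work-item     */
__kernel void three_levels(__global const float *in,
                           __global float *result,        /* outermost */
                           __local float *group_scratch)  /* per group */
{
    float item_scratch[8];                  /* __private: per work-item */
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    for (int k = 0; k < 8; k++)
        item_scratch[k] = in[gid] * (k + 1);

    /* explicit reduction of the private array into __local memory */
    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += item_scratch[k];
    group_scratch[lid] = s;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* one work-item per group folds the __local data into __global */
    if (lid == 0) {
        float g = 0.0f;
        for (size_t k = 0; k < get_local_size(0); k++)
            g += group_scratch[k];
        result[get_group_id(0)] = g;
    }
}
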
>>>
>>> Anyway, I'll try porting some numerical codes to this scheme over
>>> the coming weeks and see what is missing and how it can be solved. I
>>> still believe it can all be made to work quite properly, without
>>> adjusting the language to fit the hardware model. The prange (and
>>> OpenMP) models look like sequential code, but they tell the compiler
>>> a lot, namely that each iteration is independent and can therefore
>>> be scheduled as a separate thread.
>>
>>
>> Again, good points.
>>
>> Dag