[Cython] Adding GPU support to cython

Schlimbach, Frank frank.schlimbach at intel.com
Fri Jan 31 06:29:36 EST 2020


Hi,
I opened a feature ticket: https://github.com/cython/cython/issues/3342
It describes my current prototype based on OpenMP.

Any feedback?

Also, I would like to do some more advanced analysis to improve the map-clauses. I do not want to go into complex index analysis or the like; a simple access analysis should already cover many cases. All I would like to figure out is whether a given variable (memoryview) was used (other than instantiated) before and/or after the 'device()'/parallel block, and ideally whether such a use was definitely read-only. Any suggestion/hint on how to do that?
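
To make concrete what kind of analysis I mean, here is a rough sketch. It uses Python's stdlib 'ast' module purely for illustration; the real pass would of course be a visitor over Cython's own tree nodes, and all names below are made up:

    import ast

    class AccessCollector(ast.NodeVisitor):
        """Record, per variable name, whether it is read and/or written."""
        def __init__(self):
            self.reads, self.writes = set(), set()

        def visit_Name(self, node):
            (self.writes if isinstance(node.ctx, ast.Store) else self.reads).add(node.id)
            self.generic_visit(node)

        def visit_Subscript(self, node):
            # X[i] = ... stores through the container; count that as a write to X
            # so it becomes a 'from'/'tofrom' candidate rather than plain 'to'.
            if isinstance(node.ctx, ast.Store) and isinstance(node.value, ast.Name):
                self.writes.add(node.value.id)
            self.generic_visit(node)

    def classify(code_before, code_after, var):
        """Return (used_before, used_after, read_only_after) for 'var'."""
        before, after = AccessCollector(), AccessCollector()
        before.visit(ast.parse(code_before))
        after.visit(ast.parse(code_after))
        return (var in before.reads | before.writes,
                var in after.reads | after.writes,
                var in after.reads and var not in after.writes)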

Thanks

frank

-----Original Message-----
From: cython-devel <cython-devel-bounces+frank.schlimbach=intel.com at python.org> On Behalf Of Schlimbach, Frank
Sent: Friday, January 24, 2020 12:55 PM
To: Core developer mailing list of the Cython compiler <cython-devel at python.org>
Subject: Re: [Cython] Adding GPU support to cython

Hi Stefan,
thanks for your response. Good to hear this is still of interest.

Yes, I realized these are rather old CEPs. I spent some time looking into the Cython code and concluded that the most consistent (and simplest) approach is to stick with OpenMP and use its offload pragmas (e.g. the 'target' pragma introduced in 4.5). Given a properly set-up compiler, this would in theory only require one or two compiler flags to enable offloading. I even have a first prototype which generates code that existing compilers seem to swallow. It is not ready for a PR since I have not yet been able to get it linked and run on a GPU, and I wanted to get some general feedback first. You can find the code on my offload branch https://github.com/fschlimb/cython/tree/offload (it is work in progress, so please excuse that not all comments have been updated yet to reflect my changes).
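
For completeness, this is roughly how I imagine the build side would look; the exact flag spellings below are compiler-specific placeholders (icx, clang and gcc all name their offload targets differently), so take this only as a sketch:

    # hypothetical setup.py; '-fopenmp-targets=spir64' is just a placeholder,
    # the real value depends on the compiler and the device backend
    from setuptools import Extension, setup
    from Cython.Build import cythonize

    ext = Extension(
        "offload_demo",
        sources=["offload_demo.pyx"],
        extra_compile_args=["-fopenmp", "-fopenmp-targets=spir64"],
        extra_link_args=["-fopenmp", "-fopenmp-targets=spir64"],
    )

    setup(ext_modules=cythonize([ext]))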

Here's what it does:
- accept a new 'with' directive 'device' which marks a region/block to be offloaded to a device (OpenMP target)
  - I also considered extending 'gil' or 'parallel' to accept an optional 'device' argument, but an extra directive seemed more general/flexible and also allows non-parallel code
  - I don't believe we should try to automate offloading right now. Once we have something that works on explicit demand we can still think about a performance model and auto-enable offloading.
- the DeviceWithBlockNode is added to the 'parallel stack' and can occur only as the outermost parallel directive
- a 'with device()' requires 'nogil'
- a 'with device()' will create a new scope annotated with a '#pragma omp target'
  - all variables which get assigned within the 'with device()' block are currently mapped as 'tofrom'
  - all other variables used are mapped as 'to'
  - identifying 'from' candidates is harder and not yet done (we need to know that the variable is allocated but not assigned before the 'with device()' block)
  - identifying 'alloc' candidates would also need additional analysis (e.g. not used outside the 'device()' block)
- all object-mode functionality (like exceptions for error handling) is currently disabled in a 'with device()' block

Example:

def f(int[:,::1] X):
    cdef int v = 1
    cdef int i
    with nogil, device(), parallel():
        for i in prange(4):
            X[i] = v

the 'with device' block becomes something like (simplified)

{
    size_t __pyx_v_X__count = __pyx_v_X.shape[0]*__pyx_v_X.shape[1];
    #pragma omp target map(to: __pyx_v_v) map(tofrom: __pyx_v_i , __pyx_v_X.data[0:__pyx_v_X__count], __pyx_v_X.memview, __pyx_v_X.shape, __pyx_v_X.strides, __pyx_v_X.suboffsets)
    {
        #pragma omp parallel
        #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i)
        for (__pyx_v_i = 0; __pyx_v_i < 4; ++__pyx_v_i) {
             __pyx_v_X[__pyx_v_i] = __pyx_v_v;
        }
    }
}

There are lots of things to be added and improved. In particular, I am currently adding an optional argument 'map' to 'device()' which allows manually setting the map-clauses for each variable. This is necessary not only for optimizations but also for sending only partial array data to/from the device (for example, when device memory cannot hold an entire array, the developer would block/tile the computation). We can probably add some magic for simple cases, but there is probably no solution to the general problem of determining the accessed index space.
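
As a purely hypothetical sketch of what that could look like (the spelling of the 'map' argument is not final):

    # hypothetical syntax -- 'device(map=...)' is the optional argument described
    # above and not part of released Cython; it overrides the inferred map-clauses:
    #   'X' is a read-only input (copy to the device only),
    #   'Y' is updated in place (copy back after the region).
    def saxpy(float[::1] X, float[::1] Y, float a, int n):
        cdef int i
        with nogil, device(map={'X': 'to', 'Y': 'tofrom'}), parallel():
            for i in prange(n):
                Y[i] = a * X[i] + Y[i]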

Among other things, items that still need looking at include
- non-contiguous arrays/memviews
- overlapping arrays/memviews
- keeping data on the device between 'with device()' blocks (USM (unified shared memory) or omp target data?)
- error handling
- tests
- documentation/comments

I found that the functionality I needed to touch is somewhat scattered around the compiler pipeline. It might be worth thinking about restructuring a few things to make the whole OpenMP/parallel/offload stuff more maintainable. Of course you might see other solutions than mine which make this simpler.

Any thoughts/feedback/usecases appreciated

frank

-----Original Message-----
From: cython-devel <cython-devel-bounces+frank.schlimbach=intel.com at python.org> On Behalf Of Stefan Behnel
Sent: Friday, January 24, 2020 11:22 AM
To: cython-devel at python.org
Subject: Re: [Cython] Adding GPU support to cython

Hi Frank,

sorry for leaving this unanswered for a while. I'm far from an expert in this, but it looks like no-one else jumped in, so here's my response.


Schlimbach, Frank schrieb am 06.01.20 um 12:09:
> I would like to work on extending Cython with a way to offload Cython code to a GPU. I found two related CEPs (https://github.com/cython/cython/wiki/enhancements-opencl and https://github.com/cython/cython/wiki/enchancements-metadefintions).

So, just for a bit of context, these CEPs were written a decade ago, and Cython's code base, feature set, and the general Python ecosystem have evolved a lot since then. For example, "cython.parallel" and "prange()"
weren't even implemented back then.


> My current thinking is that a solution along the lines of the OpenCL CEP is most effective; it does require many code changes and seems to provide a good tradeoff between usability and efficiency.
> 
> I would like to suggest a few modifications to this approach, like
> 
>   *   using SYCL instead of OpenCL to closely follow existing parallel/prange semantics more easily
>   *   selecting the device (CPU, GPU) per region rather than per file
>   *   maybe allowing calls to appropriately annotated and written external functions
> 
> I would be very grateful for any thoughts about this topic in general and for any advice on how to approach this so that a solution is found that is most broadly useful and most cythonic.

It would definitely be cool to generate GPU support from the existing Cython patterns, in addition to the OpenMP code that we already generate.
If that can be done, then users could enable GPU support by adding a C compiler define to their CFLAGS (rather than rerunning Cython), or even select between the two versions at runtime.

If the GPU support is per region, then how is the code section shipped to the GPU? Is the infrastructure for this provided by the OpenCL framework or does the user or the module need to set something up in addition?

Finally, generally speaking:
- PR welcome
- simple approach preferred (at least to get this started and prototyped)
- discussion welcome on this mailing list
- GitHub feature ticket seems to be missing, with a link to the ML thread https://mail.python.org/pipermail/cython-devel/2020-January/005262.html

Stefan