[Cython] OpenMP support
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Fri Mar 11 08:56:51 CET 2011
On 03/11/2011 08:20 AM, Stefan Behnel wrote:
> Robert Bradshaw, 11.03.2011 01:46:
>> On Tue, Mar 8, 2011 at 11:16 AM, Francesc Alted <faltet at pytables.org> wrote:
>>> On Tuesday 08 March 2011 18:50:15 Stefan Behnel wrote:
>>>> mark florisson, 08.03.2011 18:00:
>>>>> What I meant was that the
>>>>> wrapper returned by the decorator would have to call the closure
>>>>> for every iteration, which introduces function call overhead.
>>>>> I guess we just have to establish what we want to do: do we
>>>>> want to support code with Python objects (and exceptions etc), or
>>>>> just C code written in Cython?
>>>> I like the approach that Sturla mentioned: using closures to
>>>> implement worker threads. I think that's very pythonic. You could do
>>>> something like this, for example:
>>>>     def worker():
>>>>         for item in queue:
>>>>             with nogil:
>>>>                 # ... work on item ...
>>>>
>>>>     start_threads(worker, count)
>>>> Note that the queue is only needed to tell the thread what to work
>>>> on. A lot of things can be shared over the closure. So the queue may
>>>> not even be required in many cases.
>>> I like this approach too. I suppose that you will need to annotate the
>>> items so that they are not Python objects, no? Something like:
>>>     def worker():
>>>         cdef int item  # tell Cython that item is not a Python object!
>>>         for item in queue:
>>>             with nogil:
>>>                 # ... work on item ...
>>>
>>>     start_threads(worker, count)
>> On a slightly higher level, are we just trying to use OpenMP from
>> Cython, or are we trying to build it into the language? If the former,
>> it may make sense to stick closer than one might otherwise be tempted
>> in terms of API to the underlying C to leverage the existing
>> documentation. A library with a more Pythonic interface could perhaps
>> be written on top of that. Alternatively, if we're building it into
>> Cython itself, I'd say it might be worth modeling it after the
>> multiprocessing module (though I understand it would be implemented
>> with threads), which I think is a decent enough model for managing
>> embarrassingly parallel operations.
>> The above code is similar to that,
>> though I'd prefer the for loop implicit rather than as part of the
>> worker method (or at least as an argument).
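For comparison, a minimal sketch in plain Python of the multiprocessing-style model with the loop kept implicit. It uses the standard library's thread-backed `multiprocessing.pool.ThreadPool`; the names `square` and `parallel_squares` are made up for illustration, and nothing here is proposed Cython syntax.

```python
from multiprocessing.pool import ThreadPool


def square(x):
    # The worker only ever sees a single item; it contains no loop.
    return x * x


def parallel_squares(items, workers=4):
    # The loop over `items` is implicit in map(), which also preserves order.
    with ThreadPool(workers) as pool:
        return pool.map(square, items)
```

In plain Python the threads still share the GIL, so this only shows the shape of the API, not a speedup.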
> It provides a simple way to write per-thread initialisation code,
> though. And it's likely easier to make looping fast than to speed up
> the call into a closure. However, eventually, both ways will need to
> be supported anyway.
>> If we went this route,
>> what are the advantages of using OpenMP over, say, pthreads in the
>> background? (And could the latter be done with just a library + some
>> fancy GIL specifications?)
> In the above example, basically everything is explicit and nothing
> more than a simplified threading setup is needed. Even the
> implementation of "start_threads()" could be done in a couple of lines
> of Python code, including the collection of results and errors. If
> someone thinks we need more than that, I'd like to see a couple of
> concrete use cases and code examples first.
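As a rough illustration of the "couple of lines of Python" claim, here is one way such a helper might look. This is a sketch under assumptions, not an existing Cython API: the signature of `start_threads()`, its `(results, errors)` return value, and the `sum_squares` usage function are all hypothetical.

```python
import queue
import threading


def start_threads(worker, count):
    """Run worker() in `count` threads and return (results, errors)."""
    results, errors = [], []
    lock = threading.Lock()

    def run():
        try:
            value = worker()
        except Exception as exc:
            # Collect the error instead of losing it in the thread.
            with lock:
                errors.append(exc)
        else:
            with lock:
                results.append(value)

    threads = [threading.Thread(target=run) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors


def sum_squares(items, count=4):
    """Hypothetical usage: the work queue is shared through the closure."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    def worker():
        total = 0
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return total
            # In Cython, this is the part that would sit in a nogil block.
            total += item * item

    results, errors = start_threads(worker, count)
    if errors:
        raise errors[0]
    return sum(results)
```

Each thread returns a partial sum, so the only synchronisation needed is around the shared result and error lists.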
>> One thing that's nice about OpenMP as
>> implemented in C is that the serial code looks almost exactly like the
>> parallel code; the code at http://wiki.cython.org/enhancements/openmp
>> has this property too.
> Writing it with a closure isn't really that much different. You can
> put the inner function right where it would normally get executed and
> add a bit of calling/load distributing code below it. Not that bad IMO.
> It may be worth providing some ready-to-use decorators to do the load
> balancing, but I don't really like the idea of having a decorator
> magically invoke the function in-place that it decorates.
>> Also, I like the idea of being able to hold the GIL by the invoking
>> thread and having the "sharing" threads do the appropriate locking
>> among themselves when needed if possible, e.g. for exception raising.
> I like the explicit "with nogil" block in my example above. It makes
> it easy to use normal Python setup code, to synchronise based on the
> GIL if desired (e.g. to use a normal Python queue for communication),
> and it's simple enough not to get in the way.
I'm supporting Robert here. Basically, I'm +1 to anything that can make
me pretend the GIL doesn't exist, even if it comes with a 2x performance
hit, because that will make me write parallel code (which I can't be
bothered to do in Cython currently), and I have 4 cores on the laptop I
use for debugging, so I'd still get a 2x speedup.
Perhaps the long-term solution is something like an "autogil" mode, where
Cython automatically releases the GIL on blocks where it can
(such as a typed for-loop), and acquires it back when needed (an
exception-raising if-block within said for-loop). And when doing
multi-threading, GIL-requiring calls are dispatched to a master
GIL-holding thread (which would not be a worker thread, i.e. on 4 cores
you'd have 4 workers + 1 GIL-holding support thread). So the advice for
speeding up code is simply "make sure your code is all typed", just like
before, but people can follow that advice without even having to learn
about the GIL.
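To make the dispatch idea concrete, here is a small plain-Python sketch of workers handing "GIL-requiring" calls to a single support thread through a queue. It only illustrates the structure: in real Python every thread shares the GIL anyway, and the function name and the queue-of-callables protocol are assumptions of mine, not an existing or planned Cython feature.

```python
import queue
import threading


def run_with_support_thread(items, n_workers=4):
    """Workers square items; one support thread runs the dispatched calls."""
    requests = queue.Queue()  # calls handed over to the support thread
    work = queue.Queue()
    for item in items:
        work.put(item)
    log = []  # only ever touched by the support thread

    def support():
        while True:
            call = requests.get()
            if call is None:  # sentinel: all workers have finished
                return
            call()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return
            result = item * item  # stands in for typed, GIL-free work
            # The "GIL-requiring" part is dispatched, not run in the worker.
            requests.put(lambda r=result: log.append(r))

    support_thread = threading.Thread(target=support)
    support_thread.start()
    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    requests.put(None)
    support_thread.join()
    return log
```

With 4 workers this gives the 4 + 1 thread layout described above; the order of the entries in the returned list depends on scheduling.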
It's all about a) lowering the learning curve for trivial purposes, and
b) allowing temporary debug print statements that use the GIL to be
inserted without having to rework the code.
As for the discussion we had on using the GIL for locking, I think that
should be made explicit, even if it is a noop currently. I once wrote
code relying on the GIL, and really missed something like
"cython.gil.lock()" to put in there just for better code readability
(yes, I used comments, but...).
> I think it simplifies things a lot when code can rely on the GIL being
> held when entering the thread function. Threading is complicated
> enough to keep it as explicit as possible.
That's exactly the thing about OpenMP: It tends to hide the complexity
of threading and allow you to get on with your life. When you say this,
it sounds a bit like "people who don't want to learn the technical inner
details of Python should just use another language than Cython".
If I write code in Fortran it may get parallelized, whereas I almost
never write parallel code in Cython (well, MPI, but not shared-memory):
all the "is-the-gil-held-or-not" bookkeeping is just too much to keep in
my head.