[Cython] cython.parallel tasks, single, master, critical, barriers

mark florisson markflorisson88 at gmail.com
Fri Oct 14 22:18:14 CEST 2011

On 14 October 2011 21:07, mark florisson <markflorisson88 at gmail.com> wrote:
> On 14 October 2011 19:31, Robert Bradshaw <robertwb at math.washington.edu> wrote:
>> On Wed, Oct 12, 2011 at 7:55 AM, mark florisson
>> <markflorisson88 at gmail.com> wrote:
>>>>> I ultimately feel things like that is more important than 100% coverage of
>>>>> the OpenMP standard. Of course, OpenMP is a lot lower-hanging fruit.
>>>> +1 Prange handles the (corse-grained) SIMD case nicely, and a
>>>> task/futures model based on closures would I think flesh this out to
>>>> the next level of generality (and complexity).
>>> Futures are definitely nice. I suppose I think really like "inline
>>> futures", i.e. openmp tasks. I realize that futures may look more
>>> pythonic. However, as mentioned previously, I also see issues with
>>> that. When you submit a task then you expect a future object, which
>>> you might want to pass around. But we don't have the GIL for that. I
>>> personally feel that futures is something that should be done by a
>>> library (such as concurrent.futures in python 3.2), and inline tasks
>>> by a language. It also means I have to write an entire function or
>>> closure for perhaps only a few lines of code.
>>> I might also want to submit other functions that are not closures, or
>>> I might want to reuse my closures that are used for tasks and for
>>> something else. So what if my tasks contain more parallel constructs?
>>> e.g. what if I have a task closure that I return from my function that
>>> generates more tasks itself? Would you just execute them sequentially
>>> outside of the parallel construct, or would you simply disallow that?
>>> Also, do you restrict future "objects" to only the parallel section?
>>> Another problem is that you can only wait on tasks of your direct
>>> children. So what if I get access to my parent's future object
>>> (assuming you allow tasks to generate tasks), and then want the result
>>> of my parent?
>>> Or what if I store these future objects in an array or list and access
>>> them arbitrarily? You will only know at runtime which task to wait on,
>>> and openmp only has a static, lexical taskwait.
>>> I suppose my point is that without either a drastic rewrite (e.g., use
>>> pthreads instead of openmp) or quite a bit of contraints, I am unsure
>>> how futures would work here. Perhaps you guys have some concrete
>>> syntax and semantics proposals?
>> It feels to me that OpenMP tasks took a different model of parallelism
>> and tried to force them into the OpenMP model/constraints, and so it'd
>> be even more difficult to fit them into a nice pythonic interface.
>> Perhaps to make progress on this front we need to have a concrete
>> example to look at. I'm also wondering if the standard threading
>> module (perhaps with overlay support) used with nogil functions would
>> be sufficient--locking is required for handling the queues, etc. so
>> the fact that the GIL is involved is not a big deal. It is possible
>> that this won't scale to as small of work units, but the overhead
>> should be minimal once your work unit is a sufficient size (which is
>> probably quite small) and it's already implemented and well
>> documented/used.
> It's all definitely possible with normal threads, but the thing you
> lose is convenience and conciseness. For big problems the programmer
> might sum up the courage and effort to implement it, but typically you
> will just stick to a serial version. This is really where OpenMP is
> powerful, you can take a simple sequential piece of code and make it
> parallel with minimal effort and without having to restructure,
> rethink and rewrite your algorithms.
> Something like concurrent.futures is definitely nice, but most people
> cannot afford to mandate python 3.2 for their users.
> The most classical examples I can think of for tasks are
> 1) independent code sections, i.e. two or more pieces of code that
> don't depend on each other which you want to execute in parallel
> 2) traversal of some kind of custom data structure, like a tree or a linked list
> 3) some kind of other producer/consumer model
> e.g. using with task syntax:
> cdef postorder_traverse(tree *t): # bullet 1) and 2)
>    with task:
>        traverse(t.left)
>    with task:
>        traverse(t.right)
>    taskwait() # wait until we traversed our subtrees
>    use(t.data)
> cdef list_traverse(linkedlist *L): # bullet 2)
>    with nogil, parallel():
>        if threadid() == 0:
>            while L.next:
>                with task:
>                    do_something(L.data)
> In the latter case we don't need a taskwait as we don't care about any
> particular order. Only one thread generates the tasks where the others
> just hit the barrier and see the tasks they can execute.
> The good thing is that the OpenMP runtime can decide at task
> generation point (not only at taskwait or barrier points!) decide to
> stop generating more tasks and start executing them. So you won't
> exhaust memory if you might have lots of tasks.
>> As for critical and barrier, the notion of a critical block as a with
>> statement is very useful. Creating/naming locks (rather than being
>> implicit on the file/line number) is more powerful, but is a larger
>> burden on the user and more difficult to support with the OpenMP
>> backend.
> Actually, as I mentioned before, critical sections do not at all
> depend on their line or file number. All they depend on their implicit
> or explicit name (the name is implicit when you simply omit it, so all
> unnamed critical sections exclude each other).
> Indeed, supporting creation of locks dynamically and allowing them to
> be passed around arbitrarily would be hard (and likely not worth the
> effort). Naming them is trivial though, which might not be incredibly
> pythonic but is very convenient, easy and readable.
>> barrier, if supported, should be a function call not a
>> context. Not as critical as with the tasks case, but a good example to
>> see how it flows would be useful here as well.
> I agree, it really doesn't have any associated code and trying to
> associate code with it is likely more confusing than meaningful. It
> was just an idea.
> Often you can rely on implicit barriers from e.g. prange, but not
> always. I can't think of any real-world example, but you usually need
> it to ensure that everyone gets a sane view on some shared data, e.g.
> with nogil, parallel():
>    array[threadid()] = func(threadid())
>    barrier()
>    use array[threadid() + 1 % omp_num_threads()] # access data of
> some neighbour
> This is a rather contrived example, but (see below) it would be
> especially useful if you use single/master/once/first that sets some
> shared data everyone will operate on (for instance in a prange). To
> ensure the data is sane before you use it, you have to put the barrier
> to 1) ensure the data has been written and 2) that the data has been
> flushed.
> Basically, you'll always know when you need a barrier, but it's pretty
> hard to come up with a real-world example for it when you have to :)
>> As for single, I see doing this manually does require boilerplate
>> locking, so what about
>> if cython.parallel.once():  # will return True once for a tread group.
>>    ...
>> we could implement this via our own locking/checking/flushing to allow
>> it to occur in arbitrary expressions, e.g.
>> special_worker = cython.parallel.once()
>> if special_worker:
>>   ...
>> [common code]
>> if special_worker:   # single wouldn't work here
>>   ...
> That looks OK. I've actually been thinking that if we have barriers we
> don't really need is_master(), once() or single() or anything. We
> already have threadid() and you usually don't care what thread gets
> there first, you only care about doing it once. So one could just
> write
> if parallel.threadid() == 0:
>    ...
> parallel.barrier() # if required
> It might also be convenient to declare variables explicitly shared
> here, e.g. this code will not work:
> cdef int *buf
> with nogil, parallel.parallel():
>    if parallel.threadid() == 0:
>        buf = ...
>    parallel.barrier()
>    # will will likely segfault, as buf is private because we assigned
> to it. It's only valid in thread 0
>    use buf[...]
> So basically you'd have to do something like (&buf)[0][...], which
> frankly looks pretty weird. However I do think such cases are rather
> uncommon.
>> - Robert
>> _______________________________________________
>> cython-devel mailing list
>> cython-devel at python.org
>> http://mail.python.org/mailman/listinfo/cython-devel

BTW, I think orphaned constructs might also really be worth our while.
Suppose you have a piece of code:

execute so many times:
    do some work we want to do in parallel
    compute something that needs to happen sequentially

And also suppose that the two things we do in the loop might be
factored out into separate functions. Now we could have a
prange()/parallel() for the parallel work, but that means we have to
start up a new parallel section every time.
If we're unlucky, the user might also innocently release the GIL as
well. There is a significant performance penalty to this, i.e. it
would be vastly more efficient to do the following:

with nogil, parallel():
    do so many times:

        if threadid() == 0:
            compute something that needs to happen sequentially

cdef void my_parallel_function(...):
    for i in prange(..., orphan=True): # workshare this loop with the
other threads in the team

This is not currently possible. Currently, every thread would call
my_parallel_function, and every function call would do the same
computations and not share any work. You can only currently avoid the
overhead  by writing all your code in the one function.

Another possibility for a keyword argument is 'worksharing', but that
would suggest normal prange()s don't share work.

What do you guys think, is this too confusing for people? I think this
is really reasonably common-ish situation.

More information about the cython-devel mailing list