[Cython] cython.parallel tasks, single, master, critical, barriers

Wed Oct 19 20:19:15 CEST 2011

On 19 October 2011 06:01, Robert Bradshaw <robertwb at math.washington.edu> wrote:
> On Fri, Oct 14, 2011 at 1:07 PM, mark florisson
> <markflorisson88 at gmail.com> wrote:
>> On 14 October 2011 19:31, Robert Bradshaw <robertwb at math.washington.edu> wrote:
>>> On Wed, Oct 12, 2011 at 7:55 AM, mark florisson
>>> <markflorisson88 at gmail.com> wrote:
>>>>>> I ultimately feel things like that is more important than 100% coverage of
>>>>>> the OpenMP standard. Of course, OpenMP is a lot lower-hanging fruit.
>>>>>
>>>>> +1 Prange handles the (corse-grained) SIMD case nicely, and a
>>>>> task/futures model based on closures would I think flesh this out to
>>>>> the next level of generality (and complexity).
>>>>
>>>> Futures are definitely nice. I suppose I think really like "inline
>>>> futures", i.e. openmp tasks. I realize that futures may look more
>>>> pythonic. However, as mentioned previously, I also see issues with
>>>> that. When you submit a task then you expect a future object, which
>>>> you might want to pass around. But we don't have the GIL for that. I
>>>> personally feel that futures is something that should be done by a
>>>> library (such as concurrent.futures in python 3.2), and inline tasks
>>>> by a language. It also means I have to write an entire function or
>>>> closure for perhaps only a few lines of code.
>>>>
>>>> I might also want to submit other functions that are not closures, or
>>>> I might want to reuse my closures that are used for tasks and for
>>>> something else. So what if my tasks contain more parallel constructs?
>>>> e.g. what if I have a task closure that I return from my function that
>>>> generates more tasks itself? Would you just execute them sequentially
>>>> outside of the parallel construct, or would you simply disallow that?
>>>> Also, do you restrict future "objects" to only the parallel section?
>>>>
>>>> Another problem is that you can only wait on tasks of your direct
>>>> children. So what if I get access to my parent's future object
>>>> (assuming you allow tasks to generate tasks), and then want the result
>>>> of my parent?
>>>> Or what if I store these future objects in an array or list and access
>>>> them arbitrarily? You will only know at runtime which task to wait on,
>>>> and openmp only has a static, lexical taskwait.
>>>>
>>>> I suppose my point is that without either a drastic rewrite (e.g., use
>>>> pthreads instead of openmp) or quite a bit of contraints, I am unsure
>>>> how futures would work here. Perhaps you guys have some concrete
>>>> syntax and semantics proposals?
>>>
>>> It feels to me that OpenMP tasks took a different model of parallelism
>>> and tried to force them into the OpenMP model/constraints, and so it'd
>>> be even more difficult to fit them into a nice pythonic interface.
>>> Perhaps to make progress on this front we need to have a concrete
>>> example to look at. I'm also wondering if the standard threading
>>> module (perhaps with overlay support) used with nogil functions would
>>> be sufficient--locking is required for handling the queues, etc. so
>>> the fact that the GIL is involved is not a big deal. It is possible
>>> that this won't scale to as small of work units, but the overhead
>>> should be minimal once your work unit is a sufficient size (which is
>>> probably quite small) and it's already implemented and well
>>> documented/used.
>>
>> It's all definitely possible with normal threads, but the thing you
>> lose is convenience and conciseness. For big problems the programmer
>> might sum up the courage and effort to implement it, but typically you
>> will just stick to a serial version. This is really where OpenMP is
>> powerful, you can take a simple sequential piece of code and make it
>> parallel with minimal effort and without having to restructure,
>> rethink and rewrite your algorithms.
>
> That is a very good point.
>
>> Something like concurrent.futures is definitely nice, but most people
>> cannot afford to mandate python 3.2 for their users.
>>
>> The most classical examples I can think of for tasks are
>>
>> 1) independent code sections, i.e. two or more pieces of code that
>> don't depend on each other which you want to execute in parallel
>> 2) traversal of some kind of custom data structure, like a tree or a linked list
>> 3) some kind of other producer/consumer model
>>
>> e.g. using with task syntax:
>>
>> cdef postorder_traverse(tree *t): # bullet 1) and 2)
>>    with task:
>>        traverse(t.left)
>>    with task:
>>        traverse(t.right)
>>
>>    taskwait() # wait until we traversed our subtrees
>>    use(t.data)
>
> Is there an implicit parallel block here? Perhaps in the caller?

Yes, it was implicit in my example. If you'd use that code, you'd call
it from a parallel section. Depending on what semantics you'd define
(see below), you'd call it either from one thread in the team, or with
all of them.

>> cdef list_traverse(linkedlist *L): # bullet 2)
>>    with nogil, parallel():
>>        if threadid() == 0:
>>            while L.next:
>>                with task:
>>                    do_something(L.data)
>>
>> In the latter case we don't need a taskwait as we don't care about any
>> particular order. Only one thread generates the tasks where the others
>> just hit the barrier and see the tasks they can execute.
>
> I guess it's the fact that Python doesn't have a nice syntax for
> anonymous functions or blocks does make this syntax more appealing
> than an explicit closure.
>
> Perhaps if we came up with a more pythonic/natural name which would
> make the intent clear. Makes me want to do something like
>
> pool = ThreadPool(10)
> for item in L:
>    with pool:
>        process(item)
>
> but then you get into issues of passing the pool around. OpenMP has
> the implicit pool of the nesting parallel block, so "with one thread"
> or "with cython.parallel.pool" or something like that might be more
> readable.

I think with pool would be good, it must be clear that the task is
submitted to a threadpool and hence may be executed asynchronously.

>> The good thing is that the OpenMP runtime can decide at task
>> generation point (not only at taskwait or barrier points!) decide to
>> stop generating more tasks and start executing them. So you won't
>> exhaust memory if you might have lots of tasks.
>
> Often threadpools have queues that block when their buffer gets full
> to achieve the same goal.
>
>>> As for critical and barrier, the notion of a critical block as a with
>>> statement is very useful. Creating/naming locks (rather than being
>>> implicit on the file/line number) is more powerful, but is a larger
>>> burden on the user and more difficult to support with the OpenMP
>>> backend.
>>
>> Actually, as I mentioned before, critical sections do not at all
>> depend on their line or file number. All they depend on their implicit
>> or explicit name (the name is implicit when you simply omit it, so all
>> unnamed critical sections exclude each other).
>
> Ah, yes. In this case "with cython.parallel.lock([optional name])"
> could be obvious enough.
>
>> Indeed, supporting creation of locks dynamically and allowing them to
>> be passed around arbitrarily would be hard (and likely not worth the
>> effort). Naming them is trivial though, which might not be incredibly
>> pythonic but is very convenient, easy and readable.
>
> You can view this as a lookup by name, not a lock creation. Not
> allowing them to be used outside of a with clause is a reasonable
> restriction, and does not preclude a (possibly very distant) extension
> to being able to pass them around.
>
>>> barrier, if supported, should be a function call not a
>>> context. Not as critical as with the tasks case, but a good example to
>>> see how it flows would be useful here as well.
>>
>> I agree, it really doesn't have any associated code and trying to
>> associate code with it is likely more confusing than meaningful. It
>> was just an idea.
>> Often you can rely on implicit barriers from e.g. prange, but not
>> always. I can't think of any real-world example, but you usually need
>> it to ensure that everyone gets a sane view on some shared data, e.g.
>>
>> with nogil, parallel():
>>    array[threadid()] = func(threadid())
>>    barrier()
>>    use array[threadid() + 1 % omp_num_threads()] # access data of
>> some neighbour
>>
>> This is a rather contrived example, but (see below) it would be
>> especially useful if you use single/master/once/first that sets some
>> shared data everyone will operate on (for instance in a prange). To
>> ensure the data is sane before you use it, you have to put the barrier
>> to 1) ensure the data has been written and 2) that the data has been
>> flushed.
>>
>> Basically, you'll always know when you need a barrier, but it's pretty
>> hard to come up with a real-world example for it when you have to :)
>
> Yes, I think barriers are explanatory enough.
>
>>> As for single, I see doing this manually does require boilerplate
>>> locking, so what about
>>>
>>> if cython.parallel.once():  # will return True once for a tread group.
>>>    ...
>>>
>>> we could implement this via our own locking/checking/flushing to allow
>>> it to occur in arbitrary expressions, e.g.
>>>
>>> special_worker = cython.parallel.once()
>>> if special_worker:
>>>   ...
>>> [common code]
>>> if special_worker:   # single wouldn't work here
>>>   ...
>>>
>>
>> That looks OK. I've actually been thinking that if we have barriers we
>> don't really need is_master(), once() or single() or anything. We
>> already have threadid() and you usually don't care what thread gets
>> there first, you only care about doing it once. So one could just
>> write
>>
>> if parallel.threadid() == 0:
>>    ...
>>
>> parallel.barrier() # if required
>
> Perhaps you want the first free thread to take it up to minimize idle
> threads. I agree if parallel.threadid() == 0 is a synonym for
> is_master(), so probably not needed. However, what are the OpenMP
> semantics of
>
> cdef f():
>    with parallel():
>        g()
>        g()
>
> cdef g():
>    with single():
>        ... # executed once, right?
>    with task:
>        ... # executed twice, right?

Hmm, not quite. The thing is that function g is called by every thread
in the team, say N threads, and for each time the team encounters the
single directive, it will execute it once, so in total it will execute
the code in the single block twice, as the team encounters it twice.

It will however create 2N tasks to execute, as every thread that
encounters it creates a task. This is probably not what you want, so
you usually want

with parallel():
    if threadid() == 0:
        g()

and have the code in g (executed by one thread only) create the tasks.

Note also how 'for _ in prange(1):' would not have the same semantics
here, as it generates a 'parallel for' and not a worksharing for in
the function (because we don't support orphaned pranges).

I think this may all be confusing for users, I think usually you will
want to create just a single task irrespective of whether you are in a
parallel or a prange and not "however many threads are in the team for
parallel and just one for prange because we're sharing work". This
would also work for orphaned tasks, e.g. you expect 2 tasks in your
snippet above, not 2N. Fortunately, that would be easy to support.
We would however have to introduce the same restriction as with
(implicit) barriers: either all or none of the threads must encounter
the construct (or maybe loosen it to "if you actually want to create
the task, make sure at least thread 0 encounters it", which may lead
users to write more efficient code).

>> It might also be convenient to declare variables explicitly shared
>> here, e.g. this code will not work:
>>
>> cdef int *buf
>>
>> with nogil, parallel.parallel():
>>    if parallel.threadid() == 0:
>>        buf = ...
>>
>>    parallel.barrier()
>>
>>    # will will likely segfault, as buf is private because we assigned
>> to it. It's only valid in thread 0
>>    use buf[...]
>>
>> So basically you'd have to do something like (&buf)[0][...], which
>> frankly looks pretty weird. However I do think such cases are rather
>> uncommon.
>
> True. Perhaps this could be declared via "with nogil,
> parallel.parallel(), parallel.shared(buf)" or something like that.

That looks elegant enough.

> - Robert
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel
>