[Cython] cython.parallel tasks, single, master, critical, barriers

Wed Oct 19 21:45:02 CEST 2011

On 19 October 2011 19:19, mark florisson <markflorisson88 at gmail.com> wrote:
> On 19 October 2011 06:01, Robert Bradshaw <robertwb at math.washington.edu> wrote:
>> On Fri, Oct 14, 2011 at 1:07 PM, mark florisson
>> <markflorisson88 at gmail.com> wrote:
>>> On 14 October 2011 19:31, Robert Bradshaw <robertwb at math.washington.edu> wrote:
>>>> On Wed, Oct 12, 2011 at 7:55 AM, mark florisson
>>>> <markflorisson88 at gmail.com> wrote:
>>>>>>> I ultimately feel things like that are more important than 100% coverage of
>>>>>>> the OpenMP standard. Of course, OpenMP is a lot lower-hanging fruit.
>>>>>>
>>>>>> +1 Prange handles the (coarse-grained) SIMD case nicely, and a
>>>>>> task/futures model based on closures would I think flesh this out to
>>>>>> the next level of generality (and complexity).
>>>>>
>>>>> Futures are definitely nice. I suppose I really like "inline
>>>>> futures", i.e. openmp tasks. I realize that futures may look more
>>>>> pythonic. However, as mentioned previously, I also see issues with
>>>>> that. When you submit a task then you expect a future object, which
>>>>> you might want to pass around. But we don't have the GIL for that. I
>>>>> personally feel that futures are something that should be done by a
>>>>> library (such as concurrent.futures in python 3.2), and inline tasks
>>>>> by a language. It also means I have to write an entire function or
>>>>> closure for perhaps only a few lines of code.
>>>>>
>>>>> I might also want to submit other functions that are not closures, or
>>>>> I might want to reuse my closures that are used for tasks and for
>>>>> something else. So what if my tasks contain more parallel constructs?
>>>>> e.g. what if I have a task closure that I return from my function that
>>>>> generates more tasks itself? Would you just execute them sequentially
>>>>> outside of the parallel construct, or would you simply disallow that?
>>>>> Also, do you restrict future "objects" to only the parallel section?
>>>>>
>>>>> Another problem is that you can only wait on tasks of your direct
>>>>> children. So what if I get access to my parent's future object
>>>>> (assuming you allow tasks to generate tasks), and then want the result
>>>>> of my parent?
>>>>> Or what if I store these future objects in an array or list and access
>>>>> them arbitrarily? You will only know at runtime which task to wait on,
>>>>> and openmp only has a static, lexical taskwait.
>>>>>
>>>>> I suppose my point is that without either a drastic rewrite (e.g., use
>>>>> pthreads instead of openmp) or quite a bit of constraints, I am unsure
>>>>> how futures would work here. Perhaps you guys have some concrete
>>>>> syntax and semantics proposals?
>>>>
>>>> It feels to me that OpenMP tasks took a different model of parallelism
>>>> and tried to force them into the OpenMP model/constraints, and so it'd
>>>> be even more difficult to fit them into a nice pythonic interface.
>>>> Perhaps to make progress on this front we need to have a concrete
>>>> example to look at. I'm also wondering if the standard threading
>>>> module (perhaps with overlay support) used with nogil functions would
>>>> be sufficient--locking is required for handling the queues, etc. so
>>>> the fact that the GIL is involved is not a big deal. It is possible
>>>> that this won't scale to as small of work units, but the overhead
>>>> should be minimal once your work unit is a sufficient size (which is
>>>> probably quite small) and it's already implemented and well
>>>> documented/used.
>>>
>>> It's all definitely possible with normal threads, but the thing you
>>> lose is convenience and conciseness. For big problems the programmer
>>> might sum up the courage and effort to implement it, but typically you
>>> will just stick to a serial version. This is really where OpenMP is
>>> powerful, you can take a simple sequential piece of code and make it
>>> parallel with minimal effort and without having to restructure,
>>> rethink and rewrite your algorithms.
>>
>> That is a very good point.
>>
>>> Something like concurrent.futures is definitely nice, but most people
>>> cannot afford to mandate python 3.2 for their users.
>>>
>>> The most classical examples I can think of for tasks are
>>>
>>> 1) independent code sections, i.e. two or more pieces of code that
>>> don't depend on each other which you want to execute in parallel
>>> 2) traversal of some kind of custom data structure, like a tree or a linked list
>>> 3) some kind of other producer/consumer model
>>>
>>> e.g. using with task syntax:
>>>
>>> cdef postorder_traverse(tree *t): # bullet 1) and 2)
>>>    with task:
>>>        traverse(t.left)
>>>    with task:
>>>        traverse(t.right)
>>>
>>>    taskwait() # wait until we traversed our subtrees
>>>    use(t.data)
>>
>> Is there an implicit parallel block here? Perhaps in the caller?
>
> Yes, it was implicit in my example. If you'd use that code, you'd call
> it from a parallel section. Depending on what semantics you'd define
> (see below), you'd call it either from one thread in the team, or with
> all of them.
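A rough plain-Python sketch of the postorder traversal quoted above, for comparison only. It assumes one thread per subtree in place of each "with task:" block, and joining those threads plays the role of taskwait(). The names Node, visited and the lock argument are hypothetical and not part of any proposed cython.parallel API; a thread per task would not scale to large trees.

```python
import threading

class Node:
    """Hypothetical binary tree node, standing in for the `tree` struct."""
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def postorder_traverse(t, out, lock):
    """Traverse both subtrees concurrently, then handle the node itself."""
    if t is None:
        return
    # Each thread stands in for one "with task:" block.
    children = [
        threading.Thread(target=postorder_traverse, args=(child, out, lock))
        for child in (t.left, t.right)
    ]
    for c in children:
        c.start()
    for c in children:
        c.join()  # plays the role of taskwait()
    with lock:
        out.append(t.data)  # stands in for use(t.data)

tree = Node(1, Node(2, Node(4), Node(5)), Node(3))
visited = []
postorder_traverse(tree, visited, threading.Lock())
```

The order among siblings is not fixed, but every node is recorded only after both of its subtrees are done.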
>
>>> cdef list_traverse(linkedlist *L): # bullet 2)
>>>    with nogil, parallel():
>>>        if threadid() == 0:
>>>            while L:
>>>                with task:
>>>                    do_something(L.data)
>>>                L = L.next # advance, otherwise the loop never terminates
>>>
>>> In the latter case we don't need a taskwait as we don't care about any
>>> particular order. Only one thread generates the tasks while the others
>>> just hit the barrier and see the tasks they can execute.
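The single-producer pattern described here can be approximated with a library thread pool. This is only a sketch using concurrent.futures from the standard library: one thread walks the list and submits a task per node, and the pool's workers pick them up. ListNode, do_something and list_traverse are hypothetical stand-ins for the names in the quoted snippet.

```python
from concurrent.futures import ThreadPoolExecutor

class ListNode:
    """Hypothetical singly linked list node."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def do_something(data):
    # Placeholder work: square the payload.
    return data * data

def list_traverse(head, workers=4):
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        node = head
        while node is not None:
            # The calling thread only generates tasks; workers execute them.
            futures.append(pool.submit(do_something, node.data))
            node = node.next
    # Leaving the with block waits for all submitted tasks, much like
    # the barrier at the end of the parallel section.
    return [f.result() for f in futures]

head = ListNode(1, ListNode(2, ListNode(3)))
results = list_traverse(head)
```

Unlike the OpenMP version, every task here goes through a locked queue, so this only pays off when each work unit is reasonably large.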
>>
>> I guess the fact that Python doesn't have a nice syntax for
>> anonymous functions or blocks does make this syntax more appealing
>> than an explicit closure.
>>
>> Perhaps we could come up with a more pythonic/natural name that would
>> make the intent clear. Makes me want to do something like
>>
>> pool = ThreadPool(10)
>> for item in L:
>>    with pool:
>>        process(item)
>>
>> but then you get into issues of passing the pool around. OpenMP has
>> the implicit pool of the nesting parallel block, so "with one thread"
>> or "with cython.parallel.pool" or something like that might be more
>> readable.
>
> I think "with pool" would be good; it must be clear that the task is
> submitted to a threadpool and hence may be executed asynchronously.
>
>>> The good thing is that the OpenMP runtime can decide at the task
>>> generation point (not only at taskwait or barrier points!) to
>>> stop generating more tasks and start executing them. So you won't
>>> exhaust memory even if you have lots of tasks.
>>
>> Often threadpools have queues that block when their buffer gets full
>> to achieve the same goal.
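A minimal sketch of that blocking-queue behaviour, assuming a hand-rolled pool built on queue.Queue with a maxsize. The producer's put() blocks while the buffer is full, so at most max_pending tasks are ever waiting. run_bounded_pool and its parameters are hypothetical names chosen for illustration.

```python
import queue
import threading

def run_bounded_pool(items, workers=3, max_pending=2):
    """Double every item using worker threads fed by a bounded queue."""
    tasks = queue.Queue(maxsize=max_pending)
    stop = object()  # sentinel telling a worker to exit
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is stop:
                break
            with results_lock:
                results.append(item * 2)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)  # blocks while max_pending items are already queued
    for _ in threads:
        tasks.put(stop)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)

doubled = run_bounded_pool(range(10))
```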
>>
>>>> As for critical and barrier, the notion of a critical block as a with
>>>> statement is very useful. Creating/naming locks (rather than being
>>>> implicit on the file/line number) is more powerful, but is a larger
>>>> burden on the user and more difficult to support with the OpenMP
>>>> backend.
>>>
>>> Actually, as I mentioned before, critical sections do not at all
>>> depend on their line or file number. All they depend on is their implicit
>>> or explicit name (the name is implicit when you simply omit it, so all
>>> unnamed critical sections exclude each other).
>>
>> Ah, yes. In this case "with cython.parallel.lock([optional name])"
>> could be obvious enough.
>>
>>> Indeed, supporting creation of locks dynamically and allowing them to
>>> be passed around arbitrarily would be hard (and likely not worth the
>>> effort). Naming them is trivial though, which might not be incredibly
>>> pythonic but is very convenient, easy and readable.
>>
>> You can view this as a lookup by name, not a lock creation. Not
>> allowing them to be used outside of a with clause is a reasonable
>> restriction, and does not preclude a (possibly very distant) extension
>> to being able to pass them around.
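The "lookup by name" view can be sketched in plain Python as a registry of locks. This is an illustration under assumptions, not the proposed implementation: lock() below is a hypothetical helper mirroring the suggested "with cython.parallel.lock([optional name])", and all unnamed uses share a single lock, as unnamed OpenMP critical sections do.

```python
import threading

_registry_guard = threading.Lock()
_named_locks = {}

def lock(name=None):
    """Return the lock registered under `name`, creating it on first use.

    All unnamed uses (name=None) share one lock.
    """
    with _registry_guard:
        if name not in _named_locks:
            _named_locks[name] = threading.Lock()
        return _named_locks[name]

counter = 0

def bump(times):
    global counter
    for _ in range(times):
        with lock("counter"):  # critical section identified by its name
            counter += 1

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the lock is only ever obtained inside a with statement, nothing is passed around, which matches the restriction discussed above.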
>>
>>>> barrier, if supported, should be a function call not a
>>>> context. Not as critical as with the tasks case, but a good example to
>>>> see how it flows would be useful here as well.
>>>
>>> I agree, it really doesn't have any associated code and trying to
>>> associate code with it is likely more confusing than meaningful. It
>>> was just an idea.
>>> Often you can rely on implicit barriers from e.g. prange, but not
>>> always. I can't think of any real-world example, but you usually need
>>> it to ensure that everyone gets a sane view on some shared data, e.g.
>>>
>>> with nogil, parallel():
>>>    array[threadid()] = func(threadid())
>>>    barrier()
>>>    use array[(threadid() + 1) % omp_num_threads()] # access data of some neighbour
>>>
>>> This is a rather contrived example, but (see below) it would be
>>> especially useful if you use single/master/once/first that sets some
>>> shared data everyone will operate on (for instance in a prange). To
>>> ensure the data is sane before you use it, you have to put the barrier
>>> to 1) ensure the data has been written and 2) that the data has been
>>> flushed.
>>>
>>> Basically, you'll always know when you need a barrier, but it's pretty
>>> hard to come up with a real-world example for it when you have to :)
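The contrived neighbour example can be run in plain Python with threading.Barrier, as a sketch only. Every thread writes its own slot, waits at the barrier, and then reads the slot of the next thread. NUM_THREADS, func and neighbour_values are hypothetical names for illustration.

```python
import threading

NUM_THREADS = 4
array = [None] * NUM_THREADS
neighbour_values = [None] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)

def func(tid):
    # Placeholder computation.
    return tid * 10

def body(tid):
    array[tid] = func(tid)
    barrier.wait()  # nobody reads until every slot has been written
    neighbour_values[tid] = array[(tid + 1) % NUM_THREADS]

threads = [threading.Thread(target=body, args=(tid,)) for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the barrier, a thread could read its neighbour's slot before it was written and see None.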
>>
>> Yes, I think barriers are explanatory enough.
>>
>>>> As for single, I see doing this manually does require boilerplate
>>>> locking, so what about
>>>>
>>>> if cython.parallel.once():  # will return True once for a thread group.
>>>>    ...
>>>>
>>>> we could implement this via our own locking/checking/flushing to allow
>>>> it to occur in arbitrary expressions, e.g.
>>>>
>>>> special_worker = cython.parallel.once()
>>>> if special_worker:
>>>>   ...
>>>> [common code]
>>>> if special_worker:   # single wouldn't work here
>>>>   ...
>>>>
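A rough Python sketch of what such a once() could do behind the scenes, assuming a simple lock plus flag. The Once class is hypothetical and covers a single encounter only; a real construct would also need to reset per encounter and per thread group.

```python
import threading

class Once:
    """Hypothetical stand-in for once(): returns True for exactly one caller."""
    def __init__(self):
        self._lock = threading.Lock()
        self._taken = False

    def __call__(self):
        with self._lock:
            if self._taken:
                return False
            self._taken = True
            return True

once = Once()
record_lock = threading.Lock()
special_workers = []
common_runs = []

def body(tid):
    special_worker = once()  # the first thread to arrive gets True
    if special_worker:
        with record_lock:
            special_workers.append(tid)
    with record_lock:
        common_runs.append(tid)  # [common code], run by every thread

threads = [threading.Thread(target=body, args=(tid,)) for tid in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Since the result is an ordinary boolean, it can be stored and tested again later, which is the point of the special_worker variant above.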
>>>
>>> That looks OK. I've actually been thinking that if we have barriers we
>>> don't really need is_master(), once() or single() or anything. We
>>> already have threadid() and you usually don't care what thread gets
>>> there first, you only care about doing it once. So one could just
>>> write
>>>
>>> if parallel.threadid() == 0:
>>>    ...
>>>
>>> parallel.barrier() # if required
>>
>> Perhaps you want the first free thread to take it up to minimize idle
>> threads. I agree if parallel.threadid() == 0 is a synonym for
>> is_master(), so probably not needed. However, what are the OpenMP
>> semantics of
>>
>> cdef f():
>>    with parallel():
>>        g()
>>        g()
>>
>> cdef g():
>>    with single():
>>        ... # executed once, right?
>>    with task:
>>        ... # executed twice, right?
>
> Hmm, not quite. The thing is that function g is called by every thread
> in the team, say N threads, and for each time the team encounters the
> single directive, it will execute it once, so in total it will execute
> the code in the single block twice, as the team encounters it twice.
>
> It will however create 2N tasks to execute, as every thread that
> encounters it creates a task. This is probably not what you want, so
> you usually want
>
> with parallel():
>    if threadid() == 0:
>        g()
>
> and have the code in g (executed by one thread only) create the tasks.
>
> Note also how 'for _ in prange(1):' would not have the same semantics
> here, as it generates a 'parallel for' and not a worksharing for in
> the function (because we don't support orphaned pranges).
>
> I think this may all be confusing for users. I think usually you will
> want to create just a single task irrespective of whether you are in a
> parallel or a prange and not "however many threads are in the team for
> parallel and just one for prange because we're sharing work". This
> would also work for orphaned tasks, e.g. you expect 2 tasks in your
> snippet above, not 2N. Fortunately, that would be easy to support.
> We would however have to introduce the same restriction as with
> (implicit) barriers: either all or none of the threads must encounter
> the construct (or maybe loosen it to "if you actually want to create
> the task, make sure at least thread 0 encounters it", which may lead
> users to write more efficient code).
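The 2N versus 2 difference can be made concrete with a small counting simulation in plain Python. This is only a sketch: each call to g() records one task creation instead of actually creating a task, and run_team and guard_thread_zero are hypothetical names.

```python
import threading

N = 4  # team size

def run_team(guard_thread_zero):
    """Count task creations when each team member calls g() twice."""
    created = []
    lock = threading.Lock()

    def g(tid):
        with lock:
            created.append(tid)  # stands in for one "with task:" encounter

    def body(tid):
        if guard_thread_zero and tid != 0:
            return  # only thread 0 generates tasks
        g(tid)
        g(tid)

    threads = [threading.Thread(target=body, args=(tid,)) for tid in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(created)

unguarded = run_team(guard_thread_zero=False)
guarded = run_team(guard_thread_zero=True)
```

With every thread encountering the construct, 2N creations are counted; with the threadid() == 0 guard, only 2.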
>
>>> It might also be convenient to declare variables explicitly shared
>>> here, e.g. this code will not work:
>>>
>>> cdef int *buf
>>>
>>> with nogil, parallel.parallel():
>>>    if parallel.threadid() == 0:
>>>        buf = ...
>>>
>>>    parallel.barrier()
>>>
>>>    # will likely segfault, as buf is private because we assigned to it.
>>>    # It's only valid in thread 0
>>>    use buf[...]
>>>
>>> So basically you'd have to do something like (&buf)[0][...], which
>>> frankly looks pretty weird. However I do think such cases are rather
>>> uncommon.
>>
>> True. Perhaps this could be declared via "with nogil,
>> parallel.parallel(), parallel.shared(buf)" or something like that.
>
> That looks elegant enough.

Likewise, I think something like parallel.private(buf) would also be
really nice for arrays, especially if we also allow arrays with
runtime sizes (behind the scenes we could malloc and free). I think
those cases are much more common than parallel.shared().

>> - Robert