[Cython] cython.parallel tasks, single, master, critical, barriers

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Oct 20 10:42:15 CEST 2011

Meta: I've been meaning to respond to this thread, but can't find the 
time. What's the time-frame for implementing this? If it's hypothetical 
at the moment and just is a question of getting things spec-ed, one 
could perhaps look at discussing it at the next Cython workshop, or 
perhaps a Skype call with the three of us at some point...

Regarding the tasks: One of my biggest problems with Python is the lack 
of an elegant syntax for anonymous functions. But since Python has that 
problem, I feel it is not necessarily something we should fix (by using
the with statements to create tasks). Sometimes Pythonic-ness is more 
important than elegance (for Cython).

In general I'm happy as long as there's a chance of getting things to 
work in pure Python mode as well (with serial execution). So if, e.g., 
with statements creating tasks have the same effect when running the 
same code (serially) in pure Python, I'm less opposed (didn't look at it 
in detail).

Dag Sverre

On 10/19/2011 09:53 PM, mark florisson wrote:
> On 19 October 2011 06:01, Robert Bradshaw<robertwb at math.washington.edu>  wrote:
>> On Fri, Oct 14, 2011 at 1:07 PM, mark florisson
>> <markflorisson88 at gmail.com>  wrote:
>>> On 14 October 2011 19:31, Robert Bradshaw<robertwb at math.washington.edu>  wrote:
>>>> On Wed, Oct 12, 2011 at 7:55 AM, mark florisson
>>>> <markflorisson88 at gmail.com>  wrote:
>>>>>>> I ultimately feel things like that is more important than 100% coverage of
>>>>>>> the OpenMP standard. Of course, OpenMP is a lot lower-hanging fruit.
>>>>>> +1 Prange handles the (coarse-grained) SIMD case nicely, and a
>>>>>> task/futures model based on closures would I think flesh this out to
>>>>>> the next level of generality (and complexity).
>>>>> Futures are definitely nice. I suppose I really like "inline
>>>>> futures", i.e. openmp tasks. I realize that futures may look more
>>>>> pythonic. However, as mentioned previously, I also see issues with
>>>>> that. When you submit a task then you expect a future object, which
>>>>> you might want to pass around. But we don't have the GIL for that. I
>>>>> personally feel that futures is something that should be done by a
>>>>> library (such as concurrent.futures in python 3.2), and inline tasks
>>>>> by a language. It also means I have to write an entire function or
>>>>> closure for perhaps only a few lines of code.
>>>>> I might also want to submit other functions that are not closures, or
>>>>> I might want to reuse my closures that are used for tasks and for
>>>>> something else. So what if my tasks contain more parallel constructs?
>>>>> e.g. what if I have a task closure that I return from my function that
>>>>> generates more tasks itself? Would you just execute them sequentially
>>>>> outside of the parallel construct, or would you simply disallow that?
>>>>> Also, do you restrict future "objects" to only the parallel section?
>>>>> Another problem is that you can only wait on tasks of your direct
>>>>> children. So what if I get access to my parent's future object
>>>>> (assuming you allow tasks to generate tasks), and then want the result
>>>>> of my parent?
>>>>> Or what if I store these future objects in an array or list and access
>>>>> them arbitrarily? You will only know at runtime which task to wait on,
>>>>> and openmp only has a static, lexical taskwait.
>>>>> I suppose my point is that without either a drastic rewrite (e.g., use
>>>>> pthreads instead of openmp) or quite a few constraints, I am unsure
>>>>> how futures would work here. Perhaps you guys have some concrete
>>>>> syntax and semantics proposals?
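For reference, the library-level futures model Mark alludes to (concurrent.futures, in the stdlib since Python 3.2) looks roughly like this in pure Python; the function name is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a future object you can pass around freely
    futures = [pool.submit(square, i) for i in range(8)]
    results = [f.result() for f in futures]   # block until each task is done
```

This is exactly the "future object you might want to pass around" model: the futures can be stored in lists and waited on in arbitrary order, which is what plain OpenMP tasks (with their lexical taskwait) cannot express.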
>>>> It feels to me that OpenMP tasks took a different model of parallelism
>>>> and tried to force them into the OpenMP model/constraints, and so it'd
>>>> be even more difficult to fit them into a nice pythonic interface.
>>>> Perhaps to make progress on this front we need to have a concrete
>>>> example to look at. I'm also wondering if the standard threading
>>>> module (perhaps with overlay support) used with nogil functions would
>>>> be sufficient--locking is required for handling the queues, etc. so
>>>> the fact that the GIL is involved is not a big deal. It is possible
>>>> that this won't scale to as small of work units, but the overhead
>>>> should be minimal once your work unit is a sufficient size (which is
>>>> probably quite small) and it's already implemented and well
>>>> documented/used.
>>> It's all definitely possible with normal threads, but the thing you
>>> lose is convenience and conciseness. For big problems the programmer
>>> might summon up the courage and effort to implement it, but typically you
>>> will just stick to a serial version. This is really where OpenMP is
>>> powerful, you can take a simple sequential piece of code and make it
>>> parallel with minimal effort and without having to restructure,
>>> rethink and rewrite your algorithms.
>> That is a very good point.
>>> Something like concurrent.futures is definitely nice, but most people
>>> cannot afford to mandate python 3.2 for their users.
>>> The most classical examples I can think of for tasks are
>>> 1) independent code sections, i.e. two or more pieces of code that
>>> don't depend on each other which you want to execute in parallel
>>> 2) traversal of some kind of custom data structure, like a tree or a linked list
>>> 3) some kind of other producer/consumer model
>>> e.g. using with task syntax:
>>> cdef postorder_traverse(tree *t): # bullet 1) and 2)
>>>     with task:
>>>         traverse(t.left)
>>>     with task:
>>>         traverse(t.right)
>>>     taskwait() # wait until we traversed our subtrees
>>>     use(t.data)
>> Is there an implicit parallel block here? Perhaps in the caller?
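In pure Python (serial-fallback) terms, the task pattern in postorder_traverse above could be mimicked with plain threads; the (data, left, right) tuple tree here is a hypothetical stand-in for the cdef tree struct:

```python
import threading

def postorder_traverse(t, out):
    """Visit both subtrees in parallel, then the node itself.
    `t` is a hypothetical (data, left, right) tuple; None is an empty tree."""
    if t is None:
        return
    data, left, right = t
    # "with task:" analogue: one thread per subtree
    subtasks = [threading.Thread(target=postorder_traverse, args=(child, out))
                for child in (left, right)]
    for th in subtasks:
        th.start()
    for th in subtasks:
        th.join()        # taskwait(): wait until we traversed our subtrees
    out.append(data)     # use(t.data); list.append is atomic under the GIL

result = []
postorder_traverse((1, (2, None, None), (3, None, None)), result)
```

A thread per task is of course far heavier than an OpenMP task, which is part of Mark's point about convenience versus cost.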
>>> cdef list_traverse(linkedlist *L): # bullet 2)
>>>     with nogil, parallel():
>>>         if threadid() == 0:
>>>             while L.next:
>>>                 with task:
>>>                     do_something(L.data)
>>> In the latter case we don't need a taskwait as we don't care about any
>>> particular order. Only one thread generates the tasks where the others
>>> just hit the barrier and see the tasks they can execute.
>> I guess it's the fact that Python doesn't have a nice syntax for
>> anonymous functions or blocks that makes this syntax more appealing
>> than an explicit closure.
>> Perhaps if we came up with a more pythonic/natural name which would
>> make the intent clear. Makes me want to do something like
>> pool = ThreadPool(10)
>> for item in L:
>>     with pool:
>>         process(item)
>> but then you get into issues of passing the pool around. OpenMP has
>> the implicit pool of the nesting parallel block, so "with one thread"
>> or "with cython.parallel.pool" or something like that might be more
>> readable.
>>> The good thing is that the OpenMP runtime can decide at task
>>> generation points (not only at taskwait or barrier points!) to
>>> stop generating more tasks and start executing them. So you won't
>>> exhaust memory if you generate lots of tasks.
>> Often threadpools have queues that block when their buffer gets full
>> to achieve the same goal.
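The blocking-queue behaviour Robert describes is what the stdlib queue module gives you: put() blocks once the buffer is full, throttling the producer. A minimal sketch:

```python
import queue
import threading

tasks = queue.Queue(maxsize=4)   # put() blocks once 4 items are pending
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:              # sentinel: shut this worker down
            break
        with results_lock:
            results.append(item * 2)  # stand-in for real work

workers = [threading.Thread(target=worker) for _ in range(2)]
for t in workers:
    t.start()
for i in range(10):
    tasks.put(i)                      # blocks when the buffer is full
for _ in workers:
    tasks.put(None)                   # one sentinel per worker
for t in workers:
    t.join()
```

This achieves the same bounded-memory goal as the OpenMP runtime cutting off task generation, just at the library level.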
>>>> As for critical and barrier, the notion of a critical block as a with
>>>> statement is very useful. Creating/naming locks (rather than being
>>>> implicit on the file/line number) is more powerful, but is a larger
>>>> burden on the user and more difficult to support with the OpenMP
>>>> backend.
>>> Actually, as I mentioned before, critical sections do not at all
>>> depend on their line or file number. All they depend on is their implicit
>>> or explicit name (the name is implicit when you simply omit it, so all
>>> unnamed critical sections exclude each other).
>> Ah, yes. In this case "with cython.parallel.lock([optional name])"
>> could be obvious enough.
> We could also support atomic updates. We could either rewrite
> parallel.lock() blocks to atomics if all statements use inplace
> operators, but that might actually not be safe as the exclusion might
> be used for the rhs expressions. So I think you'd want a
> parallel.atomic() directive or some such.
> Alternatively, if you support parallel.shared(), you could specify
> that inplace operators on any such variables would actually be atomic
> updates, even if you use the operators on the elements of the shared
> variable. e.g.
> cdef int array1[N]
> cdef int array2[N]
> with parallel(), shared(array1):
>      # atomic update
>      array1[i] += ...
>      # not an atomic update, as it is "implicitly shared"
>      array2[i] += ...
> I'm not sure if that's more confusing than enlightening though.
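The semantics a parallel.atomic() directive would need can be sketched in pure Python with one lock guarding the load-modify-store; without it, concurrent `+=` on shared data races:

```python
import threading

counter = [0]                  # stands in for a shared array element
atomic = threading.Lock()      # emulates the proposed parallel.atomic()

def add_many(n):
    for _ in range(n):
        with atomic:           # the load-modify-store happens as one unit
            counter[0] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

OpenMP's `#pragma omp atomic` compiles this down to a hardware atomic rather than a lock, but the observable guarantee is the same.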
>>> Indeed, supporting creation of locks dynamically and allowing them to
>>> be passed around arbitrarily would be hard (and likely not worth the
>>> effort). Naming them is trivial though, which might not be incredibly
>>> pythonic but is very convenient, easy and readable.
>> You can view this as a lookup by name, not a lock creation. Not
>> allowing them to be used outside of a with clause is a reasonable
>> restriction, and does not preclude a (possibly very distant) extension
>> to being able to pass them around.
>>>> Barrier, if supported, should be a function call, not a
>>>> context. Not as critical as with the tasks case, but a good example to
>>>> see how it flows would be useful here as well.
>>> I agree, it really doesn't have any associated code and trying to
>>> associate code with it is likely more confusing than meaningful. It
>>> was just an idea.
>>> Often you can rely on implicit barriers from e.g. prange, but not
>>> always. I can't think of any real-world example, but you usually need
>>> it to ensure that everyone gets a sane view on some shared data, e.g.
>>> with nogil, parallel():
>>>     array[threadid()] = func(threadid())
>>>     barrier()
>>> use array[(threadid() + 1) % omp_num_threads()] # access data of
>>> some neighbour
>>> This is a rather contrived example, but (see below) it would be
>>> especially useful if you use single/master/once/first that sets some
>>> shared data everyone will operate on (for instance in a prange). To
>>> ensure the data is sane before you use it, you have to put the barrier
>>> to 1) ensure the data has been written and 2) that the data has been
>>> flushed.
>>> Basically, you'll always know when you need a barrier, but it's pretty
>>> hard to come up with a real-world example for it when you have to :)
>> Yes, I think barriers are explanatory enough.
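threading.Barrier (stdlib since Python 3.2) provides both guarantees from the contrived example above in pure Python: every thread's write completes and becomes visible before any thread proceeds:

```python
import threading

N = 4
barrier = threading.Barrier(N)
array = [0] * N
seen = [None] * N

def worker(tid):
    array[tid] = tid * tid            # each thread fills its own slot
    barrier.wait()                    # all writes are visible past this point
    seen[tid] = array[(tid + 1) % N]  # safely read a neighbour's slot

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```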
>>>> As for single, I see doing this manually does require boilerplate
>>>> locking, so what about
>>>> if cython.parallel.once():  # will return True once for a thread group.
>>>>     ...
>>>> we could implement this via our own locking/checking/flushing to allow
>>>> it to occur in arbitrary expressions, e.g.
>>>> special_worker = cython.parallel.once()
>>>> if special_worker:
>>>>    ...
>>>> [common code]
>>>> if special_worker:   # single wouldn't work here
>>>>    ...
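The "own locking/checking/flushing" Robert proposes for once() amounts to an atomic check-and-set; a pure-Python sketch (the Once class is hypothetical, not an existing API):

```python
import threading

class Once:
    """Hypothetical cython.parallel.once() analogue: returns True for
    exactly one caller per group, False for everyone else."""
    def __init__(self):
        self._lock = threading.Lock()
        self._taken = False

    def __call__(self):
        with self._lock:       # check-and-set must be one atomic step
            if self._taken:
                return False
            self._taken = True
            return True

once = Once()
first = once()
second = once()
```

Because the result is an ordinary boolean, it supports the `special_worker = once()` pattern above, which OpenMP's lexical `single` construct cannot.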
>>> That looks OK. I've actually been thinking that if we have barriers we
>>> don't really need is_master(), once() or single() or anything. We
>>> already have threadid() and you usually don't care what thread gets
>>> there first, you only care about doing it once. So one could just
>>> write
>>> if parallel.threadid() == 0:
>>>     ...
>>> parallel.barrier() # if required
>> Perhaps you want the first free thread to take it up to minimize idle
>> threads. I agree if parallel.threadid() == 0 is a synonym for
>> is_master(), so probably not needed. However, what are the OpenMP
>> semantics of
>> cdef f():
>>     with parallel():
>>         g()
>>         g()
>> cdef g():
>>     with single():
>>         ... # executed once, right?
>>     with task:
>>         ... # executed twice, right?
>>> It might also be convenient to declare variables explicitly shared
>>> here, e.g. this code will not work:
>>> cdef int *buf
>>> with nogil, parallel.parallel():
>>>     if parallel.threadid() == 0:
>>>         buf = ...
>>>     parallel.barrier()
>>> # will likely segfault, as buf is private because we assigned
>>> to it. It's only valid in thread 0
>>>     use buf[...]
>>> So basically you'd have to do something like (&buf)[0][...], which
>>> frankly looks pretty weird. However I do think such cases are rather
>>> uncommon.
>> True. Perhaps this could be declared via "with nogil,
>> parallel.parallel(), parallel.shared(buf)" or something like that.
>> - Robert
>> _______________________________________________
>> cython-devel mailing list
>> cython-devel at python.org
>> http://mail.python.org/mailman/listinfo/cython-devel
