[Cython] cython.parallel tasks, single, master, critical, barriers

mark florisson markflorisson88 at gmail.com
Fri Oct 14 22:07:07 CEST 2011


On 14 October 2011 19:31, Robert Bradshaw <robertwb at math.washington.edu> wrote:
> On Wed, Oct 12, 2011 at 7:55 AM, mark florisson
> <markflorisson88 at gmail.com> wrote:
>>>> I ultimately feel things like that are more important than 100% coverage of
>>>> the OpenMP standard. Of course, OpenMP is a lot lower-hanging fruit.
>>>
>>> +1. Prange handles the (coarse-grained) SIMD case nicely, and a
>>> task/futures model based on closures would, I think, flesh this out to
>>> the next level of generality (and complexity).
>>
>> Futures are definitely nice. I suppose I really like "inline
>> futures", i.e. OpenMP tasks. I realize that futures may look more
>> pythonic. However, as mentioned previously, I also see issues with
>> that. When you submit a task you expect a future object back, which
>> you might want to pass around. But we don't have the GIL for that. I
>> personally feel that futures are something that should be provided by a
>> library (such as concurrent.futures in Python 3.2), and inline tasks
>> by a language. Futures also mean I have to write an entire function or
>> closure for perhaps only a few lines of code.
>>
>> I might also want to submit other functions that are not closures, or
>> I might want to reuse the same closures both for tasks and for
>> something else. And what if my tasks contain more parallel constructs?
>> e.g. what if I have a task closure that I return from my function that
>> generates more tasks itself? Would you just execute them sequentially
>> outside of the parallel construct, or would you simply disallow that?
>> Also, do you restrict future "objects" to only the parallel section?
>>
>> Another problem is that you can only wait on tasks of your direct
>> children. So what if I get access to my parent's future object
>> (assuming you allow tasks to generate tasks), and then want the result
>> of my parent?
>> Or what if I store these future objects in an array or list and access
>> them arbitrarily? You will only know at runtime which task to wait on,
>> and OpenMP only has a static, lexical taskwait.
>>
>> I suppose my point is that without either a drastic rewrite (e.g., using
>> pthreads instead of OpenMP) or quite a few constraints, I am unsure
>> how futures would work here. Perhaps you guys have some concrete
>> syntax and semantics proposals?
>
> It feels to me that OpenMP tasks take a different model of parallelism
> and try to force it into the OpenMP model/constraints, so it'd
> be even more difficult to fit them into a nice pythonic interface.
> Perhaps to make progress on this front we need to have a concrete
> example to look at. I'm also wondering if the standard threading
> module (perhaps with overlay support) used with nogil functions would
> be sufficient--locking is required for handling the queues, etc. so
> the fact that the GIL is involved is not a big deal. It is possible
> that this won't scale down to very small work units, but the overhead
> should be minimal once your work unit is of sufficient size (which is
> probably quite small), and it's already implemented and well
> documented/used.

It's all definitely possible with normal threads, but what you
lose is convenience and conciseness. For big problems the programmer
might muster the courage and effort to implement it, but typically you
will just stick to a serial version. This is really where OpenMP is
powerful: you can take a simple sequential piece of code and make it
parallel with minimal effort, without having to restructure,
rethink and rewrite your algorithms.
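
For example, parallelizing a simple reduction is a one-line change. A
minimal sketch using the existing prange syntax (assuming data is a C
double array of length n):

    from cython.parallel import prange

    cdef double total = 0
    cdef Py_ssize_t i

    for i in prange(n, nogil=True):
        total += data[i]  # += on total is compiled as a reduction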

Something like concurrent.futures is definitely nice, but most people
cannot afford to mandate Python 3.2 for their users.
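
For reference, the futures model under discussion looks roughly like
this with the stdlib (Python >= 3.2):

    from concurrent.futures import ThreadPoolExecutor

    def work(x):
        return x * x

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(work, x) for x in range(100)]
        results = [f.result() for f in futures]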

The classic examples I can think of for tasks are

1) independent code sections, i.e. two or more pieces of code that
don't depend on each other and that you want to execute in parallel
2) traversal of some kind of custom data structure, like a tree or a linked list
3) some other kind of producer/consumer model (sketched further below)

e.g. using with task syntax:

cdef postorder_traverse(tree *t): # bullets 1) and 2)
    with task:
        postorder_traverse(t.left)
    with task:
        postorder_traverse(t.right)

    taskwait() # wait until we have traversed our subtrees
    use(t.data)

cdef list_traverse(linkedlist *L): # bullet 2)
    with nogil, parallel():
        if threadid() == 0:
            while L:
                with task:
                    do_something(L.data)
                L = L.next

In the latter case we don't need a taskwait, as we don't care about any
particular order. Only one thread generates the tasks, while the others
just hit the barrier and pick up the tasks they can execute.
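
For bullet 3), a sketch in the same proposed syntax (queue_empty,
queue_pop and process are hypothetical helpers):

cdef consume_queue(queue *q): # bullet 3)
    with nogil, parallel():
        if threadid() == 0:
            while not queue_empty(q):
                item = queue_pop(q) # one thread dequeues the work...
                with task:
                    process(item) # ...the other threads execute the tasks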

The good thing is that the OpenMP runtime can decide, at the task
generation point (not only at taskwait or barrier points!), to
stop generating more tasks and start executing them. So you won't
exhaust memory if you generate lots of tasks.

> As for critical and barrier, the notion of a critical block as a with
> statement is very useful. Creating/naming locks (rather than being
> implicit on the file/line number) is more powerful, but is a larger
> burden on the user and more difficult to support with the OpenMP
> backend.

Actually, as I mentioned before, critical sections do not depend at all
on their line or file number. All they depend on is their implicit
or explicit name (the name is implicit when you simply omit it, so all
unnamed critical sections exclude each other).
Indeed, supporting dynamic creation of locks and allowing them to
be passed around arbitrarily would be hard (and likely not worth the
effort). Naming them is trivial though, which might not be incredibly
pythonic but is very convenient, easy and readable.
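
A hypothetical syntax for named critical sections might read as follows
(nothing is implemented, and update_counters/write_results are made-up
stand-ins):

with nogil, parallel():
    with critical: # all unnamed critical sections exclude each other
        update_counters()
    with critical("output"): # sections named "output" exclude only each other
        write_results()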

> barrier, if supported, should be a function call not a
> context. Not as critical as with the tasks case, but a good example to
> see how it flows would be useful here as well.

I agree, it really doesn't have any associated code, and trying to
associate code with it is likely more confusing than meaningful. It
was just an idea.
Often you can rely on implicit barriers from e.g. prange, but not
always. I can't think of any real-world example, but you usually need
it to ensure that everyone gets a sane view of some shared data, e.g.

with nogil, parallel():
    array[threadid()] = func(threadid())
    barrier()
    use array[(threadid() + 1) % omp_num_threads()] # access a neighbour's data

This is a rather contrived example, but (see below) it would be
especially useful if you use single/master/once/first to set some
shared data everyone will operate on (for instance in a prange). To
ensure the data is sane before you use it, you need the barrier
to 1) ensure the data has been written and 2) ensure the data has been
flushed.

Basically, you'll always know when you need a barrier, but it's pretty
hard to come up with a real-world example for it when you have to :)

> As for single, I see doing this manually does require boilerplate
> locking, so what about
>
> if cython.parallel.once():  # will return True once for a thread group.
>    ...
>
> we could implement this via our own locking/checking/flushing to allow
> it to occur in arbitrary expressions, e.g.
>
> special_worker = cython.parallel.once()
> if special_worker:
>   ...
> [common code]
> if special_worker:   # single wouldn't work here
>   ...
>

That looks OK. I've actually been thinking that if we have barriers we
don't really need is_master(), once() or single() at all. We
already have threadid(), and you usually don't care which thread gets
there first; you only care about doing it once. So one could just
write

if parallel.threadid() == 0:
    ...

parallel.barrier() # if required

It might also be convenient to declare variables explicitly shared
here, e.g. this code will not work:

cdef int *buf

with nogil, parallel.parallel():
    if parallel.threadid() == 0:
        buf = ...

    parallel.barrier()

    # this will likely segfault, as buf is private because we assigned
    # to it; it's only valid in thread 0
    use buf[...]

So basically you'd have to do something like (&buf)[0][...], which
frankly looks pretty weird. However, I do think such cases are rather
uncommon.
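
To spell out that workaround (just one reading of it; get_buffer is a
hypothetical helper, and the assumption is that assigning through &buf
avoids the compiler's privatization inference):

cdef int *buf

with nogil, parallel.parallel():
    if parallel.threadid() == 0:
        (&buf)[0] = get_buffer() # no direct assignment to buf,
                                 # so buf is not inferred private

    parallel.barrier()

    use buf[...] # buf stayed shared, valid in every thread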

> - Robert

