[Cython] CEP: prange for parallel loops

Tue Apr 5 07:05:54 CEST 2011

On Mon, Apr 4, 2011 at 6:04 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Dag Sverre Seljebotn, 04.04.2011 13:53:
>>
>> On 04/04/2011 01:23 PM, Stefan Behnel wrote:
>>>
>>> Dag Sverre Seljebotn, 04.04.2011 12:17:
>>>>
>>>> CEP up at http://wiki.cython.org/enhancements/prange
>>>
>>> """
>>> Variable handling
>>>
>>> Rather than explicit declaration of shared/private variables we rely on
>>> conventions:
>>>
>>> * Thread-shared: Variables that are only read and not written in the loop
>>> body are shared across threads. Variables that are only used in the else
>>> block are considered shared as well.
>>>
>>> * Thread-private: Variables that are assigned to in the loop body are
>>> thread-private. Obviously, the iteration counter is thread-private as
>>> well.
>>>
>>> * Reduction: Variables that only used on the LHS of an inplace operator,
>>> such as s above, are marked as targets for reduction. If the variable is
>>> also used in other ways (LHS of assignment or in an expression) it does
>>> instead turn into a thread-private variable. Note: This means that if
>>> one, e.g., inserts printf(... s) above, s is turned into a thread-local
>>> variable. OTOH, there is simply no way to correctly emulate the effect
>>> printf(... s) would have in a sequential loop, so such code must be
>>> discouraged anyway.
>>> """
>>>
>>> What about simply (ab-)using Python semantics and creating a new inner
>>> scope for the prange loop body? That would basically make the loop behave
>>> like a closure function, but with the looping header at the 'right' place
>>> rather than after the closure.
>>
>> I'm not quite sure what the concrete changes to the CEP this would lead to
>> (assuming you mean this as a proposal for alternative semantics, and not
>> an
>> implementation detail).
>
> What I would like to avoid is having to tell users "and now for something
> completely different". It looks like a loop, but then there's a whole page
> of new semantics for it. And this also cannot be used in plain Python code
> due to the differing scoping behaviour.

The same could be said of OpenMP--it looks exactly like a loop except
for a couple of pragmas.

The proposed (as I'm reading the CEP now) semantics of what's shared
and first/last private and reduction would give it the semantics of a
normal, sequential loop (and if your final result changes based on how
many threads were involved then you've got incorrect code). Perhaps
reading of the reduction variable could be fine (though obviously
ill-defined, suitable only for debugging).

>> How would we treat reduction variables? They need to be supported, and
>> there's nothing in Python semantics to support reduction variables, they
>> are a rather special case everywhere. I suppose keeping the reduction
>> clause above, or use the "nonlocal" keyword in the loop body...
>
> That's what I thought, yes. It looks unexpected, sure. That's the clear
> advantage of using inner functions, which do not add anything new at all.
> But if we want to add something that looks more like a loop, we should at
> least make it behave like something that's easy to explain.
>
> Sorry for not taking the opportunity to articulate my scepticism in the
> workshop discussion. Skipping through the CEP now, I think this feature adds
> quite some complexity to the language, and I'm not sure it's worth that when
> compared to the existing closures. The equivalent closure+decorator syntax
> is certainly easier to explain, and could translate into exactly the same
> code. But with the clear advantage that the scope of local, nonlocal and
> thread-configuring variables is immediately obvious.
>
> Basically, your example would become
>
> def f(np.ndarray[double] x, double alpha):
>    cdef double s = 0
>
>    with cython.nogil:
>        @cython.run_parallel_for_loop( range(x.shape[0]) )
>        cdef threaded_loop(i):    # 'nogil' is inherited
>            cdef double tmp = alpha * i
>            nonlocal s
>            s += x[i] * tmp
>        s += alpha * (x.shape[0] - 1)
>    return s
>
> We likely agree that this is not beautiful. It's also harder to implement
> than a "simple" for-in-prange loop. But I find it at least easier to explain
> and semantically 'obvious'. And it would allow us to write a pure mode
> implementation for this based on the threading module.

I'm not opposed to having something like this, it's a whole lot of
code and extra refactoring for the basic usecase. I think a nice,
clean syntax is worthwhile and requires at lest some level of language
support. In some ways it's like buffer support--what goes on under the
hood does take some explaining, but most of the time it works as
expected (i.e. as if you hadn't declared the type), just faster. The
inner workings of prange may be a bit magical, but the intent is not,
and the latter is what users care about.

- Robert