I've done a pretty major revision to the prange CEP, bringing in a lot of the feedback.

Thread-private variables are now split into two cases:

i) The safe cases, which really require very little technical knowledge -> automatically inferred

ii) As an advanced feature, unsafe cases that require some knowledge of threading -> must be explicitly declared

I think this split simplifies things a great deal.

I'm rather excited over this now; this could turn out to be a really user-friendly and safe feature that would not only allow us to support OpenMP-like threading, but be more convenient to use in a range of common cases.

http://wiki.cython.org/enhancements/prange

Dag Sverre
On 04/05/2011 10:29 PM, Dag Sverre Seljebotn wrote:
As a digression: threadlocal(int) variables could also be supported elsewhere as syntactic sugar for the pythread.h Thread Local Storage API, which would give fast TLS for any kind of threads (e.g., when using the threading module).

Dag Sverre

(Sorry about the previous HTML mail.)
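A rough illustration of what such syntactic sugar could lower to, written with Python's threading.local (the names set_counter/get_counter are purely illustrative, not part of the CEP):

```python
import threading

# Hypothetical lowering of a `threadlocal(int)` declaration: each thread
# that touches the variable sees its own independent copy, regardless of
# how the thread was created (threading module, OpenMP team, ...).
_tls = threading.local()

def set_counter(value):
    _tls.counter = value  # stores into this thread's slot only

def get_counter():
    return getattr(_tls, "counter", 0)  # default 0 if never assigned here

def worker(results, index, value):
    set_counter(value)
    # other threads' assignments are invisible to this thread
    results[index] = get_counter()

results = [None, None]
threads = [threading.Thread(target=worker, args=(results, i, i * 10))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread only ever observes its own slot, so no locking is needed for the counter itself.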
On 5 April 2011 22:29, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
I think this split simplifies things a great deal.
Can't we obsolete the declaration entirely by assigning to variables that need to have firstprivate behaviour inside the with parallel block? Basically in the same way the scratch space is used. The only problem with that is that it won't be lastprivate, so the value will be undefined after the parallel block (but not after the worksharing loop).

cdef int myvariable

with nogil, parallel:
    myvariable = 2
    for i in prange(...):
        use myvariable
        maybe assign to myvariable

    # myvariable is well-defined here

# myvariable is not well-defined here

If you still desperately want lastprivate behaviour you can simply assign myvariable to another variable in the loop body.
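The firstprivate-by-assignment idea can be mimicked with plain Python threads (a hand-rolled simulation of the proposed semantics, not actual prange; parallel_block and its chunking are invented for this sketch):

```python
import threading

def parallel_block(num_threads, chunks):
    # Simulate `with nogil, parallel:` -- each thread runs the block body once.
    results = {}

    def body(tid, my_chunk):
        # Assignment at the top of the parallel block: every thread gets
        # its own initialized copy, i.e. firstprivate-like behaviour.
        myvariable = 2
        total = 0
        for i in my_chunk:           # stand-in for the prange loop
            total += i * myvariable  # "use myvariable"
        results[tid] = total         # well-defined inside the block

    threads = [threading.Thread(target=body, args=(tid, chunk))
               for tid, chunk in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # After the parallel block, myvariable itself is not well-defined
    # (no lastprivate); only what each thread explicitly wrote survives.
    return results

out = parallel_block(2, [range(0, 5), range(5, 10)])
```

Note how the only way to get a value out is to store it explicitly, which is exactly the missing-lastprivate caveat above.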
I'm rather excited over this now; this could turn out to be a really user-friendly and safe feature that would not only allow us to support OpenMP-like threading, but be more convenient to use in a range of common cases.
http://wiki.cython.org/enhancements/prange
Dag Sverre
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
On 04/11/2011 10:45 AM, mark florisson wrote:
On 5 April 2011 22:29, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Can't we obsolete the declaration entirely by assigning to variables that need to have firstprivate behaviour inside the with parallel block? Basically in the same way the scratch space is used. The only problem with that is that it won't be lastprivate, so the value will be undefined after the parallel block (but not after the worksharing loop).
cdef int myvariable
with nogil, parallel:
    myvariable = 2
    for i in prange(...):
        use myvariable
        maybe assign to myvariable

    # myvariable is well-defined here

# myvariable is not well-defined here
If you still desperately want lastprivate behaviour you can simply assign myvariable to another variable in the loop body.
I don't care about lastprivate, I don't think that is an issue, as you say.

My problem with this is that it means going into an area where possibly tricky things are implicit rather than explicit. I also see this as a rather special case that will be seldom used, and implicit behaviour is more difficult to justify because of that.

(The other instance of thread-local variables I feel is still explicit: You use prange instead of range, which means that you declare that values created in one iteration do not leak to the next iteration. The rest is just optimization from there.)

As Robert said in his recent talk: A lot of languages are easy to write. The advantage of Python is that it is easy to *read*. That's what I feel is wrong with the proposal above: An assignment to a variable changes the semantics of it. Granted, it happens in a way so that it will almost always be correct, but I feel that reading the code, I'd spend some extra cycles to go "ah, so this variable is thread-local and therefore its values survive across a loop iteration". If I even knew about the feature in the first place. In seeing "threadprivate" spelled out, it is either obvious what it means, or obvious that I should look up the docs.

There's *a lot* of things that can be made implicit in a programming language; Python/Cython simply usually leans towards the explicit side.

Oh, and we may want to support writable shared variables (and flush) eventually too, and the above doesn't easily differentiate there?

That's just my opinion, I'm happy to be overruled here.

Dag Sverre
On 11 April 2011 11:10, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 04/11/2011 10:45 AM, mark florisson wrote:
I don't care about lastprivate, I don't think that is an issue, as you say.
My problem with this is that it means going into an area where possibly tricky things are implicit rather than explicit. I also see this as a rather special case that will be seldomly used, and implicit behaviour is more difficult to justify because of that.
Indeed, I actually considered whether we should support firstprivate at all, as it's really about being both firstprivate and lastprivate. Without any declaration, you can have firstprivate or lastprivate, but not both :) So I agree that supporting such a (probably) uncommon case is better left explicit. On the other hand, it seems silly to have explicit support for such a weird case.
Oh, and we may want to support writable shared variables (and flush) eventually too, and the above doesn't easily differentiate there?
Right, everything is implicit. So I guess it'll be good to introduce it anyway as you say, so we can later declare stuff shared with similar syntax. I suppose that's the point where I'm convinced.
That's just my opinion, I'm happy to be overruled here.
Dag Sverre
On 04/11/2011 11:41 AM, mark florisson wrote:
Indeed, I actually considered if we should support firstprivate at all, as it's really about "being firstprivate and lastprivate". Without any declaration, you can have firstprivate or lastprivate, but not both :) So I agree that supporting such a (probably) uncommon case is better left explicit. On the other hand it seems silly to have support for such a weird case.
Well, I actually need to do the per-thread cache thing I described in the CEP in my own codes, so it's not *that* special; it'd be nice to support it.

OTOH I *could* work around it by having an array of scalars:

cdef int[:] old_ell = int[:numthreads]()
...
if old_ell[threadid()] != ell:
    ...

So I guess it's at least at the bottom of the list of priorities in that CEP.

Dag Sverre
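The workaround translates straightforwardly to plain Python: keep one slot per thread and index it by a thread id. NUM_THREADS and the tid argument are hypothetical stand-ins for the CEP's numthreads()/threadid(); iteration is an invented helper:

```python
NUM_THREADS = 4  # stand-in for numthreads()

# One scalar slot per thread. The slot survives across loop iterations
# because it lives outside the loop body, unlike an implicitly
# thread-private variable, which is reset each iteration.
old_ell = [-1] * NUM_THREADS

def iteration(tid, ell, log):
    # tid is a stand-in for threadid()-indexed access inside the loop body
    if old_ell[tid] != ell:
        log.append((tid, ell))  # the expensive recomputation happens here
        old_ell[tid] = ell

log = []
# Thread 0 sees ell values 1, 1, 2 -- only two "cache misses":
for ell in (1, 1, 2):
    iteration(0, ell, log)
```

Because each thread only ever touches its own slot, no synchronization is needed on old_ell.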
On 11 April 2011 12:08, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Well, I actually need to do the per-thread cache thing I described in the CEP in my own codes, so it's not *that* special; it'd be nice to support it.
You need 'old_ell' and 'alpha' after the loop?
Dag Sverre
On 04/11/2011 12:14 PM, mark florisson wrote:
You need 'old_ell' and 'alpha' after the loop?
No...but I need the values to not be blanked out at the beginning of each loop iteration!

Note that in the CEP, the implicitly thread-local variables are *not available* before the first assignment in the loop. That is, code such as this is NOT allowed:

cdef double x
...
for i in prange(10):
    print x
    x = f(x)

We raise a compiler error in such cases if we can: The code above violates the contract that the order of execution of loop bodies should not matter.

In cases where we can't raise an error (because we didn't bother, or because it is not possible without a proof), we still initialize the variables to invalid values (NaN for double) at the beginning of the for-loop, just to be sure the contract is satisfied.

This was added to answer Stefan's objection to new types of implicit scopes (and I agree with his concern).

Dag Sverre
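The NaN-poisoning fallback can be sketched in plain Python: before each iteration body runs, reset the private variable to NaN so that a read-before-assignment is loudly wrong rather than silently order-dependent. This illustrates the contract only; run_iteration is an invented helper, not the compiler's actual machinery:

```python
import math

def f(v):
    return v + 1.0

def run_iteration(body):
    # Each prange iteration starts with the private variable poisoned:
    x = float("nan")
    return body(x)

# Violating body: reads x before assigning it -> NaN propagates through f.
bad = run_iteration(lambda x: f(x))

# Conforming body: assigns x before use, so the result is well-defined.
good = run_iteration(lambda x: f(0.0))
```

NaN propagates through arithmetic, so any value computed from an unassigned private variable is visibly invalid rather than dependent on which thread ran the previous iteration.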
On 04/11/2011 01:02 PM, Dag Sverre Seljebotn wrote:
No...but I need the values to not be blanked out at the beginning of each loop iteration!
Sorry, re-reading your email I now realize that I may have misunderstood you. Anyway, no, I don't need lastprivate at all anywhere.

Dag Sverre
On 11 April 2011 13:03, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Sorry, I now realize that re-reading your email I may have misunderstood you. Anyway, no, I don't need lastprivate at all anywhere.
Right, so basically you can rewrite your example by introducing the parallel block (which doesn't add an indentation level, as you're already using nogil) and assigning to the variables that need to be firstprivate there. The only thing you miss out on is lastprivate behaviour.

So basically, the question is: do we want explicit syntax for such a rare case (firstprivate + lastprivate)?

I must say, I found your previous argument about future shared declarations persuasive enough to introduce explicit syntax.
On 04/11/2011 01:12 PM, mark florisson wrote:
Right, so basically you can rewrite your example by introducing the parallel block (which doesn't add an indentation level as you're already using nogil) and assigning to your variables that need to be firstprivate there. The only thing you miss out on is lastprivate behaviour. So basically, the question is, do we want explicit syntax for such a rare case (firstprivate + lastprivate)?
OK, we're on the same page here.
I must say, I found your previous argument of future shared declarations persuasive enough to introduce explicit syntax.
OK, let's leave it at this then; we don't have to agree for the same reasons :-)

Dag Sverre
On 5 April 2011 22:29, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
If we want to support cython.parallel.threadsavailable outside of parallel regions (which does not depend on the schedule used for worksharing constructs!), then we have to disable dynamic scheduling.

For instance, if OpenMP sees that some OpenMP threads are already busy, then with dynamic adjustment it dynamically establishes how many threads to use for any parallel region. So if you put omp_get_num_threads() in a parallel region, you have a race when you depend on that result in a subsequent parallel region, because the number of busy OpenMP threads may have changed.

So basically, to make threadsavailable() work outside parallel regions, we'd have to disable dynamic adjustment (omp_set_dynamic(0)). Of course, when OpenMP cannot provide the number of threads desired (because they are bounded by a configurable thread limit, and by the OS), the behaviour is implementation defined. We could just put a warning in the docs for that, and users can check for this in the parallel region using threadsavailable() if it's really important.

Does that sound like a good idea? And should I update the CEP?
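A toy model of the race, in plain Python (DynamicRuntime and team_size are invented for this sketch; they caricature an OpenMP runtime and are not any real API):

```python
class DynamicRuntime:
    """Toy model of an OpenMP runtime, with and without dynamic adjustment."""

    def __init__(self, max_threads, dynamic=True):
        self.max_threads = max_threads
        self.dynamic = dynamic

    def team_size(self, busy):
        # With dynamic adjustment the runtime shrinks the team when other
        # threads are busy; with omp_set_dynamic(0) it always grants the max.
        if self.dynamic:
            return max(1, self.max_threads - busy)
        return self.max_threads

rt = DynamicRuntime(max_threads=8, dynamic=True)
first = rt.team_size(busy=0)   # e.g. a region whose count sizes a buffer
second = rt.team_size(busy=3)  # later region: fewer threads granted!
races = first != second        # the cached count is stale -> race

rt_fixed = DynamicRuntime(max_threads=8, dynamic=False)
stable = rt_fixed.team_size(busy=0) == rt_fixed.team_size(busy=3)
```

The point is only that a team size observed in one region cannot safely be carried into the next unless dynamic adjustment is off.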
On 04/13/2011 09:31 PM, mark florisson wrote:
If we want to support cython.parallel.threadsavailable outside of parallel regions (which does not depend on the schedule used for worksharing constructs!), then we have to disable dynamic scheduling. For instance, if OpenMP sees some OpenMP threads are already busy, then with dynamic scheduling it dynamically establishes how many threads to use for any parallel region. So basically, if you put omp_get_num_threads() in a parallel region, you have a race when you depend on that result in a subsequent parallel region, because the number of busy OpenMP threads may have changed.
Ah, I don't know why I thought there wouldn't be a race condition. I wonder if the whole threadsavailable() idea should just be ditched and we should think of something else. It's not a very common use case. Starting to disable some forms of scheduling just to, essentially, shoehorn in one particular syntax, doesn't seem like the way to go.

Perhaps this calls for support for the critical(?) block then, after all. I'm at least +1 on dropping threadsavailable() and instead requiring that you call numthreads() in a critical block:

    with parallel:
        with critical:
            # call numthreads() and allocate global buffer
            # calling threadid() not allowed, if we can manage that
        # get buffer slice for each thread
So basically, to make threadsavailable() work outside parallel regions, we'd have to disable dynamic scheduling (omp_set_dynamic(0)). Of course, when OpenMP cannot request the amount of threads desired (because they are bounded by a configurable thread limit (and the OS of course)), the behaviour will be implementation defined. So then we could just put a warning in the docs for that, and users can check for this in the parallel region using threadsavailable() if it's really important.
Do you have any experience with what actually happens with, say, GNU OpenMP? I blindly assumed from the specs that it was an error condition ("flag an error any way you like"), but I guess that may be wrong.

Just curious -- I think we can just fall back to OpenMP behaviour, unless it terminates the interpreter in an error condition, in which case we should look into how expensive it is to check for the condition up front...

Dag Sverre
On 13 April 2011 21:57, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 04/13/2011 09:31 PM, mark florisson wrote:
If we want to support cython.parallel.threadsavailable outside of parallel regions (which does not depend on the schedule used for worksharing constructs!), then we have to disable dynamic scheduling. For instance, if OpenMP sees some OpenMP threads are already busy, then with dynamic scheduling it dynamically establishes how many threads to use for any parallel region. So basically, if you put omp_get_num_threads() in a parallel region, you have a race when you depend on that result in a subsequent parallel region, because the number of busy OpenMP threads may have changed.
Ah, I don't know why I thought there wouldn't be a race condition. I wonder if the whole threadsavailable() idea should just be ditched and we should think of something else. It's not a very common use case. Starting to disable some forms of scheduling just to, essentially, shoehorn in one particular syntax, doesn't seem like the way to go.
Perhaps this calls for support for the critical(?) block then, after all. I'm at least +1 on dropping threadsavailable() and instead require that you call numthreads() in a critical block:
    with parallel:
        with critical:
            # call numthreads() and allocate global buffer
            # calling threadid() not allowed, if we can manage that
        # get buffer slice for each thread
In that case I think you'd want single + a barrier. 'critical' means that all threads execute the section, but exclusively. I think you usually want to allocate either a shared worksharing buffer or a private thread-local buffer. In the former case you can allocate your buffer outside any parallel section, in the latter case within the parallel section. In the latter case the buffer will just not be available outside of the parallel section. We can still support any write-back to shared variables that are explicitly declared later on (supposing we'd also support single and barriers). Then the code would read as follows:

    cdef shared(void *) buf
    cdef void *localbuf

    with nogil, parallel:
        with single:
            buf = malloc(n * numthreads())

        barrier()

        localbuf = buf + n * threadid()
        <actual code here that uses localbuf (or buf if you don't assign to it)>

    # localbuf undefined here
    # buf is well-defined here

However, I don't believe it's very common to want to use private buffers after the loop. If you have a buffer in terms of your loop size, you want it shared; I can't imagine a case where you want to examine buffers that were allocated specifically for each thread after the parallel section. So I'm +1 on dropping threadsavailable outside parallel sections, but currently -1 on supporting this case, because we can solve it later on with support for explicitly declared variables + single + barriers.
So basically, to make threadsavailable() work outside parallel regions, we'd have to disable dynamic scheduling (omp_set_dynamic(0)). Of course, when OpenMP cannot request the amount of threads desired (because they are bounded by a configurable thread limit (and the OS of course)), the behaviour will be implementation defined. So then we could just put a warning in the docs for that, and users can check for this in the parallel region using threadsavailable() if it's really important.
Do you have any experience with what actually happen with, say, GNU OpenMP? I blindly assumed from the specs that it was an error condition ("flag an error any way you like"), but I guess that may be wrong.
Just curious, I think we can just fall back to OpenMP behaviour; unless it terminates the interpreter in an error condition, in which case we should look into how expensive it is to check for the condition up front...
With libgomp you just get the maximum number of available threads, up to the number requested. So this code

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("The thread limit is: %d\n", omp_get_thread_limit());
    #pragma omp parallel num_threads(4)
        {
    #pragma omp single
            printf("We have %d threads in the thread team\n", omp_get_num_threads());
        }
        return 0;
    }

requests 4 threads, but gets only 2:

    [0] [22:28] ~/code/openmp ➤ OMP_THREAD_LIMIT=2 ./testomp
    The thread limit is: 2
    We have 2 threads in the thread team
Dag Sverre
On 13 April 2011 22:53, mark florisson <markflorisson88@gmail.com> wrote:
Although there is omp_get_max_threads():

"The omp_get_max_threads routine returns an upper bound on the number of threads that could be used to form a new team if a parallel region without a num_threads clause were encountered after execution returns from this routine."

So we could have threadsavailable() evaluate to that if encountered outside a parallel region. Inside, it would evaluate to omp_get_num_threads(). At worst, people would over-allocate a bit.
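As a sketch of that mapping (pseudocode; threadsavailable is the proposed name from this thread, the omp_* routines are the real OpenMP API):

    # outside a parallel region:
    #   threadsavailable() -> omp_get_max_threads(), a race-free upper bound
    nbound = threadsavailable()
    buf = malloc(nbound * n * sizeof(double))   # may over-allocate a bit

    with nogil, parallel:
        # inside a parallel region:
        #   threadsavailable() -> omp_get_num_threads(), the exact team size
        localbuf = buf + threadid() * n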
On 04/13/2011 11:13 PM, mark florisson wrote:
Although there is omp_get_max_threads():
"The omp_get_max_threads routine returns an upper bound on the number of threads that could be used to form a new team if a parallel region without a num_threads clause were encountered after execution returns from this routine."
So we could have threadsavailable() evaluate to that if encountered outside a parallel region. Inside, it would evaluate to omp_get_num_threads(). At worst, people would over-allocate a bit.
Well, over-allocating could well mean 1 GB, which could well mean getting an unnecessary MemoryError (or, like in my case, if I'm not careful to set ulimit, getting a SIGKILL sent to you 2 minutes after the fact by the cluster patrol process...)

But even ignoring this, we also have to plan for people misusing the feature. If we put it in there, somebody somewhere *will* write code like this:

    nthreads = threadsavailable()
    with parallel:
        for i in prange(nthreads):
            for j in range(100*i, 100*(i+1)):
                [...]

(Yes, they shouldn't. Yes, they will.)

Combined with a race condition that will only very seldom trigger, this starts to sound like a very bad idea indeed.

So I agree with you that we should just leave it for now, and do single/barrier later.

DS
On 14 April 2011 20:29, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Well, over-allocating could well mean 1 GB, which could well mean getting an unecesarry MemoryError (or, like in my case, if I'm not careful to set ulimit, getting a SIGKILL sent to you 2 minutes after the fact by the cluster patrol process...)
The upper bound is not "however many threads you think you can start", but rather "how many threads are considered useful for your machine". So if you use omp_set_num_threads(), it will return the value you set there. Otherwise, if you have e.g. a quadcore, it will return 4. The spec says: "Note – The return value of the omp_get_max_threads routine can be used to dynamically allocate sufficient storage for all threads in the team formed at the subsequent active parallel region." So this sounds like a viable option.
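So the behaviour would be, sketched against the real omp_* API (the values here are illustrative):

    omp_set_num_threads(3)    # on a quad core...
    omp_get_max_threads()     # -> 3, the value that was set
    # without the omp_set_num_threads() call, omp_get_max_threads()
    # would return 4 (one thread per core), and storage allocated for
    # that many threads is sufficient for the next parallel region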
On 04/14/2011 08:39 PM, mark florisson wrote:
The upper bound is not "however many threads you think you can start", but rather "how many threads are considered useful for your machine". So if you use omp_set_num_threads(), it will return the value you set there. Otherwise, if you have e.g. a quadcore, it will return 4. The spec says:
"Note – The return value of the omp_get_max_threads routine can be used to dynamically allocate sufficient storage for all threads in the team formed at the subsequent active parallel region."
So this sounds like a viable option.
What would happen here: We have 8 cores. Some code has an OpenMP parallel section with maxthreads=2, and inside the section another function is called. That called function uses threadsavailable(), and has a parallel block that wants as many threads as it can get.

I don't know the details as well as you do, but my uninformed guess is that in this case you could well have a race where omp_get_max_threads returns 7 in both threads, and then the first one to reach its parallel section gets the 7 threads. The remaining thread has then allocated storage for 7 threads but only has 1 thread running.

BTW, I'm not sure what the difference is between the original idea and omp_get_max_threads -- in the absence of races like the above, my original idea of entering a parallel section (with the same scheduling parameters) just to see how many threads we got would work as well?

DS
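A pseudocode sketch of that scenario (the maxthreads clause and threadsavailable are proposed names from this thread, not an existing API):

    def callee():
        n = threadsavailable()    # both outer threads may see 7...
        buf = malloc(n * sizeof(double))
        with nogil, parallel:
            # ...but the first team to start may grab all 7 idle
            # threads; the other call's team then runs with 1 thread,
            # although its buf was sized for 7
            ...

    with nogil, parallel(maxthreads=2):   # on 8 cores
        callee()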
On 14 April 2011 20:29, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
But even ignoring this, we also have to plan for people misusing the feature. If we put it in there, somebody somewhere *will* write code like this:
    nthreads = threadsavailable()
    with parallel:
        for i in prange(nthreads):
            for j in range(100*i, 100*(i+1)):
                [...]
(Yes, they shouldn't. Yes, they will.)
Combined with a race condition that will only very seldomly trigger, this starts to sound like a very bad idea indeed.
So I agree with you that we should just leave it for now, and do single/barrier later.
omp_get_max_threads() doesn't have a race, as it returns the upper bound. So if, between your call and your parallel section, fewer OpenMP threads become available, you might get fewer threads, but never more.
On 04/14/2011 08:42 PM, mark florisson wrote:
omp_get_max_threads() doesn't have a race, as it returns the upper bound. So if, between your call and your parallel section, fewer OpenMP threads become available, you might get fewer threads, but never more.
Oh, now I'm following you. Well, my argument was that I think erroring in that direction is pretty bad as well. Also, even if we're not making it available in cython.parallel, we're not stopping people from calling omp_get_max_threads directly themselves, which should be OK for the people who know enough to do this safely... Dag Sverre
On 14 April 2011 20:58, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Oh, now I'm following you.
Well, my argument was that I think erroring in that direction is pretty bad as well.
Also, even if we're not making it available in cython.parallel, we're not stopping people from calling omp_get_max_threads directly themselves, which should be OK for the people who know enough to do this safely...
True, but it wouldn't be as easy to wrap in a #ifdef _OPENMP. In any event, we could just put a warning in the docs stating that using threadsavailable outside parallel sections returns an upper bound on the actual number of threads in a subsequent parallel section.
On 04/14/2011 09:08 PM, mark florisson wrote:
True, but it wouldn't be as easy to wrap in a #ifdef _OPENMP. In any event, we could just put a warning in the docs stating that using threadsavailable outside parallel sections returns an upper bound on the actual number of threads in a subsequent parallel section.
I don't think outside or within makes a difference -- what about nested parallel sections? At least my intention in the CEP was that threadsavailable was always for the next section (so often it would be 1 after entering the section). Perhaps just calling it "maxthreads" instead solves the issue.

(Still, I favour just dropping threadsavailable/maxthreads for the time being. It is much simpler to add something later, when we've had some time to use it and reflect on it, than to remove something that shouldn't have been added.)

Dag Sverre
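A sketch of that intended meaning (pseudocode, names from the CEP):

    n = threadsavailable()    # threads the *next* parallel section may get, e.g. 4
    with nogil, parallel:
        m = threadsavailable()    # now refers to a *nested* section:
                                  # often 1, unless nested parallelism
                                  # (omp_set_nested) is supported and enabled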
On 14 April 2011 21:37, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
I don't think outside or within makes a difference -- what about nested parallel sections? At least my intention in the CEP was that threadsavailable was always for the next section (so often it would be 1 after entering the section).
Perhaps just calling it "maxthreads" instead solves the issue.
(Still, I favour just dropping threadsavailable/maxthreads for the time being. It is much simpler to add something later, when we've had some time to use it and reflect about it, than to remove something that shouldn't have been added.)
Definitely true, I'll disable it for now.
(Moving discussion from http://markflorisson.wordpress.com/, where Mark said:)

"""
Started a new branch https://github.com/markflorisson88/cython/tree/openmp .

Now the question is whether sharing attributes should be propagated outwards. e.g. if you do

    for i in prange(m):
        for j in prange(n):
            sum += i * j

then ‘sum’ is a reduction for the inner parallel loop, but not for the outer one. So the user would currently have to rewrite this to

    for i in prange(m):
        for j in prange(n):
            sum += i * j
        sum += 0

which seems a bit silly. Of course, we could just disable nested parallelism, or tell the users to use a prange and a ‘for from’ in such cases.
"""

Dag: Interesting. The first one is definitely the behaviour we want, as long as it doesn't cause unintended consequences. I don't really think it will -- the important thing is that the order of loop iteration evaluation must be unimportant. And that is still true (for the outer loop, as well as for the inner) in your first example.

Question: When you have nested pranges, what will happen is that two nested OpenMP parallel blocks are used, right? And do you know if there is complete freedom/"reentrancy", in the sense that variables that are thread-private in an outer parallel block can be shared in an inner one, and vice versa?

If so, I'd think that this algorithm should work and feel natural:

 - In each prange, for the purposes of variable private/shared/reduction inference, consider all internal "prange" loops just as if they had been "range"; no special treatment.
 - Recurse to children pranges.

DS
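The two inference rules above can be illustrated with a small toy in plain Python (an assumption for illustration only, not Cython's actual implementation): loops are plain dicts, and for each prange we collect the in-place updates in its body while looking through nested loops exactly as if they were ordinary ranges.

```python
def infer_reductions(loop):
    """Map each prange's id to the set of variables inferred as reductions.

    A loop is {'id': str, 'kind': 'prange' or 'range',
               'updates': [names updated in-place, e.g. via +=],
               'body': [nested loops]}.
    """
    result = {}

    def all_updates(l):
        # Collect in-place updates in l's body, looking *through*
        # nested loops as if every one of them were a plain range.
        ups = set(l['updates'])
        for child in l['body']:
            ups |= all_updates(child)
        return ups

    def walk(l):
        if l['kind'] == 'prange':
            result[l['id']] = all_updates(l)
        for child in l['body']:
            walk(child)          # recurse to children pranges

    walk(loop)
    return result

# The example from the thread: sum += i * j inside two nested pranges.
inner = {'id': 'inner', 'kind': 'prange', 'updates': ['sum'], 'body': []}
outer = {'id': 'outer', 'kind': 'prange', 'updates': [], 'body': [inner]}
print(infer_reductions(outer))   # 'sum' is inferred as a reduction for both loops
```

With this rule the user's first spelling already makes 'sum' a reduction for the outer loop too, so no "sum += 0" workaround is needed.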
On 16 April 2011 18:42, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
(Moving discussion from http://markflorisson.wordpress.com/, where Mark said:)
Ok, sure, it was just an issue I was wondering about at that moment, but it's a tricky issue, so thanks.
""" Started a new branch https://github.com/markflorisson88/cython/tree/openmp .
Now the question is whether sharing attributes should be propagated outwards. e.g. if you do
    for i in prange(m):
        for j in prange(n):
            sum += i * j
then ‘sum’ is a reduction for the inner parallel loop, but not for the outer one. So the user would currently have to rewrite this to
    for i in prange(m):
        for j in prange(n):
            sum += i * j
        sum += 0
which seems a bit silly. Of course, we could just disable nested parallelism, or tell the users to use a prange and a ‘for from’ in such cases. """
Dag: Interesting. The first one is definitely the behaviour we want, as long as it doesn't cause unintended consequences.
I don't really think it will -- the important thing is that the order of loop iteration evaluation must be unimportant. And that is still true (for the outer loop, as well as for the inner) in your first example.
Question: When you have nested pranges, what will happen is that two nested OpenMP parallel blocks are used, right? And do you know if there is complete freedom/"reentrancy" in that variables that are thread-private in an outer parallel block can be shared in an inner one, and vice versa?
An implementation may or may not support it, and if it is supported the behaviour can be configured through omp_set_nested(). So we should consider the case where it is supported and enabled.

If you have a lastprivate or reduction, then after the loop these are (reduced and) assigned to the original variable. So if that happens inside a parallel construct which does not declare the variable private to the construct, you actually have a race. So e.g. the nested prange currently races in the outer parallel range.
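The read-modify-write race described here can be simulated deterministically in pure Python, with the interleaving made explicit instead of using real threads (all names below are invented for illustration, and the fixed interleaving is chosen so the lost update is reproducible):

```python
# Deterministic simulation of the race: two "threads" each finish their
# worksharing loop and write their reduced partial result back to the
# shared variable with an unsynchronized read-modify-write.

shared_sum = 0

def reduce_back(partial, interleave_point):
    """Simulate 'shared_sum += partial' split into read / yield / write."""
    global shared_sum
    local = shared_sum            # read
    interleave_point()            # the other thread runs here
    shared_sum = local + partial  # write, possibly clobbering the other

# Thread B's whole read-modify-write happens *between* thread A's read
# and write, so B's contribution (here 10) is lost.
reduce_back(5, lambda: reduce_back(10, lambda: None))

assert shared_sum == 5  # not 15: thread B's update was overwritten
```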
If so I'd think that this algorithm should work and feel natural:
- In each prange, for the purposes of variable private/shared/reduction inference, consider all internal "prange" just as if they had been "range"; no special treatment.
- Recurse to children pranges.
Right, that is most natural. Algorithmically, reductions and lastprivates (as those can have races if placed in inner parallel constructs) propagate outwards towards the outermost parallel block, or up to the first parallel with block, or up to the first construct that already determined the sharing attribute.

e.g.

    with parallel:
        with parallel:
            for i in prange(n):
                for j in prange(n):
                    sum += i * j
            # sum is well-defined here
    # sum is undefined here

Here 'sum' is a reduction for the two innermost loops. 'sum' is not private for the inner parallel with block, as a prange in a parallel with block is a worksharing loop that binds to that parallel with block. However, the outermost parallel with block declares sum (and i and j) private, so after that block all those variables become undefined.

However, in the outermost parallel with block, sum will have to be initialized to 0 before anything else, or be declared firstprivate, otherwise 'sum' is undefined to begin with. Do you think declaring it firstprivate would be the way to go, or should we make it private and issue a warning or perhaps even an error?
DS
On 18 April 2011 13:06, mark florisson <markflorisson88@gmail.com> wrote:
On 16 April 2011 18:42, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
(Moving discussion from http://markflorisson.wordpress.com/, where Mark said:)
Ok, sure, it was just an issue I was wondering about at that moment, but it's a tricky issue, so thanks.
""" Started a new branch https://github.com/markflorisson88/cython/tree/openmp .
Now the question is whether sharing attributes should be propagated outwards. e.g. if you do
    for i in prange(m):
        for j in prange(n):
            sum += i * j
then ‘sum’ is a reduction for the inner parallel loop, but not for the outer one. So the user would currently have to rewrite this to
    for i in prange(m):
        for j in prange(n):
            sum += i * j
        sum += 0
which seems a bit silly. Of course, we could just disable nested parallelism, or tell the users to use a prange and a ‘for from’ in such cases. """
Dag: Interesting. The first one is definitely the behaviour we want, as long as it doesn't cause unintended consequences.
I don't really think it will -- the important thing is that the order of loop iteration evaluation must be unimportant. And that is still true (for the outer loop, as well as for the inner) in your first example.
Question: When you have nested pranges, what will happen is that two nested OpenMP parallel blocks are used, right? And do you know if there is complete freedom/"reentrancy" in that variables that are thread-private in an outer parallel block can be shared in an inner one, and vice versa?
An implementation may or may not support it, and if it is supported the behaviour can be configured through omp_set_nested(). So we should consider the case where it is supported and enabled.
If you have a lastprivate or reduction, then after the loop these are (reduced and) assigned to the original variable. So if that happens inside a parallel construct which does not declare the variable private to the construct, you actually have a race. So e.g. the nested prange currently races in the outer parallel range.
If so I'd think that this algorithm should work and feel natural:
- In each prange, for the purposes of variable private/shared/reduction inference, consider all internal "prange" just as if they had been "range"; no special treatment.
- Recurse to children pranges.
Right, that is most natural. Algorithmically, reductions and lastprivates (as those can have races if placed in inner parallel constructs) propagate outwards towards the outermost parallel block, or up to the first parallel with block, or up to the first construct that already determined the sharing attribute.
e.g.
    with parallel:
        with parallel:
            for i in prange(n):
                for j in prange(n):
                    sum += i * j
            # sum is well-defined here
    # sum is undefined here
Here 'sum' is a reduction for the two innermost loops. 'sum' is not private for the inner parallel with block, as a prange in a parallel with block is a worksharing loop that binds to that parallel with block. However, the outermost parallel with block declares sum (and i and j) private, so after that block all those variables become undefined.
However, in the outermost parallel with block, sum will have to be initialized to 0 before anything else, or be declared firstprivate, otherwise 'sum' is undefined to begin with. Do you think declaring it firstprivate would be the way to go, or should we make it private and issue a warning or perhaps even an error?
DS
Everything seems to be working, although now the user has to be careful with nested parallel blocks as variables can be private there (and not firstprivate), i.e., the user has to do initialization at the right place (e.g. in the outermost parallel block that determines it private). I'm thinking of adding a warning, as the C compiler does.

Two issues are remaining:

1) explicit declarations of firstprivates

Do we still want those?

2) buffer auxiliary vars

When unpacking numpy buffers and using typed numpy arrays, can reassignment or updates of a buffer-related variable ever occur in nogil code sections? I'm thinking this is not possible and therefore all buffer variables may be shared in parallel (for) sections?
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)

You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.

Firstprivate etc: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about that and not do anything rash; perhaps open up a specific thread on them and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to review and merge what you have so far? You'll know best what works for you though. If you decide to implement explicit threadprivate variables because you've got the flow I certainly won't object myself.

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about that and not do anything rash; perhaps open up a specific thread on them and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to review and merge what you have so far? You'll know best what works for you though. If you decide to implement explicit threadprivate variables because you've got the flow I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about that and not do anything rash; perhaps open up a specific thread on them and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to review and merge what you have so far? You'll know best what works for you though. If you decide to implement explicit threadprivate variables because you've got the flow I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.

+1 to disallowing nested prange, that seems to get really messy with little benefit.

In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that as an implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.

The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.

    with parallel(threadlocal=a, shared=(b,c)):
        ...

which would be considered an "expert" usecase.

For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no?

I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).

- Robert
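Robert's per-thread-buffer suggestion can be sketched with plain Python threads (the `run_workers` helper and its parameters are invented for illustration; real prange workers would be OpenMP threads in the generated C):

```python
# Sketch: instead of allocating one big shared buffer and partitioning
# it by thread id, each worker allocates its own buffer, so no
# threadid/threadsavailable arithmetic is needed.

import threading

def run_workers(num_threads, chunk):
    results = {}
    lock = threading.Lock()

    def worker(tid):
        local_buf = [0] * chunk        # per-thread buffer: no partitioning
        for k in range(chunk):
            local_buf[k] = tid * chunk + k
        with lock:                     # only the final combine is shared
            results[tid] = sum(local_buf)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.values())

# Equivalent to summing 0 .. num_threads*chunk - 1:
assert run_workers(4, 8) == sum(range(32))
```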
On 21 April 2011 10:37, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about that and not do anything rash; perhaps open up a specific thread on them and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to review and merge what you have so far? You'll know best what works for you though. If you decide to implement explicit threadprivate variables because you've got the flow I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that as an implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
Yes, they are currently undefined (and not initialized to NaN etc). The thing is that without the control flow analysis (or perhaps not until runtime) you won't know whether a variable is initialized at all before the parallel section, so making it firstprivate might actually copy an undefined value (perhaps with a trap representation!) into the thread-private copy, which might invalidate valid code. e.g. consider

    x_is_initialized = False
    if condition:
        x = 1
        x_is_initialized = True

    for i in prange(10, schedule='static'):
        if x_is_initialized:
            printf("%d\n", x)
        x = i
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
    with parallel(threadlocal=a, shared=(b,c)):
        ...
which would be considered an "expert" usecase.
Indeed, assigning to elements in an array instead doesn't seem very convenient :)
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
Currently 'with gil' isn't merged into that branch, and if it will, it will be disallowed, as I'm not yet sure how (if at all) it could be handled with regard to exceptions. It seems a lot easier to disallow it and have the user write a 'with gil' function, from which nothing can propagate.
- Robert
On Thu, Apr 21, 2011 at 1:59 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 10:37, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about that and not do anything rash; perhaps open up a specific thread on them and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to review and merge what you have so far? You'll know best what works for you though. If you decide to implement explicit threadprivate variables because you've got the flow I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that as an implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
Yes, they are currently undefined (and not initialized to NaN etc). The thing is that without the control flow analysis (or perhaps not until runtime) you won't know whether a variable is initialized at all before the parallel section, so making it firstprivate might actually copy an undefined value (perhaps with a trap representation!) into the thread-private copy, which might invalidate valid code. e.g. consider
    x_is_initialized = False
    if condition:
        x = 1
        x_is_initialized = True

    for i in prange(10, schedule='static'):
        if x_is_initialized:
            printf("%d\n", x)
        x = i
I'm still failing to see how this is a problem (or anything new, as opposed to this same example with an ordinary range).
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
    with parallel(threadlocal=a, shared=(b,c)):
        ...
which would be considered an "expert" usecase.
Indeed, assigning to elements in an array instead doesn't seem very convenient :)
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
Currently 'with gil' isn't merged into that branch, and if it will, it will be disallowed, as I'm not yet sure how (if at all) it could be handled with regard to exceptions. It seems a lot easier to disallow it and have the user write a 'with gil' function, from which nothing can propagate.
Not being able to propagate exceptions is a pretty strong constraint--even if the implementation doesn't yet support it, it'd be nice to have an API that makes it possible as a future feature.

- Robert
On 21 April 2011 11:18, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Thu, Apr 21, 2011 at 1:59 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 10:37, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about that and not do anything rash; perhaps open up a specific thread on them and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to review and merge what you have so far? You'll know best what works for you though. If you decide to implement explicit threadprivate variables because you've got the flow I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that as an implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
Yes, they are currently undefined (and not initialized to NaN etc). The thing is that without the control flow analysis (or perhaps not until runtime) you won't know whether a variable is initialized at all before the parallel section, so making it firstprivate might actually copy an undefined value (perhaps with a trap representation!) into the thread-private copy, which might invalidate valid code. e.g. consider
    x_is_initialized = False
    if condition:
        x = 1
        x_is_initialized = True

    for i in prange(10, schedule='static'):
        if x_is_initialized:
            printf("%d\n", x)
        x = i
I'm still failing to see how this is a problem (or anything new, as opposed to this same example with an ordinary range).
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
    with parallel(threadlocal=a, shared=(b,c)):
        ...
which would be considered an "expert" usecase.
Indeed, assigning to elements in an array instead doesn't seem very convenient :)
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
Currently 'with gil' isn't merged into that branch, and if it will, it will be disallowed, as I'm not yet sure how (if at all) it could be handled with regard to exceptions. It seems a lot easier to disallow it and have the user write a 'with gil' function, from which nothing can propagate.
Not being able to propagate exceptions is a pretty strong constraint--even if the implementation doesn't yet support it, it'd be nice to have an API that makes it possible as a future feature.
It would be possible, with some modifications to try/finally. I think it'd be best to stabilize and merge with gil first.
- Robert
On 21 April 2011 10:59, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 10:37, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about that and not do anything rash; perhaps open up a specific thread on them and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to review and merge what you have so far? You'll know best what works for you though. If you decide to implement explicit threadprivate variables because you've got the flow I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that as an implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
Yes, they are currently undefined (and not initialized to NaN etc). The thing is that without the control flow analysis (or perhaps not until runtime) you won't know whether a variable is initialized at all before the parallel section, so making it firstprivate might actually copy an undefined value (perhaps with a trap representation!) into the thread-private copy, which might invalidate valid code. e.g. consider
    x_is_initialized = False
    if condition:
        x = 1
        x_is_initialized = True

    for i in prange(10, schedule='static'):
        if x_is_initialized:
            printf("%d\n", x)
        x = i
Erm, that snippet I posted is invalid in any case, as x will be private. So I guess initializing things to NaN in such cases would have to occur in the parallel section that should enclose the for. So e.g. we'd have to do

    #pragma omp parallel private(x)
    {
        x = INT_MAX;
        #pragma omp for lastprivate(i)
        for (...)
            ...
    }

Which would then mean that 'x' cannot be lastprivate anymore :). So it's either "uninitialized and undefined" or "firstprivate". I personally prefer the former for the implicit route.

I do like the threadlocal=a stuff to parallel, it's basically what I proposed a while back except that you don't make them strings, but better because most of your variables can be inferred, so the messiness is gone.
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
    with parallel(threadlocal=a, shared=(b,c)):
        ...
which would be considered an "expert" usecase.
Indeed, assigning to elements in an array instead doesn't seem very convenient :)
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
Currently 'with gil' isn't merged into that branch, and if it is merged, it will be disallowed there, as I'm not yet sure how (if at all) it could be handled with regard to exceptions. It seems a lot easier to disallow it and have the user write a 'with gil' function, from which nothing can propagate.
- Robert

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
On Thu, Apr 21, 2011 at 2:21 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 10:59, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 10:37, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc.: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about it and not do anything rash; perhaps open up a specific thread on it and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to reviewing and merging what you have so far? You'll know best what works for you, though. If you decide to implement explicit threadprivate variables because you've got the flow, I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that in the implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
Yes, they are currently undefined (and not initialized to NaN etc). The thing is that without the control flow analysis (or perhaps not until runtime) you won't know whether a variable is initialized at all before the parallel section, so making it firstprivate might actually copy an undefined value (perhaps with a trap representation!) into the thread-private copy, which might invalidate valid code. e.g. consider
    x_is_initialized = False
    if condition:
        x = 1
        x_is_initialized = True

    for i in prange(10, schedule='static'):
        if x_is_initialized:
            printf("%d\n", x)
        x = i
Erm, that snippet I posted is invalid in any case, as x will be private. So I guess initializing things to NaN in such cases would have to occur in the parallel section that encloses the for loop. So e.g. we'd have to do
    #pragma omp parallel private(x)
    {
        x = INT_MAX;
        #pragma omp for lastprivate(i)
        for (...)
            ...
    }
Which would then mean that 'x' cannot be lastprivate anymore :). So it's either "uninitialized and undefined" or "firstprivate". I personally prefer the former for the implicit route.
A variable can't be both first and last private? In any case, as long as we don't promise anything about them now, we can decide later.
I do like the threadlocal=a stuff to parallel; it's basically what I proposed a while back, except that you don't make them strings. But it's better, because most of your variables can be inferred, so the messiness is gone.
Yep.

- Robert
On 21 April 2011 11:37, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Thu, Apr 21, 2011 at 2:21 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 10:59, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 10:37, Robert Bradshaw <robertwb@math.washington.edu> wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc.: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about it and not do anything rash; perhaps open up a specific thread on it and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to reviewing and merging what you have so far? You'll know best what works for you, though. If you decide to implement explicit threadprivate variables because you've got the flow, I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that in the implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
Yes, they are currently undefined (and not initialized to NaN etc). The thing is that without the control flow analysis (or perhaps not until runtime) you won't know whether a variable is initialized at all before the parallel section, so making it firstprivate might actually copy an undefined value (perhaps with a trap representation!) into the thread-private copy, which might invalidate valid code. e.g. consider
    x_is_initialized = False
    if condition:
        x = 1
        x_is_initialized = True

    for i in prange(10, schedule='static'):
        if x_is_initialized:
            printf("%d\n", x)
        x = i
Erm, that snippet I posted is invalid in any case, as x will be private. So I guess initializing things to NaN in such cases would have to occur in the parallel section that encloses the for loop. So e.g. we'd have to do
    #pragma omp parallel private(x)
    {
        x = INT_MAX;
        #pragma omp for lastprivate(i)
        for (...)
            ...
    }
Which would then mean that 'x' cannot be lastprivate anymore :). So it's either "uninitialized and undefined" or "firstprivate". I personally prefer the former for the implicit route.
A variable can't be both first and last private? In any case, as long as we don't promise anything about them now, we can decide later.
It can be, but not if the binding parallel region declares it private. So we wouldn't actually need the snippet above; we could just do

    x = INT_MAX;
    #pragma omp parallel for firstprivate(x) lastprivate(i, x)
    for (...)
        ...

Yeah, that would work.
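Spelled out as a compilable C sketch (the `run` wrapper is hypothetical; serial when OpenMP is off, in which case the firstprivate/lastprivate semantics hold trivially):

```c
#include <limits.h>

/* x enters the loop as INT_MAX (firstprivate) and the value from the
 * sequentially last iteration is copied back out (lastprivate); i likewise. */
int run(int n)
{
    int i, x = INT_MAX;
    #pragma omp parallel for firstprivate(x) lastprivate(i, x)
    for (i = 0; i < n; i++) {
        x = i;   /* the sequentially last iteration determines the result */
    }
    return x;
}
```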
I do like the threadlocal=a stuff to parallel; it's basically what I proposed a while back, except that you don't make them strings. But it's better, because most of your variables can be inferred, so the messiness is gone.
Yep.
- Robert
On 04/21/2011 10:37 AM, Robert Bradshaw wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc.: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about it and not do anything rash; perhaps open up a specific thread on it and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to reviewing and merging what you have so far? You'll know best what works for you, though. If you decide to implement explicit threadprivate variables because you've got the flow, I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that in the implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
I don't see any technical issues with inferring firstprivate; the question is whether we want to. I suggest not inferring it in order to make this safer: one should be able to just try to change a loop from "range" to "prange", and either a) have things fail very hard, or b) just work correctly and be able to trust the results.

Note that when I suggest using NaN, it is as initial values for EACH ITERATION, not per-thread initialization. It is not about "firstprivate" or not, but about disabling thread-private variables entirely in favor of "per-iteration" variables.

I believe that by talking about "readonly" and "per-iteration" variables, rather than "thread-shared" and "thread-private" variables, this can be used much more safely and with virtually no knowledge of the details of threading. Again, what's in my mind are scientific programmers with (too) little training.

In the end it's a matter of taste and what is most convenient to most users. But I believe the case of needing real thread-private variables that preserve per-thread values across iterations (and thus can also benefit from firstprivate) comes up seldom enough that an explicit declaration is OK, in particular when it buys us so much safety in the common case.

To be very precise,

    cdef double x, z
    for i in prange(n):
        x = f(x)
        z = f(i)
        ...

goes to

    cdef double x, z
    for i in prange(n):
        x = z = nan
        x = f(x)
        z = f(i)
        ...

and we leave it to the C compiler to (trivially) optimize away "z = nan". And, yes, it is a stopgap solution until we've got control flow analysis, so that we can outright disallow such uses of x (without a threadprivate declaration, which also gives firstprivate behaviour).
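Dag's per-iteration rewrite, approximated in C (the `f` and `last_z` names are stand-ins invented here; NAN is the C99 math.h macro, and without -fopenmp the pragma is ignored and the loop runs serially with the same result):

```c
#include <math.h>

/* Hypothetical stand-in for the user's per-iteration function. */
static double f(double v) { return 2.0 * v; }

/* Every iteration first poisons its private copies, so a read of x
 * before any assignment yields NaN rather than a stale value carried
 * over from an earlier iteration on the same thread. */
double last_z(int n)
{
    double x, z = 0.0;
    int i;
    #pragma omp parallel for private(x) lastprivate(z)
    for (i = 0; i < n; i++) {
        x = z = (double)NAN;  /* per-iteration poisoning */
        z = f((double)i);     /* z is always rewritten, so its NaN store is dead */
        (void)x;              /* x stays NaN unless the body assigns it */
    }
    return z;                 /* value from the sequentially last iteration */
}
```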
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
with parallel(threadlocal=a, shared=(b,c)): ...
which would be considered an "expert" usecase.
I'm not set on the syntax for threadlocal variables, although your proposal feels funny/very unpythonic to me, almost like a C macro. For some inspiration, here's the Python solution (with no obvious place to put the type):

    import threading
    mydata = threading.local()
    mydata.myvar = ...  # value is threadprivate
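For reference, the C-level analogue that such a declaration could lower to is plain thread-local storage; with C11 it is a single keyword (a sketch with hypothetical names, shown single-threaded here, though each spawned thread would see its own copy):

```c
/* C11 thread-local storage: every thread gets an independent 'myvar',
 * much like an attribute on a threading.local() instance. */
static _Thread_local int myvar = 0;

int bump(void)
{
    return ++myvar;  /* touches only the calling thread's copy */
}
```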
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
I'd *really* like to have try/finally available in a cython.parallel block for this, although I realize that may have to wait for a while. A big part of our discussions at the workshop was about how to handle exceptions; I guess there'll be a "phase 2" of this where break/continue/raise is dealt with.

Dag Sverre
On Thu, Apr 21, 2011 at 11:13 AM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 04/21/2011 10:37 AM, Robert Bradshaw wrote:
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that in the implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
I don't see any technical issues with inferring firstprivate, the question is whether we want to. I suggest not inferring it in order to make this safer: One should be able to just try to change a loop from "range" to "prange", and either a) have things fail very hard, or b) just work correctly and be able to trust the results.
Note that when I suggest using NaN, it is as initial values for EACH ITERATION, not per-thread initialization. It is not about "firstprivate" or not, but about disabling thread-private variables entirely in favor of "per-iteration" variables.
I believe that by talking about "readonly" and "per-iteration" variables, rather than "thread-shared" and "thread-private" variables, this can be used much more safely and with virtually no knowledge of the details of threading. Again, what's in my mind are scientific programmers with (too) little training.
In the end it's a matter of taste and what is most convenient to most users. But I believe the case of needing real thread-private variables that preserve per-thread values across iterations (and thus can also benefit from firstprivate) comes up seldom enough that an explicit declaration is OK, in particular when it buys us so much safety in the common case.
To be very precise,
    cdef double x, z
    for i in prange(n):
        x = f(x)
        z = f(i)
        ...
goes to
    cdef double x, z
    for i in prange(n):
        x = z = nan
        x = f(x)
        z = f(i)
        ...
and we leave it to the C compiler to (trivially) optimize away "z = nan". And, yes, it is a stopgap solution until we've got control flow analysis so that we can outright disallow such uses of x (without threadprivate declaration, which also gives firstprivate behaviour).
OK, I had totally missed that these are per-iteration. In that case, it makes more sense.
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
with parallel(threadlocal=a, shared=(b,c)): ...
which would be considered an "expert" usecase.
I'm not set on the syntax for threadlocal variables, although your proposal feels funny/very unpythonic to me, almost like a C macro. For some inspiration, here's the Python solution (with no obvious place to put the type):
    import threading
    mydata = threading.local()
    mydata.myvar = ...  # value is threadprivate
That's nice and Pythonic, though I'm not sure how we would handle typing and the passing of "mydata" around if we wanted to go that route. We have cython.locals, we could introduce cython.parallel.threadlocals(a=int), though this is a bit magical as well.
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
I'd *really* like to have try/finally available in a cython.parallel block for this, although I realize that may have to wait for a while. A big part of our discussions at the workshop was about how to handle exceptions; I guess there'll be a "phase 2" of this where break/continue/raise is dealt with.
Yeah, this is definitely an (important) second or third phase.

- Robert
On 21 April 2011 20:13, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 04/21/2011 10:37 AM, Robert Bradshaw wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc.: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about it and not do anything rash; perhaps open up a specific thread on it and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to reviewing and merging what you have so far? You'll know best what works for you, though. If you decide to implement explicit threadprivate variables because you've got the flow, I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that in the implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
I don't see any technical issues with inferring firstprivate, the question is whether we want to. I suggest not inferring it in order to make this safer: One should be able to just try to change a loop from "range" to "prange", and either a) have things fail very hard, or b) just work correctly and be able to trust the results.
Note that when I suggest using NaN, it is as initial values for EACH ITERATION, not per-thread initialization. It is not about "firstprivate" or not, but about disabling thread-private variables entirely in favor of "per-iteration" variables.
I believe that by talking about "readonly" and "per-iteration" variables, rather than "thread-shared" and "thread-private" variables, this can be used much more safely and with virtually no knowledge of the details of threading. Again, what's in my mind are scientific programmers with (too) little training.
In the end it's a matter of taste and what is most convenient to most users. But I believe the case of needing real thread-private variables that preserve per-thread values across iterations (and thus can also benefit from firstprivate) comes up seldom enough that an explicit declaration is OK, in particular when it buys us so much safety in the common case.
To be very precise,
    cdef double x, z
    for i in prange(n):
        x = f(x)
        z = f(i)
        ...
goes to
    cdef double x, z
    for i in prange(n):
        x = z = nan
        x = f(x)
        z = f(i)
        ...
and we leave it to the C compiler to (trivially) optimize away "z = nan". And, yes, it is a stopgap solution until we've got control flow analysis so that we can outright disallow such uses of x (without threadprivate declaration, which also gives firstprivate behaviour).
Ah, I see, sure, that sounds sensible. I'm currently working on fused types, so when I finish that up I'll return to that.
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
with parallel(threadlocal=a, shared=(b,c)): ...
which would be considered an "expert" usecase.
I'm not set on the syntax for threadlocal variables, although your proposal feels funny/very unpythonic to me, almost like a C macro. For some inspiration, here's the Python solution (with no obvious place to put the type):
    import threading
    mydata = threading.local()
    mydata.myvar = ...  # value is threadprivate
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
I'd *really* like to have try/finally available in a cython.parallel block for this, although I realize that may have to wait for a while. A big part of our discussions at the workshop was about how to handle exceptions; I guess there'll be a "phase 2" of this where break/continue/raise is dealt with.
I'll leave that until I finish fused types and the typed memory views. Before starting on that, I'd first review the 'with gil' block and ensure the tests pass in all Python versions; perhaps that should be merged before I pull it into the parallel branch? Otherwise you're kind of forced to review both branches.
Dag Sverre
On Tue, Apr 26, 2011 at 7:25 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 21 April 2011 20:13, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 04/21/2011 10:37 AM, Robert Bradshaw wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc.: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about it and not do anything rash; perhaps open up a specific thread on it and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to reviewing and merging what you have so far? You'll know best what works for you, though. If you decide to implement explicit threadprivate variables because you've got the flow, I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that in the implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
I don't see any technical issues with inferring firstprivate, the question is whether we want to. I suggest not inferring it in order to make this safer: One should be able to just try to change a loop from "range" to "prange", and either a) have things fail very hard, or b) just work correctly and be able to trust the results.
Note that when I suggest using NaN, it is as initial values for EACH ITERATION, not per-thread initialization. It is not about "firstprivate" or not, but about disabling thread-private variables entirely in favor of "per-iteration" variables.
I believe that by talking about "readonly" and "per-iteration" variables, rather than "thread-shared" and "thread-private" variables, this can be used much more safely and with virtually no knowledge of the details of threading. Again, what's in my mind are scientific programmers with (too) little training.
In the end it's a matter of taste and what is most convenient to most users. But I believe the case of needing real thread-private variables that preserve per-thread values across iterations (and thus can also benefit from firstprivate) comes up seldom enough that an explicit declaration is OK, in particular when it buys us so much safety in the common case.
To be very precise,
    cdef double x, z
    for i in prange(n):
        x = f(x)
        z = f(i)
        ...
goes to
    cdef double x, z
    for i in prange(n):
        x = z = nan
        x = f(x)
        z = f(i)
        ...
and we leave it to the C compiler to (trivially) optimize away "z = nan". And, yes, it is a stopgap solution until we've got control flow analysis so that we can outright disallow such uses of x (without threadprivate declaration, which also gives firstprivate behaviour).
Ah, I see, sure, that sounds sensible. I'm currently working on fused types, so when I finish that up I'll return to that.
The "cdef threadlocal(int) foo" declaration syntax feels odd to me... We also probably want some way of explicitly marking a variable as shared and still be able to assign to/flush/sync it. Perhaps the parallel context could be used for these declarations, i.e.
with parallel(threadlocal=a, shared=(b,c)): ...
which would be considered an "expert" usecase.
I'm not set on the syntax for threadlocal variables, although your proposal feels funny/very unpythonic to me, almost like a C macro. For some inspiration, here's the Python solution (with no obvious place to put the type):
    import threading
    mydata = threading.local()
    mydata.myvar = ...  # value is threadprivate
For all the discussion of threadsavailable/threadid, the most common usecase I see is for allocating a large shared buffer and partitioning it. This seems better handled by allocating separate thread-local buffers, no? I still like the context idea, but everything in a parallel block before and after the loop(s) also seems like a natural place to put any setup/teardown code (though the context has the advantage that __exit__ is always called, even if exceptions are raised, which makes cleanup a lot easier to handle).
I'd *really* like to have try/finally available in a cython.parallel block for this, although I realize that may have to wait for a while. A big part of our discussions at the workshop was about how to handle exceptions; I guess there'll be a "phase 2" of this where break/continue/raise is dealt with.
I'll leave that until I finish fused types and the typed memory views. Before starting on that, I'd first review the 'with gil' block and ensure the tests pass in all Python versions; perhaps that should be merged before I pull it into the parallel branch? Otherwise you're kind of forced to review both branches.
Yes, that makes sense.

- Robert
On 21 April 2011 20:13, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 04/21/2011 10:37 AM, Robert Bradshaw wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc.: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about it and not do anything rash; perhaps open up a specific thread on it and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to reviewing and merging what you have so far? You'll know best what works for you, though. If you decide to implement explicit threadprivate variables because you've got the flow, I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that in the implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
I don't see any technical issues with inferring firstprivate, the question is whether we want to. I suggest not inferring it in order to make this safer: One should be able to just try to change a loop from "range" to "prange", and either a) have things fail very hard, or b) just work correctly and be able to trust the results.
Note that when I suggest using NaN, it is as initial values for EACH ITERATION, not per-thread initialization. It is not about "firstprivate" or not, but about disabling thread-private variables entirely in favor of "per-iteration" variables.
I believe that by talking about "readonly" and "per-iteration" variables, rather than "thread-shared" and "thread-private" variables, this can be used much more safely and with virtually no knowledge of the details of threading. Again, what's in my mind are scientific programmers with (too) little training.
In the end it's a matter of taste and what is most convenient to most users. But I believe the case of needing real thread-private variables that preserve per-thread values across iterations (and thus can also benefit from firstprivate) comes up seldom enough that an explicit declaration is OK, in particular when it buys us so much safety in the common case.
To be very precise,
    cdef double x, z
    for i in prange(n):
        x = f(x)
        z = f(i)
        ...
goes to
    cdef double x, z
    for i in prange(n):
        x = z = nan
        x = f(x)
        z = f(i)
        ...
and we leave it to the C compiler to (trivially) optimize away "z = nan". And, yes, it is a stopgap solution until we've got control flow analysis so that we can outright disallow such uses of x (without threadprivate declaration, which also gives firstprivate behaviour).
I think the preliminary OpenMP support is ready for review. It supports 'with cython.parallel.parallel:' and 'for i in cython.parallel.prange(...):'. It works in generators and closures, and the docs are updated. Support for break/continue/with gil isn't there yet.

There are two remaining issues. The first is warnings for potentially uninitialized variables with prange(). When you do

    for i in prange(start, stop, step):
        ...

it generates code like

    nsteps = (stop - start) / step;
    #pragma omp parallel for lastprivate(i)
    for (temp = 0; temp < nsteps; temp++) {
        i = start + temp * step;
        ...
    }

So here it will complain about 'i' being potentially uninitialized, as it might not be assigned to in the loop. However, simply assigning 0 to 'i' can't work either, as you expect zero iterations not to touch it. So for now we have a bunch of warnings, as I don't see an __attribute__ to suppress them selectively.

The second is NaN-ing private variables: NaN isn't part of C89. For gcc, the docs ( http://www.delorie.com/gnu/docs/glibc/libc_407.html ) have the following to say:

"You can use `#ifdef NAN' to test whether the machine supports NaN. (Of course, you must arrange for GNU extensions to be visible, such as by defining _GNU_SOURCE, and then you must include `math.h'.)"

So I'm thinking that if NaN is not available (or the compiler is not GCC), we can use FLT_MAX, DBL_MAX and LDBL_MAX instead, from float.h. Would this be the proper way to handle this?
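The trip-count and index mapping described above, plus the NAN-or-DBL_MAX fallback, as a small compilable sketch (the function and macro names are made up here; the rounding question for a step that does not evenly divide the range is left aside, as in the message):

```c
#include <math.h>
#include <float.h>

/* Poison value for private doubles: NaN where the C99 macro exists,
 * otherwise the DBL_MAX fallback suggested above. */
#ifdef NAN
#define PRIVATE_POISON ((double)NAN)
#else
#define PRIVATE_POISON DBL_MAX
#endif

/* Trip count of prange(start, stop, step), as in the generated code. */
int nsteps(int start, int stop, int step)
{
    return (stop - start) / step;
}

/* The generated loop body recovers i from the normalized counter temp. */
int index_at(int start, int step, int temp)
{
    return start + temp * step;
}
```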
On 05/04/2011 12:00 PM, mark florisson wrote:
On 21 April 2011 20:13, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 04/21/2011 10:37 AM, Robert Bradshaw wrote:
On Mon, Apr 18, 2011 at 7:51 AM, mark florisson <markflorisson88@gmail.com> wrote:
On 18 April 2011 16:41, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Excellent! Sounds great! (as I won't have my laptop for some days I can't have a look yet but I will later)
You're right about (the current) buffers and the gil. A testcase explicitly for them would be good.
Firstprivate etc.: I think it'd be nice myself, but it is probably better to take a break from it at this point so that we can think more about it and not do anything rash; perhaps open up a specific thread on it and ask for more general input. Perhaps you want to take a break or task-switch to something else (fused types?) until I can get around to reviewing and merging what you have so far? You'll know best what works for you, though. If you decide to implement explicit threadprivate variables because you've got the flow, I certainly won't object myself.
Ok, cool, I'll move on :) I already included a test with a prange and a numpy buffer with indexing.
Wow, you're just plowing away at this. Very cool.
+1 to disallowing nested prange, that seems to get really messy with little benefit.
In terms of the CEP, I'm still unconvinced that firstprivate is not safe to infer, but let's leave the initial values undefined rather than specifying them to be NaNs (we can do that as an implementation if you want), which will give us flexibility to change later once we've had a chance to play around with it.
I don't see any technical issues with inferring firstprivate; the question is whether we want to. I suggest not inferring it in order to make this safer: one should be able to just try to change a loop from "range" to "prange", and either a) have things fail very hard, or b) just work correctly and be able to trust the results.
Note that when I suggest using NaN, it is as initial values for EACH ITERATION, not per-thread initialization. It is not about "firstprivate" or not, but about disabling thread-private variables entirely in favor of "per-iteration" variables.
I believe that by talking about "readonly" and "per-iteration" variables, rather than "thread-shared" and "thread-private" variables, this can be used much more safely and with virtually no knowledge of the details of threading. Again, what's in my mind are scientific programmers with (too) little training.
In the end it's a matter of taste and what is most convenient to more users. But I believe the case of needing real thread-private variables that preserve per-thread values across iterations (and thus can also possibly benefit from firstprivate) is used seldom enough that an explicit declaration is OK, in particular when it buys us so much safety in the common case.
To be very precise,
    cdef double x, z
    for i in prange(n):
        x = f(x)
        z = f(i)
        ...
goes to
    cdef double x, z
    for i in prange(n):
        x = z = nan
        x = f(x)
        z = f(i)
        ...
and we leave it to the C compiler to (trivially) optimize away "z = nan". And, yes, it is a stopgap solution until we've got control flow analysis so that we can outright disallow such uses of x (without threadprivate declaration, which also gives firstprivate behaviour).
I think the preliminary OpenMP support is ready for review. It supports 'with cython.parallel.parallel:' and 'for i in cython.parallel.prange(...):'. It works in generators and closures and the docs are updated. Support for break/continue/with gil isn't there yet.
There are two remaining issues. The first is warnings for potentially uninitialized variables for prange(). When you do
for i in prange(start, stop, step): ...
it generates code like
    nsteps = (stop - start) / step;
    #pragma omp parallel for lastprivate(i)
    for (temp = 0; temp < nsteps; temp++) {
        i = start + temp * step;
        ...
    }
So here it will complain about 'i' being potentially uninitialized, as it might not be assigned to in the loop. However, simply assigning 0 to 'i' can't work either, as you expect zero iterations not to touch it. So for now, we have a bunch of warnings, as I don't see an __attribute__ to suppress it selectively.
Isn't this orthogonal to OpenMP -- even if it said "range", your testcase could get such a warning? If so, the fix is simply to initialize i in your testcase code.
The second is NaN-ing private variables; NaN isn't part of C. For gcc, the docs ( http://www.delorie.com/gnu/docs/glibc/libc_407.html ) have the following to say:
"You can use `#ifdef NAN' to test whether the machine supports NaN. (Of course, you must arrange for GNU extensions to be visible, such as by defining _GNU_SOURCE, and then you must include `math.h'.)"
So I'm thinking that if NaN is not available (or the compiler is not GCC), we can use FLT_MAX, DBL_MAX and LDBL_MAX instead from float.h. Would this be the proper way to handle this?
I think it is sufficient. A relatively portable way would be to initialize a double variable to 0.0/0.0 at program startup; a problem is that that would flag exceptions in the FPU though. Here's some more compiler-specific stuff I found: http://www.koders.com/c/fid6EF58B6683BCD810AE371607818952EB039CBC32.aspx DS
On 4 May 2011 12:45, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Isn't this orthogonal to OpenMP -- even if it said "range", your testcase could get such a warning? If so, the fix is simply to initialize i in your testcase code.
No, the problem is that 'i' needs to be lastprivate, and 'i' is assigned to in the loop body. It's irrelevant whether 'i' is assigned to before the loop. I think this is the case because the spec says that lastprivate variables will get the value of the private variable of the last sequential iteration, but it cannot at compile time know whether there might be zero iterations, which I believe the spec doesn't have anything to say about. So basically we could guard against it by checking if nsteps > 0, but the compiler doesn't detect this, so it will still issue a warning even if 'i' is initialized (the warning is at the place of the lastprivate declaration).
I think it is sufficient. A relatively portable way would be to initialize a double variable to 0.0/0.0 at program startup; a problem is that that would flag exceptions in the FPU though.
Here's some more compiler-specific stuff I found:
http://www.koders.com/c/fid6EF58B6683BCD810AE371607818952EB039CBC32.aspx
Thanks, I'll take a look!
DS _______________________________________________ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
On 05/04/2011 12:59 PM, mark florisson wrote:
No, the problem is that 'i' needs to be lastprivate, and 'i' is assigned to in the loop body. It's irrelevant whether 'i' is assigned to before the loop. I think this is the case because the spec says that lastprivate variables will get the value of the private variable of the last sequential iteration, but it cannot at compile time know whether there might be zero iterations, which I believe the spec doesn't have anything to say about. So basically we could guard against it by checking if nsteps > 0, but the compiler doesn't detect this, so it will still issue a warning even if 'i' is initialized (the warning is at the place of the lastprivate declaration).
Ah. But this is then more important than I initially thought it was. You are saying that this is the case:

    cdef int i = 0
    with nogil:
        for i in prange(n):
            ...
    print i  # garbage when n == 0?

It would be in the interest of fewer semantic differences w.r.t. range to deal better with this case.

Will it silence the warning if we make "i" firstprivate as well as lastprivate? firstprivate would only affect the case of zero iterations, since we overwrite with NaN if the loop is entered...

Dag
On 4 May 2011 13:15, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Ah. But this is then more important than I initially thought it was. You are saying that this is the case:
    cdef int i = 0
    with nogil:
        for i in prange(n):
            ...
    print i  # garbage when n == 0?
I think it may be, depending on the implementation. With libgomp it returns 0. With the check it should also return 0.
It would be in the interest of less semantic differences w.r.t. range to deal better with this case.
Will it silence the warning if we make "i" firstprivate as well as lastprivate? firstprivate would only affect the case of zero iterations, since we overwrite with NaN if the loop is entered...
Well, it wouldn't be NaN, it would be start + step * temp :) But, yes, that works. So we need both the check and an initialization in there:

    if (nsteps > 0) {
        i = 0;
        #pragma omp parallel for firstprivate(i) lastprivate(i)
        for (temp = 0; ...; ...)
            ...
    }

Now any subsequent read of 'i' will only issue a warning if 'i' is not initialized before the prange() by the user. So if you leave your index variable uninitialized (because you know in advance nsteps will be greater than zero), you'll still get a warning. But at least you will be able to shut up the compiler :)
Dag _______________________________________________ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
On 05/04/2011 01:30 PM, mark florisson wrote:
Well, it wouldn't be NaN, it would be start + step * temp :) But, yes,
Doh.
that works. So we need both the check and an initialization in there:
    if (nsteps > 0) {
        i = 0;
        #pragma omp parallel for firstprivate(i) lastprivate(i)
        for (temp = 0; ...; ...)
            ...
    }
Why do you need the if-test? Won't simply

    #pragma omp parallel for firstprivate(i) lastprivate(i)
    for (temp = 0; ...; ...)
        ...

do the job -- any initial value will be copied into all threads, including the "last" thread, even if there are no iterations?

Dag Sverre
On 4 May 2011 13:39, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Why do you need the if-test? Won't simply
    #pragma omp parallel for firstprivate(i) lastprivate(i)
    for (temp = 0; ...; ...)
        ...
do the job -- any initial value will be copied into all threads, including the "last" thread, even if there are no iterations?
It will, but you don't expect your iteration variable to change with zero iterations.
On 05/04/2011 01:41 PM, mark florisson wrote:
It will, but you don't expect your iteration variable to change with zero iterations.
Look.

    i = 42
    for i in prange(n):
        f(i)
    print i  # want 42 whenever n == 0

Now, translate this to:

    i = 42;
    #pragma omp parallel for firstprivate(i) lastprivate(i)
    for (temp = 0; ...; ...) {
        i = ...
    }
    #pragma omp parallel end
    /* At this point, i == 42 if n == 0 */

Am I missing something?

DS
On 4 May 2011 13:45, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
On 4 May 2011 13:47, mark florisson <markflorisson88@gmail.com> wrote:
On 4 May 2011 13:45, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:41 PM, mark florisson wrote:
On 4 May 2011 13:39, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:30 PM, mark florisson wrote:
On 4 May 2011 13:15, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 12:59 PM, mark florisson wrote: > > On 4 May 2011 12:45, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> > wrote: >> >> On 05/04/2011 12:00 PM, mark florisson wrote: >>> >>> There are two remaining issue. The first is warnings for potentially >>> uninitialized variables for prange(). When you do >>> >>> for i in prange(start, stop, step): ... >>> >>> it generates code like >>> >>> nsteps = (stop - start) / step; >>> #pragma omp parallel for lastprivate(i) >>> for (temp = 0; temp< nsteps; temp++) { >>> i = start + temp * step; >>> ... >>> } >>> >>> So here it will complain about 'i' being potentially uninitialized, >>> as >>> it might not be assigned to in the loop. However, simply assigning 0 >>> to 'i' can't work either, as you expect zero iterations not to touch >>> it. So for now, we have a bunch of warnings, as I don't see a >>> __attribute__ to suppress it selectively. >> >> Isn't this is orthogonal to OpenMP -- even if it said "range", your >> testcase >> could get such a warning? If so, the fix is simply to initialize i in >> your >> testcase code. > > No, the problem is that 'i' needs to be lastprivate, and 'i' is > assigned to in the loop body. It's irrelevant whether 'i' is assigned > to before the loop. I think this is the case because the spec says > that lastprivate variables will get the value of the private variable > of the last sequential iteration, but it cannot at compile time know > whether there might be zero iterations, which I believe the spec > doesn't have anything to say about. So basically we could guard > against it by checking if nsteps> 0, but the compiler doesn't > detect > this, so it will still issue a warning even if 'i' is initialized (the > warning is at the place of the lastprivate declaration).
Ah. But this is then more important than I initially thought it was. You are saying that this is the case:
cdef int i = 0 with nogil: for i in prange(n): ... print i # garbage when n == 0?
I think it may be, depending on the implementation. With libgomp it return 0. With the check it should also return 0.
It would be in the interest of less semantic differences w.r.t. range to deal better with this case.
Will it silence the warning if we make "i" firstprivate as well as lastprivate? firstprivate would only affect the case of zero iterations, since we overwrite with NaN if the loop is entered...
Well, it wouldn't be NaN, it would be start + step * temp :) But, yes,
Doh.
that works. So we need both the check and an initialization in there:
if (nsteps> 0) { i = 0; #pragma omp parallel for firstprivate(i) lastprivate(i) for (temp = 0; ...; ...) ... }
Why do you need the if-test? Won't simply
#pragma omp parallel for firstprivate(i) lastprivate(i) for (temp = 0; ...; ...) ...
do the job -- any initial value will be copied into all threads, including the "last" thread, even if there are no iterations?
It will, but you don't expect your iteration variable to change with zero iterations.
Look.
i = 42
for i in prange(n):
    f(i)
print i  # want 42 whenever n == 0
Now, translate this to:
i = 42;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (temp = 0; ...; ...) {
    i = ...
}
#pragma omp parallel end
/* At this point, i == 42 if n == 0 */
Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
And of course, if you initialize 'i' unconditionally, you change 'i' whereas you might have to leave it unaffected.
On 05/04/2011 01:48 PM, mark florisson wrote:
On 4 May 2011 13:47, mark florisson<markflorisson88@gmail.com> wrote:
On 4 May 2011 13:45, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Look.
i = 42
for i in prange(n):
    f(i)
print i  # want 42 whenever n == 0
Now, translate this to:
i = 42;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (temp = 0; ...; ...) {
    i = ...
}
#pragma omp parallel end
/* At this point, i == 42 if n == 0 */
Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
This I don't see. I think I need to be spoon-fed on this one.
And of course, if you initialize 'i' unconditionally, you change 'i' whereas you might have to leave it unaffected.
This I see. Dag Sverre
On 4 May 2011 13:54, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:48 PM, mark florisson wrote:
On 4 May 2011 13:47, mark florisson<markflorisson88@gmail.com> wrote:
On 4 May 2011 13:45, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Look.
i = 42
for i in prange(n):
    f(i)
print i  # want 42 whenever n == 0
Now, translate this to:
i = 42;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (temp = 0; ...; ...) {
    i = ...
}
#pragma omp parallel end
/* At this point, i == 42 if n == 0 */
Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
This I don't see. I think I need to be spoon-fed on this one.
So assume this code

    cdef int i
    for i in prange(10): ...

Now if we transform this without the guard we get

    int i;
    #pragma omp parallel for firstprivate(i) lastprivate(i)
    for (...) { ... }

This is invalid C code, but valid Cython code. So we need to initialize 'i', but then we get our "leave it unaffected for 0 iterations" paradox. So we need a guard.
And of course, if you initialize 'i' unconditionally, you change 'i' whereas you might have to leave it unaffected.
This I see.
Dag Sverre

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
On 05/04/2011 01:59 PM, mark florisson wrote:
On 4 May 2011 13:54, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:48 PM, mark florisson wrote:
On 4 May 2011 13:47, mark florisson<markflorisson88@gmail.com> wrote:
On 4 May 2011 13:45, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Look.
i = 42
for i in prange(n):
    f(i)
print i  # want 42 whenever n == 0
Now, translate this to:
i = 42;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (temp = 0; ...; ...) {
    i = ...
}
#pragma omp parallel end
/* At this point, i == 42 if n == 0 */
Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
This I don't see. I think I need to be spoon-fed on this one.
So assume this code
cdef int i
for i in prange(10): ...
Now if we transform this without the guard we get
int i;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (...) { ... }
This is invalid C code, but valid Cython code. So we need to initialize 'i', but then we get our "leave it unaffected for 0 iterations" paradox. So we need a guard.
You mean C code won't compile if i is firstprivate and not initialized? (Sorry, I'm not aware of such things.)

My first instinct is to initialize i to 0xbadabada. After all, its value is not specified -- we're not violating any Cython specs by initializing it to garbage ourselves.

OTOH, I see that your approach with an if-test is more Valgrind-friendly, so I'm OK with that. Would it work to do

    if (nsteps > 0) {
        #pragma omp parallel
        i = 0;
        #pragma omp for lastprivate(i)
        for (temp = 0; ...)
            ...
        ...
    }

instead, to get rid of the warning without using a firstprivate? Not sure if there's an efficiency difference here, I suppose a good C compiler could compile them to the same thing.

Dag Sverre
On 4 May 2011 14:10, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:59 PM, mark florisson wrote:
On 4 May 2011 13:54, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:48 PM, mark florisson wrote:
On 4 May 2011 13:47, mark florisson<markflorisson88@gmail.com> wrote:
On 4 May 2011 13:45, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Look.
i = 42
for i in prange(n):
    f(i)
print i  # want 42 whenever n == 0
Now, translate this to:
i = 42;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (temp = 0; ...; ...) {
    i = ...
}
#pragma omp parallel end
/* At this point, i == 42 if n == 0 */
Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
This I don't see. I think I need to be spoon-fed on this one.
So assume this code
cdef int i
for i in prange(10): ...
Now if we transform this without the guard we get
int i;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (...) { ... }
This is invalid C code, but valid Cython code. So we need to initialize 'i', but then we get our "leave it unaffected for 0 iterations" paradox. So we need a guard.
You mean C code won't compile if i is firstprivate and not initialized? (Sorry, I'm not aware of such things.)
It will compile and warn, but it is technically invalid, as you're reading an uninitialized variable, which has undefined behavior. If e.g. the variable contains a trap representation on a certain architecture, it might halt the program (I'm not sure which architecture that would be, but I believe they exist).
My first instinct is to initialize i to 0xbadabada. After all, its value is not specified -- we're not violating any Cython specs by initializing it to garbage ourselves.
The problem is that we don't know whether the user has initialized the variable. So if we want firstprivate to suppress warnings, we should assume that the user hasn't and do it ourselves.
OTOH, I see that your approach with an if-test is more Valgrind-friendly, so I'm OK with that.
Would it work to do
if (nsteps > 0) {
    #pragma omp parallel
    i = 0;
    #pragma omp for lastprivate(i)
    for (temp = 0; ...)
        ...
    ...
}
I'm assuming you mean #pragma omp parallel private(i), otherwise you have a race (I'm not sure how much that matters for assignment). In any case, with the private() clause 'i' would be uninitialized afterwards. In either case it won't do anything useful.
instead, to get rid of the warning without using a firstprivate? Not sure if there's an efficiency difference here, I suppose a good C compiler could compile them to the same thing.
Dag Sverre
On 4 May 2011 14:17, mark florisson <markflorisson88@gmail.com> wrote:
On 4 May 2011 14:10, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:59 PM, mark florisson wrote:
On 4 May 2011 13:54, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:48 PM, mark florisson wrote:
On 4 May 2011 13:47, mark florisson<markflorisson88@gmail.com> wrote:
On 4 May 2011 13:45, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
> Look.
>
> i = 42
> for i in prange(n):
>     f(i)
> print i  # want 42 whenever n == 0
>
> Now, translate this to:
>
> i = 42;
> #pragma omp parallel for firstprivate(i) lastprivate(i)
> for (temp = 0; ...; ...) {
>     i = ...
> }
> #pragma omp parallel end
> /* At this point, i == 42 if n == 0 */
>
> Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
This I don't see. I think I need to be spoon-fed on this one.
So assume this code
cdef int i
for i in prange(10): ...
Now if we transform this without the guard we get
int i;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (...) { ... }
This is invalid C code, but valid Cython code. So we need to initialize 'i', but then we get our "leave it unaffected for 0 iterations" paradox. So we need a guard.
You mean C code won't compile if i is firstprivate and not initialized? (Sorry, I'm not aware of such things.)
It will compile and warn, but it is technically invalid, as you're reading an uninitialized variable, which has undefined behavior. If e.g. the variable contains a trap representation on a certain architecture, it might halt the program (I'm not sure which architecture that would be, but I believe they exist).
My first instinct is to initialize i to 0xbadabada. After all, its value is not specified -- we're not violating any Cython specs by initializing it to garbage ourselves.
The problem is that we don't know whether the user has initialized the variable. So if we want firstprivate to suppress warnings, we should assume that the user hasn't and do it ourselves.
The alternative would be to give 'cdef int i' initialized semantics, to whatever value we please. So instead of generating 'int i;' code, we could always generate 'int i = ...;'. But currently we don't do that.
OTOH, I see that your approach with an if-test is more Valgrind-friendly, so I'm OK with that.
Would it work to do
if (nsteps > 0) {
    #pragma omp parallel
    i = 0;
    #pragma omp for lastprivate(i)
    for (temp = 0; ...)
        ...
    ...
}
I'm assuming you mean #pragma omp parallel private(i), otherwise you have a race (I'm not sure how much that matters for assignment). In any case, with the private() clause 'i' would be uninitialized afterwards. In either case it won't do anything useful.
instead, to get rid of the warning without using a firstprivate? Not sure if there's an efficiency difference here, I suppose a good C compiler could compile them to the same thing.
Dag Sverre
On 05/04/2011 02:17 PM, mark florisson wrote:
On 4 May 2011 14:10, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:59 PM, mark florisson wrote:
On 4 May 2011 13:54, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:48 PM, mark florisson wrote:
On 4 May 2011 13:47, mark florisson<markflorisson88@gmail.com> wrote:
On 4 May 2011 13:45, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
> Look.
>
> i = 42
> for i in prange(n):
>     f(i)
> print i  # want 42 whenever n == 0
>
> Now, translate this to:
>
> i = 42;
> #pragma omp parallel for firstprivate(i) lastprivate(i)
> for (temp = 0; ...; ...) {
>     i = ...
> }
> #pragma omp parallel end
> /* At this point, i == 42 if n == 0 */
>
> Am I missing something?
Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid code). So if nsteps > 0, we need to initialize 'i' to something to get correct behaviour with firstprivate.
This I don't see. I think I need to be spoon-fed on this one.
So assume this code
cdef int i
for i in prange(10): ...
Now if we transform this without the guard we get
int i;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (...) { ... }
This is invalid C code, but valid Cython code. So we need to initialize 'i', but then we get our "leave it unaffected for 0 iterations" paradox. So we need a guard.
You mean C code won't compile if i is firstprivate and not initialized? (Sorry, I'm not aware of such things.)
It will compile and warn, but it is technically invalid, as you're reading an uninitialized variable, which has undefined behavior. If e.g. the variable contains a trap representation on a certain architecture, it might halt the program (I'm not sure which architecture that would be, but I believe they exist).
My first instinct is to initialize i to 0xbadabada. After all, its value is not specified -- we're not violating any Cython specs by initializing it to garbage ourselves.
The problem is that we don't know whether the user has initialized the variable. So if we want firstprivate to suppress warnings, we should assume that the user hasn't and do it ourselves.
I meant that if we don't care about Valgrindability, we can initialize i at the top of our function (i.e. where it says "int __pyx_v_i").
OTOH, I see that your approach with an if-test is more Valgrind-friendly, so I'm OK with that.
Would it work to do
if (nsteps > 0) {
    #pragma omp parallel
    i = 0;
    #pragma omp for lastprivate(i)
    for (temp = 0; ...)
        ...
    ...
}
I'm assuming you mean #pragma omp parallel private(i), otherwise you have a race (I'm not sure how much that matters for assignment). In any case, with the private() clause 'i' would be uninitialized afterwards. In either case it won't do anything useful.
Sorry, I meant that lastprivate(i) should go on the parallel line:

    if (nsteps > 0) {
        #pragma omp parallel lastprivate(i)
        i = 0;
        #pragma omp for
        for (temp = 0; ...)
            ...
        ...
    }

Won't this silence the warning? At any rate, it's obvious you have a better handle on this than me, so I'll shut up now and leave you to it :-)

Dag Sverre
On 4 May 2011 14:23, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 02:17 PM, mark florisson wrote:
On 4 May 2011 14:10, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:59 PM, mark florisson wrote:
On 4 May 2011 13:54, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 01:48 PM, mark florisson wrote:
On 4 May 2011 13:47, mark florisson <markflorisson88@gmail.com> wrote:
> On 4 May 2011 13:45, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
>> Look.
>>
>> i = 42
>> for i in prange(n):
>>     f(i)
>> print i  # want 42 whenever n == 0
>>
>> Now, translate this to:
>>
>> i = 42;
>> #pragma omp parallel for firstprivate(i) lastprivate(i)
>> for (temp = 0; ...; ...) {
>>     i = ...
>> }
>> #pragma omp parallel end
>> /* At this point, i == 42 if n == 0 */
>>
>> Am I missing something?
>
> Yes, 'i' may be uninitialized with nsteps > 0 (this should be valid
> code). So if nsteps > 0, we need to initialize 'i' to something to get
> correct behaviour with firstprivate.
This I don't see. I think I need to be spoon-fed on this one.
So assume this code
cdef int i
for i in prange(10): ...
Now if we transform this without the guard we get
int i;
#pragma omp parallel for firstprivate(i) lastprivate(i)
for (...) { ... }
This is invalid C code, but valid Cython code. So we need to initialize 'i', but then we get our "leave it unaffected for 0 iterations" paradox. So we need a guard.
You mean C code won't compile if i is firstprivate and not initialized? (Sorry, I'm not aware of such things.)
It will compile and warn, but it is technically invalid, as you're reading an uninitialized variable, which has undefined behavior. If e.g. the variable contains a trap representation on a certain architecture, it might halt the program (I'm not sure which architecture that would be, but I believe they exist).
My first instinct is to initialize i to 0xbadabada. After all, its value is not specified -- we're not violating any Cython specs by initializing it to garbage ourselves.
The problem is that we don't know whether the user has initialized the variable. So if we want firstprivate to suppress warnings, we should assume that the user hasn't and do it ourselves.
I meant that if we don't care about Valgrindability, we can initialize i at the top of our function (i.e. where it says "int __pyx_v_i").
Indeed, but as the current semantics don't do this, I think we also shouldn't. The good thing is that if we don't do it, the user will see warnings from the C compiler if used uninitialized.
OTOH, I see that your approach with an if-test is more Valgrind-friendly, so I'm OK with that.
Would it work to do
if (nsteps > 0) {
    #pragma omp parallel
    i = 0;
    #pragma omp for lastprivate(i)
    for (temp = 0; ...)
        ...
    ...
}
I'm assuming you mean #pragma omp parallel private(i), otherwise you have a race (I'm not sure how much that matters for assignment). In any case, with the private() clause 'i' would be uninitialized afterwards. In either case it won't do anything useful.
Sorry, I meant that lastprivate(i) should go on the parallel line.
if (nsteps > 0) {
    #pragma omp parallel lastprivate(i)
    i = 0;
    #pragma omp for
    for (temp = 0; ...)
        ...
    ...
}
won't this silence the warning? At any rate, it's obvious you have a better handle on this than me, so I'll shut up now and leave you to it :-)
lastprivate() is not valid on a plain parallel construct, as it's not a loop. There's only private() and shared().
Moving pull request discussion (https://github.com/cython/cython/pull/28) over here:

First, I got curious why you'd strip off "-pthread" from CC. I'd think you could just execute it with "-pthread", which seems simpler.

Second: If parallel.parallel is not callable, how are scheduling parameters for parallel blocks handled? Is there a reason to not support that? Do you think it should stay this way, or will parallel take parameters in the future?

Dag Sverre
On 4 May 2011 18:35, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
Moving pull request discussion (https://github.com/cython/cython/pull/28) over here:
First, I got curious why you'd strip off "-pthread" from CC. I'd think you could just execute it with "-pthread", which seems simpler.
It needs to end up in a list of arguments, and it's not needed at all as I only need the version. I guess I could do (cc + " -v").split() but eh.
Second: If parallel.parallel is not callable, how are scheduling parameters for parallel blocks handled? Is there a reason to not support that? Do you think it should stay this way, or will parallel take parameters in the future?
Well, as I mentioned a while back, you cannot schedule parallel blocks, as there is no worksharing involved. All a parallel block does is execute a code block in however many threads are available. The scheduling parameters are valid for a worksharing for loop only, as you schedule (read "distribute") the work among the threads.
Dag Sverre
On 05/04/2011 07:03 PM, mark florisson wrote:
On 4 May 2011 18:35, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Moving pull request discussion (https://github.com/cython/cython/pull/28) over here:
First, I got curious why you'd strip off "-pthread" from CC. I'd think you could just execute it with "-pthread", which seems simpler.
It needs to end up in a list of arguments, and it's not needed at all as I only need the version. I guess I could do (cc + " -v").split() but eh.
OK, that's reassuring, thought perhaps you had encountered a strange gcc strain.
Second: If parallel.parallel is not callable, how are scheduling parameters for parallel blocks handled? Is there a reason to not support that? Do you think it should stay this way, or will parallel take parameters in the future?
Well, as I mentioned a while back, you cannot schedule parallel blocks, as there is no worksharing involved. All a parallel block does is execute a code block in however many threads are available. The scheduling parameters are valid for a worksharing for loop only, as you schedule (read "distribute") the work among the threads.
Perhaps I used the wrong terms; but checking the specs, I guess I meant "num_threads", which definitely applies to parallel. Dag Sverre
On 4 May 2011 19:44, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 07:03 PM, mark florisson wrote:
On 4 May 2011 18:35, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Moving pull request discussion (https://github.com/cython/cython/pull/28) over here:
First, I got curious why you'd strip off "-pthread" from CC. I'd think you could just execute it with "-pthread", which seems simpler.
It needs to end up in a list of arguments, and it's not needed at all as I only need the version. I guess I could do (cc + " -v").split() but eh.
OK, that's reassuring, thought perhaps you had encountered a strange gcc strain.
Second: If parallel.parallel is not callable, how are scheduling parameters for parallel blocks handled? Is there a reason to not support that? Do you think it should stay this way, or will parallel take parameters in the future?
Well, as I mentioned a while back, you cannot schedule parallel blocks, as there is no worksharing involved. All a parallel block does is execute a code block in however many threads are available. The scheduling parameters are valid for a worksharing for loop only, as you schedule (read "distribute") the work among the threads.
Perhaps I used the wrong terms; but checking the specs, I guess I meant "num_threads", which definitely applies to parallel.
Ah, that level of scheduling :) Right, so it doesn't take that, but I don't think it's a big issue. If dynamic scheduling is enabled, it's only a suggestion; if dynamic scheduling is disabled (whether it's turned on or off by default is implementation defined) it will give the number of threads requested, if available. The user can still use omp_set_num_threads(), although admittedly that modifies a global setting.
Dag Sverre
On 05/04/2011 08:07 PM, mark florisson wrote:
On 4 May 2011 19:44, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 07:03 PM, mark florisson wrote:
On 4 May 2011 18:35, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Moving pull request discussion (https://github.com/cython/cython/pull/28) over here:
First, I got curious why you'd strip off "-pthread" from CC. I'd think you could just execute it with "-pthread", which seems simpler.
It needs to end up in a list of arguments, and it's not needed at all as I only need the version. I guess I could do (cc + " -v").split() but eh.
OK, that's reassuring, thought perhaps you had encountered a strange gcc strain.
Second: If parallel.parallel is not callable, how are scheduling parameters for parallel blocks handled? Is there a reason to not support that? Do you think it should stay this way, or will parallel take parameters in the future?
Well, as I mentioned a while back, you cannot schedule parallel blocks, as there is no worksharing involved. All a parallel block does is execute a code block in however many threads are available. The scheduling parameters are valid for a worksharing for loop only, as you schedule (read "distribute") the work among the threads.
Perhaps I used the wrong terms; but checking the specs, I guess I meant "num_threads", which definitely applies to parallel.
Ah, that level of scheduling :) Right, so it doesn't take that, but I don't think it's a big issue. If dynamic scheduling is enabled, it's only a suggestion; if dynamic scheduling is disabled (whether it's turned on or off by default is implementation defined) it will give the number of threads requested, if available. The user can still use omp_set_num_threads(), although admittedly that modifies a global setting.
Hmm... I'm not completely happy about this. For now I just worry about not shutting off the possibility of adding thread-pool-spawning parameters in the future. Specifying the number of threads can be useful, and omp_set_num_threads is a bad way of doing it, as you say.

And other backends than OpenMP may call for something we don't know what is yet?

Anyway, all I'm asking is whether we should require a trailing () on parallel:

    with nogil, parallel():
        ...

I think we should, to keep the window open for options. Unless, that is, we're OK both with and without the trailing () down the line.

Dag Sverre
On 4 May 2011 21:13, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 08:07 PM, mark florisson wrote:
On 4 May 2011 19:44, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
On 05/04/2011 07:03 PM, mark florisson wrote:
On 4 May 2011 18:35, Dag Sverre Seljebotn<d.s.seljebotn@astro.uio.no> wrote:
Moving pull request discussion (https://github.com/cython/cython/pull/28) over here:
First, I got curious why you'd strip off "-pthread" from CC. I'd think you could just execute it with "-pthread", which seems simpler.
It needs to end up in a list of arguments, and it's not needed at all as I only need the version. I guess I could do (cc + " -v").split() but eh.
OK, that's reassuring, thought perhaps you had encountered a strange gcc strain.
Second: If parallel.parallel is not callable, how are scheduling parameters for parallel blocks handled? Is there a reason to not support that? Do you think it should stay this way, or will parallel take parameters in the future?
Well, as I mentioned a while back, you cannot schedule parallel blocks, as there is no worksharing involved. All a parallel block does is execute a code block in however many threads are available. The scheduling parameters are valid for a worksharing for loop only, as you schedule (read "distribute") the work among the threads.
Perhaps I used the wrong terms; but checking the specs, I guess I meant "num_threads", which definitely applies to parallel.
Ah, that level of scheduling :) Right, so it doesn't take that, but I don't think it's a big issue. If dynamic scheduling is enabled, it's only a suggestion; if dynamic scheduling is disabled (whether it's turned on or off by default is implementation defined) it will give the number of threads requested, if available. The user can still use omp_set_num_threads(), although admittedly that modifies a global setting.
Hmm... I'm not completely happy about this. For now I just worry about not shutting off the possibility of adding thread-pool-spawning parameters in the future. Specifying the number of threads can be useful, and omp_set_num_threads is a bad way of doing it, as you say.
And other backends than OpenMP may call for something we don't know what is yet?
Anyway, all I'm asking is whether we should require trailing () on parallel:
with nogil, parallel():
    ...
I think we should, to keep the window open for options. Unless, that is, we're OK both with and without trailing () down the line.
Ok, sure, that's fine with me.
> as it doesn't cause unintended consequences.
>
> I don't really think it will -- the important thing is that the order
> of loop iteration evaluation must be unimportant. And that is still true
> (for the outer loop, as well as for the inner) in your first example.
>
> Question: When you have nested pranges, what will happen is that two nested
> OpenMP parallel blocks are used, right? And do you know if there is complete
> freedom/"reentrancy" in that variables that are thread-private in an outer
> parallel block can be shared in an inner one, and vice versa?

An implementation may or may not support it, and if it is supported the behaviour can be configured through omp_set_nested(). So we should consider the case where it is supported and enabled.

If you have a lastprivate or reduction, then after the loop these are (reduced and) assigned to the original variable. So if that happens inside a parallel construct which does not declare the variable private to the construct, you actually have a race. So e.g. the nested prange currently races in the outer parallel range.

> If so I'd think that this algorithm should work and feel natural:
>
> - In each prange, for the purposes of variable private/shared/reduction
>   inference, consider all internal "prange" just as if they had been
>   "range"; no special treatment.
>
> - Recurse to children pranges.

Right, that is most natural. Algorithmically, reductions and lastprivates (as those can have races if placed in inner parallel constructs) propagate outwards towards the outermost parallel block, or up to the first parallel with block, or up to the first construct that already determined the sharing attribute. E.g.:

    with parallel:
        with parallel:
            for i in prange(n):
                for j in prange(n):
                    sum += i * j
            # sum is well-defined here
        # sum is undefined here

Here 'sum' is a reduction for the two innermost loops. 'sum' is not private for the inner parallel with block, as a prange in a parallel with block is a worksharing loop.
(apologies for top post)

This all seems to scream 'disallow' to me, in particular since some OpenMP implementations may not support it etc.

At any rate I feel 'parallel/parallel/prange/prange' is going too far; so the next step could be to only allow 'parallel/prange/parallel/prange'. But really, my feeling is that if you really do need this then you can always write a separate function for the inner loop (I honestly can't think of a usecase anyway...). So I'd really drop it; at least until the rest of the GSoC project is completed :)

DS

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

mark florisson <markflorisson88@gmail.com> wrote:

On 16 April 2011 18:42, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
> (Moving discussion from http://markflorisson.wordpress.com/, where Mark said:)

Ok, sure, it was just an issue I was wondering about at that moment, but it's a tricky issue, so thanks.

> """
> Started a new branch https://github.com/markflorisson88/cython/tree/openmp .
>
> Now the question is whether sharing attributes should be propagated
> outwards. e.g. if you do
>
>     for i in prange(m):
>         for j in prange(n):
>             sum += i * j
>
> then ‘sum’ is a reduction for the inner parallel loop, but not for the outer
> one. So the user would currently have to rewrite this to
>
>     for i in prange(m):
>         for j in prange(n):
>             sum += i * j
>         sum += 0
>
> which seems a bit silly. Of course, we could just disable nested
> parallelism, or tell the users to use a prange and a ‘for from’ in such
> cases.
> """
>
> Dag: Interesting.

The first one is definitely the behaviour we want, as long as that binds to that parallel with block. However, the outermost parallel with block declares sum (and i and j) private, so after that block all those variables become undefined. However, in the outermost parallel with block, sum will have to be initialized to 0 before anything else, or be declared firstprivate, otherwise 'sum' is undefined to begin with.
Do you think declaring it firstprivate would be the way to go, or should we make it private and issue a warning or perhaps even an error?

> DS
On 18 April 2011 16:01, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
(apologies for top post)
No problem, it means I have to scroll less :)
This all seems to scream 'disallow' to me, in particular since some OpenMP implementations may not support it etc.
At any rate I feel 'parallel/parallel/prange/prange' is going too far; so the next step could be to only allow 'parallel/prange/parallel/prange'.
But really, my feeling is that if you really do need this then you can always write a separate function for the inner loop (I honestly can't think of a usecase anyway...). So I'd really drop it; at least until the rest of the GSoC project is completed :)
Ok, sure, I'll disallow it. Then the user won't be able to make mistakes and I don't have to detect the case and issue a warning for inner reductions or lastprivates :).
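The "separate function for the inner loop" workaround mentioned in the quote can be sketched in plain Python (a structural analogue only, not Cython syntax): hoisting the inner loop into its own function leaves at most one parallel loop per function body, which is the shape that remains expressible once nested prange is disallowed.

```python
def inner_sum(i, n):
    # In Cython this helper could stay sequential, or be parallelized
    # on its own; the point is that nesting is hidden behind a call.
    s = 0
    for j in range(n):
        s += i * j
    return s

def outer_sum(m, n):
    # Only this loop would become a prange, with `total` as the reduction.
    total = 0
    for i in range(m):
        total += inner_sum(i, n)
    return total

assert outer_sum(10, 10) == sum(i * j for i in range(10) for j in range(10))
```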
> DS
mark florisson <markflorisson88@gmail.com> wrote:
On 16 April 2011 18:42, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
> (Moving discussion from http://markflorisson.wordpress.com/, where Mark said:)

Ok, sure, it was just an issue I was wondering about at that moment, but it's a tricky issue, so thanks.

> """
> Started a new branch https://github.com/markflorisson88/cython/tree/openmp .
>
> Now the question is whether sharing attributes should be propagated outwards. e.g. if you do
>
>     for i in prange(m):
>         for j in prange(n):
>             sum += i * j
>
> then 'sum' is a reduction for the inner parallel loop, but not for the outer one. So the user would currently have to rewrite this to
>
>     for i in prange(m):
>         for j in prange(n):
>             sum += i * j
>         sum += 0
>
> which seems a bit silly. Of course, we could just disable nested parallelism, or tell the users to use a prange and a 'for from' in such cases.
> """
>
> Dag: Interesting. The first one is definitely the behaviour we want, as long as it doesn't cause unintended consequences.
>
> I don't really think it will -- the important thing is that the order of loop iteration evaluation must be unimportant. And that is still true (for the outer loop, as well as for the inner) in your first example.
>
> Question: When you have nested pranges, what will happen is that two nested OpenMP parallel blocks are used, right? And do you know if there is complete freedom/"reentrancy" in that variables that are thread-private in an outer parallel block can be shared in an inner one, and vice versa?

An implementation may or may not support it, and if it is supported the behaviour can be configured through omp_set_nested(). So we should consider the case where it is supported and enabled. If you have a lastprivate or reduction, then after the loop these are (reduced and) assigned to the original variable. So if that happens inside a parallel construct which does not declare the variable private to the construct, you actually have a race. So e.g. the nested prange currently races in the outer parallel range.

> If so I'd think that this algorithm should work and feel natural:
>
> - In each prange, for the purposes of variable private/shared/reduction inference, consider all internal "prange" just as if they had been "range"; no special treatment.
> - Recurse to children pranges.

Right, that is most natural. Algorithmically, reductions and lastprivates (as those can have races if placed in inner parallel constructs) propagate outwards towards the outermost parallel block, or up to the first parallel with block, or up to the first construct that already determined the sharing attribute. e.g.

    with parallel:
        with parallel:
            for i in prange(n):
                for j in prange(n):
                    sum += i * j
        # sum is well-defined here
    # sum is undefined here

Here 'sum' is a reduction for the two innermost loops. 'sum' is not private for the inner parallel with block, as a prange in a parallel with block is a worksharing loop that binds to that parallel with block. However, the outermost parallel with block declares sum (and i and j) private, so after that block all those variables become undefined. In the outermost parallel with block, sum will have to be initialized to 0 before anything else, or be declared firstprivate, otherwise 'sum' is undefined to begin with. Do you think declaring it firstprivate would be the way to go, or should we make it private and issue a warning or perhaps even an error?

> DS
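The private-versus-firstprivate question can be made concrete with a plain-Python analogue (the function and mode names here are illustrative, not Cython or OpenMP API): a `private` variable gives each thread a fresh, undefined copy that the thread must initialize before use, while `firstprivate` initializes each thread's copy from the value the variable held before the parallel region.

```python
import threading

_UNDEFINED = object()  # sentinel standing in for an uninitialized private copy

def run_parallel(outer_value, mode, num_threads=4):
    """Illustrative model of OpenMP data-sharing clauses:
    mode='private'      -> each thread starts from an undefined value
    mode='firstprivate' -> each thread starts from a copy of outer_value"""
    results = [None] * num_threads
    def worker(tid):
        s = _UNDEFINED if mode == "private" else outer_value
        if s is _UNDEFINED:
            s = 0  # a private variable must be initialized inside the region
        for j in range(5):
            s += tid * j
        results[tid] = s  # each thread's copy is independent of the others
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# firstprivate: every copy starts at 100; private: every copy starts undefined (here, 0)
assert run_parallel(100, "firstprivate") == [100, 110, 120, 130]
assert run_parallel(100, "private") == [0, 10, 20, 30]
```

This is why, in the example above, `sum` either has to be set to 0 before anything else in the outermost parallel with block, or declared firstprivate: otherwise each thread reads an undefined value.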
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
participants (3)
- Dag Sverre Seljebotn
- mark florisson
- Robert Bradshaw