hi,

Temporary arrays generated in expressions are expensive as they imply extra memory bandwidth, which is the bottleneck in most numpy operations. For example:

r = a + b + c

creates the b + c temporary and then adds a to it. This can be rewritten to be more efficient using inplace operations:

r = b + c
r += a

This saves some memory bandwidth and can speed up the operation by 50% for very large arrays, or even more if the inplace operation allows it to be completed entirely in the CPU cache.

The problem is that inplace operations are a lot less readable, so they are often only used in well-optimized code. But due to Python's refcounting semantics we can actually do some of these inplace conversions transparently: if an operand in Python has a reference count of one it must be a temporary, so we can use it as the destination array. CPython itself does this optimization for string concatenation.

In numpy we have the issue that we can be called from the C API directly, where the reference count may be one for other reasons. To solve this we can check the backtrace up to the Python frame evaluation function. If there are only numpy and Python functions between that and our entry point, we should be able to elide the temporary.

This PR implements this: https://github.com/numpy/numpy/pull/7997

It currently only supports Linux with glibc (which has reliable backtraces via unwinding) and maybe MacOS, depending on how good their backtrace is. On Windows the backtrace APIs are different and I don't know them, but in theory it could also be done there.

A problem is that checking the backtrace is quite expensive, so it should only be enabled when the involved arrays are large enough for it to be worthwhile. In my testing this seems to be around 180-300 KiB sized arrays, basically where they start spilling out of the CPU L2 cache.

I made a little crappy benchmark script to test this cutoff in this branch:
https://github.com/juliantaylor/numpy/tree/elide-bench

If you are interested you can run it with:

python setup.py build_ext -j 4 --inplace
ipython --profile=null check.ipy

At the end it will plot the ratio between elided and non-elided runtime. It should get larger than one around 180 KiB on most CPUs.

If no one points out some flaw in the approach, I'm hoping to get this into the next numpy version.

cheers,
Julian
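For anyone who wants a feel for the effect without building the branch, here is a rough timing sketch of the difference the manual inplace form makes on large arrays (array size and numbers are illustrative and machine-dependent; this does not require the PR):

import numpy as np
from timeit import timeit

n = 10_000_000  # large enough to spill out of the CPU caches
a, b, c = (np.random.random(n) for _ in range(3))

def with_temporary():
    return a + b + c        # b + c allocates a temporary, then a is added to it

def inplace():
    r = b + c
    r += a                  # reuses r, no second temporary
    return r

print("temporary:", timeit(with_temporary, number=50))
print("inplace:  ", timeit(inplace, number=50))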
On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
hi, Temporary arrays generated in expressions are expensive as they imply extra memory bandwidth which is the bottleneck in most numpy operations. For example:
r = a + b + c
creates the b + c temporary and then adds a to it. This can be rewritten to be more efficient using inplace operations:
r = b + c
r += a
general question (I wouldn't understand the details even if I looked): how is this affected by broadcasting and type promotion? One of the main reasons I don't like to use inplace operations in general is that I'm often not sure when type promotion occurs and when arrays expand during broadcasting. For example, b + c is 1-D, a is 2-D, and r has the broadcast shape. Another case where I switch away from inplace operations is when b + c is int or bool and a is float. Thankfully, we get error messages for casting now.
This saves some memory bandwidth and can speed up the operation by 50% for very large arrays, or even more if the inplace operation allows it to be completed entirely in the CPU cache.
I didn't realize the difference can be so large. That would make streamlining some code worth the effort. Josef
On 30.09.2016 23:09, josef.pktd@gmail.com wrote:
On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
hi, Temporary arrays generated in expressions are expensive as they imply extra memory bandwidth which is the bottleneck in most numpy operations. For example:
r = a + b + c
creates the b + c temporary and then adds a to it. This can be rewritten to be more efficient using inplace operations:
r = b + c
r += a
general question (I wouldn't understand the details even if I looked): how is this affected by broadcasting and type promotion?
One of the main reasons I don't like to use inplace operations in general is that I'm often not sure when type promotion occurs and when arrays expand during broadcasting.
For example, b + c is 1-D, a is 2-D, and r has the broadcast shape. Another case where I switch away from inplace operations is when b + c is int or bool and a is float. Thankfully, we get error messages for casting now.
The temporary is only avoided when the casting follows the safe rule, so it should be the same as what you get without inplace operations. E.g. a float32 temporary + float64 will not be converted to the unsafe float32 += float64 that a normal inplace operation would allow; but float64-temp + float32 is transformed.
Currently the only broadcasting that will be transformed is temporary + scalar value; otherwise it only works on matching array sizes. There is not really anything that prevents full broadcasting, but it's not implemented yet in the PR.
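As a rough illustration of the safe-casting rule described above (just a sketch of which cast directions numpy considers safe; the actual elision logic lives in the PR):

import numpy as np

# float64-temp + float32: the float32 operand casts safely to float64,
# so the float64 temporary can be reused as the output.
print(np.can_cast(np.float32, np.float64, casting='safe'))   # True

# float32-temp + float64: writing a float64 result into the float32
# temporary would be an unsafe cast, so no elision happens here.
print(np.can_cast(np.float64, np.float32, casting='safe'))   # False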
Julian,

This is really, really cool! I have been wanting something like this for years (over a decade? wow!), but always thought it would require hacking the interpreter to intercept operations. This is a really inspired idea, and could buy numpy a lot of performance.

I'm afraid I can't say much about the implementation details -- but great work!

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
Just thinking aloud, an idea I had recently takes a different approach. The problem with temporaries isn't so much that they exist, but rather that they keep being malloc'ed and cleared. What if numpy kept a small LRU cache of weakref'ed temporaries? Whenever a new numpy array is requested, numpy could see if there is already one in its cache of matching size and use it. If you think about it, expressions that result in many temporaries would quite likely have many of them being the same size in memory.

Don't know how feasible it would be to implement though.

Cheers!
Ben Root
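A Python-level toy of the general shape of this idea (purely illustrative, and not quite the proposal above: that version would use weakref'ed raw memory blocks inside numpy's allocator, whereas this toy keeps strong references, so it actually holds on to the memory -- the concern Julian raises below):

import numpy as np
from collections import OrderedDict

class ArrayPool:
    """Tiny LRU pool of reusable buffers, keyed by (shape, dtype)."""
    def __init__(self, maxsize=8):
        self._pool = OrderedDict()
        self.maxsize = maxsize

    def empty(self, shape, dtype=np.float64):
        key = (tuple(shape), np.dtype(dtype))
        if key in self._pool:
            return self._pool.pop(key)        # reuse a cached buffer
        return np.empty(shape, dtype)

    def release(self, arr):
        self._pool[(arr.shape, arr.dtype)] = arr
        if len(self._pool) > self.maxsize:    # evict the oldest entry
            self._pool.popitem(last=False)

pool = ArrayPool()
tmp = pool.empty((1000, 1000))    # fresh buffer
pool.release(tmp)                 # hand it back
tmp2 = pool.empty((1000, 1000))   # the same buffer is reused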
The problem with this approach is that we don't really want numpy holding on to hundreds of megabytes of memory by default, so it would need to be a user option. A context manager could work too, but it would probably lead to premature optimization.
Very new Linux versions (4.6+) now finally support MADV_FREE, which gives memory back to the system but does not require refaulting it if nothing else needed it. So this might be an option now. But libc implementations will probably use that at some point too, and then numpy doesn't need to do this.
On Mon, Oct 3, 2016 at 3:16 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
the problem with this approach is that we don't really want numpy holding on to hundreds of megabytes of memory by default so it would need to be a user option.
indeed -- but one could set an LRU cache to be very small (a few items, not small memory), and then it gets used within expressions but doesn't hold on to much outside of expressions.

However, is the allocation the only (or even biggest) source of the performance hit? If you generate a temporary as a result of an operation, rather than doing it in-place, that temporary needs to be allocated, but it also means that an additional array needs to be pushed through the processor -- and that can make a big performance difference too.

I'm not entirely sure how to profile this correctly, but this seems to indicate that the allocation is cheap compared to the operations (for a million-element array):

* Regular old temporary creation

In [24]: def f1(arr1, arr2):
    ...:     result = arr1 + arr2
    ...:     return result

In [26]: %timeit f1(arr1, arr2)
1000 loops, best of 3: 1.13 ms per loop

* Completely in-place, no allocation of an extra array

In [27]: def f2(arr1, arr2):
    ...:     arr1 += arr2
    ...:     return arr1

In [28]: %timeit f2(arr1, arr2)
1000 loops, best of 3: 755 µs per loop

So that's about 30% faster.

* Allocate a temporary that isn't used -- but should catch the creation cost

In [29]: def f3(arr1, arr2):
    ...:     result = np.empty_like(arr1)
    ...:     arr1 += arr2
    ...:     return arr1

In [30]: %timeit f3(arr1, arr2)
1000 loops, best of 3: 756 µs per loop

Only a µs slower!

Profiling is hard, and I'm not good at it, but this seems to indicate that the allocation is cheap.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
On 03.10.2016 20:23, Chris Barker wrote:
On Mon, Oct 3, 2016 at 3:16 AM, Julian Taylor <jtaylor.debian@googlemail.com <mailto:jtaylor.debian@googlemail.com>> wrote:
the problem with this approach is that we don't really want numpy holding on to hundreds of megabytes of memory by default so it would need to be a user option.
indeed -- but one could set an LRU cache to be very small (a few items, not small memory), and then it gets used within expressions but doesn't hold on to much outside of expressions.
Numpy doesn't see the whole expression, so we can't really do much. (Technically we could in 3.5 by using PEP 523, but that would be a larger undertaking.)
However, is the allocation the only (or even biggest) source of the performance hit?
On large arrays the allocation is insignificant. What does cost some time is faulting the memory into the process, which implies writing zeros into the pages (a page at a time as it is being used). By storing memory blocks in numpy we would save this portion. This is really the job of the libc, but these are usually tuned for general-purpose workloads and thus tend to give memory back to the system much earlier than numerical workloads would like.
Note that numpy already has a small memory block cache, but it's only used for very small arrays where the allocation cost itself is significant; it is limited to a couple of megabytes at most.
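A crude way to observe the faulting cost described above from Python (a sketch only; exact numbers depend heavily on the machine and the allocator):

import numpy as np
from timeit import timeit

n = 10_000_000  # ~80 MB of float64, well past the allocator's reuse thresholds

# Allocation alone is cheap: the pages have not been touched yet.
print(timeit(lambda: np.empty(n), number=20))

# The first write faults the pages into the process, which is the real cost.
def alloc_and_touch():
    a = np.empty(n)
    a[:] = 1.0
    return a
print(timeit(alloc_and_touch, number=20))

# Reusing an already-faulted buffer avoids that cost (after the first call).
buf = np.empty(n)
def reuse():
    buf[:] = 1.0
    return buf
print(timeit(reuse, number=20))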
With regards to arguments about holding onto large arrays, I would like to emphasize that my original suggestion mentioned weakref'ed numpy arrays. Essentially, the idea is to claw back only the raw memory blocks during that limbo period between discarding the numpy array Python object and when Python garbage-collects it.

Ben Root
On Mon, 03 Oct 2016 15:07:28 -0400, Benjamin Root wrote:
With regards to arguments about holding onto large arrays, I would like to emphasize that my original suggestion mentioned weakref'ed numpy arrays. Essentially, the idea is to claw back only the raw memory blocks during that limbo period between discarding the numpy array python object and when python garbage-collects it.
CPython afaik deallocates immediately when the refcount hits zero. It's relatively rare that you have arrays hanging around waiting for cycle breaking by the gc, and if you do have them hanging around, I don't think it's possible to distinguish them from other arrays without running the gc.

Note also that an "is this array in use" check probably always requires Julian's stack-based hack, since you cannot rely on the refcount.

Pauli
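(For reference, the refcount behaviour described above can be seen from Python directly; sys.getrefcount includes the reference held by its own argument, so a true temporary reports 1.)

import sys
import numpy as np

a = np.arange(10)
print(sys.getrefcount(a))              # typically 2: the name 'a' plus the call argument
print(sys.getrefcount(np.arange(10)))  # 1: a temporary, only the call argument holds it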
Note that numpy does store some larger arrays already, in the fft module. (In fact, this was a cache of unlimited size until #7686.) It might not be bad if the same cache were used more generally. That said, if newer versions of Python are offering ways of doing this better, maybe that is the best way forward.

-- Marten
Good discussion, but I was surprised by the absence of numexpr, given how relevant it is to the topic. Is the goal to fold the numexpr functionality (and beyond) into numpy?
2016-10-05 8:45 GMT+02:00 srean <srean.list@gmail.com>:
Good discussion, but I was surprised by the absence of numexpr, given how relevant it is to the topic.
Is the goal to fold the numexpr functionality (and beyond) into numpy?
Yes, the question of merging numexpr into numpy is something that periodically shows up on this list. I think mostly everyone agrees that it is a good idea, but things are not so easy, and so far nobody has provided a good patch for this. Also, the fact that numexpr relies on grouping an expression by using a string (e.g. y = ne.evaluate("x**3 + tanh(x**2) + 4")) does not play well with the way in which numpy evaluates expressions, so something should be suggested to cope with this too.
-- Francesc Alted
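For reference, a minimal sketch of the numexpr usage described above (requires the numexpr package; the expression is the one from the example, the array size is arbitrary):

import numpy as np
import numexpr as ne

x = np.random.random(1_000_000)

# plain numpy: each sub-expression materializes a full-size temporary
y_np = x**3 + np.tanh(x**2) + 4

# numexpr: the whole expression is compiled from the string and evaluated
# blockwise, keeping the working set in cache and using multiple threads
y_ne = ne.evaluate("x**3 + tanh(x**2) + 4")

assert np.allclose(y_np, y_ne)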
All, On Wed, Oct 5, 2016 at 11:46 AM, Francesc Alted <faltet@gmail.com> wrote:
2016-10-05 8:45 GMT+02:00 srean <srean.list@gmail.com>:
Good discussion, but I was surprised by the absence of numexpr, given how relevant it is to the topic.
Is the goal to fold the numexpr functionality (and beyond) into numpy?
Yes, the question of merging numexpr into numpy is something that periodically shows up on this list. I think mostly everyone agrees that it is a good idea, but things are not so easy, and so far nobody has provided a good patch for this. Also, the fact that numexpr relies on grouping an expression by using a string (e.g. y = ne.evaluate("x**3 + tanh(x**2) + 4")) does not play well with the way in which numpy evaluates expressions, so something should be suggested to cope with this too.
As Francesc said, Numexpr is going to get most of its power through grouping a series of operations so it can send blocks to the CPU cache and run the entire series of operations on the cache before returning the block to system memory. If it was just used to back-end NumPy, it would only gain from the multi-threading portion inside each function call. I'm not sure how one would go about grouping successive numpy expressions without modifying the Python interpreter.

I put a bit of effort into extending numexpr to use 4-byte word opcodes instead of 1-byte. Progress has been very slow, however, due to time constraints, but I have most of the numpy data types (u[1-4], i[1-4], f[4,8], c[8,16], S[1-4], U[1-4]). On Tuesday I finished writing a Python generator script that writes all the C-side opcode macros for opcodes.hpp. Now I have about 900 opcodes, and this could easily grow into thousands if more functions are added, so I also built a reverse lookup tree (based on collections.defaultdict) for the Python side of numexpr.

Robert

--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
robert.mcleod@unibas.ch
robert.mcleod@bsse.ethz.ch
robbmcleod@gmail.com
Thanks Francesc, Robert for giving me a broader picture of where this fits in. I believe numexpr does not handle slicing, so that might be another thing to look at. On Wed, Oct 5, 2016 at 4:26 PM, Robert McLeod <robbmcleod@gmail.com> wrote:
As Francesc said, Numexpr is going to get most of its power through grouping a series of operations so it can send blocks to the CPU cache and run the entire series of operations on the cache before returning the block to system memory. If it was just used to back-end NumPy, it would only gain from the multi-threading portion inside each function call.
Is that so? I thought numexpr also cuts down on the number of temporary buffers that get filled (in other words, copy operations) if the same expression were written as a series of operations. My understanding may be wrong, and I would appreciate correction.

The 'out' parameter in ufuncs can eliminate extra temporaries, but it's not composable: right now I have to manually carry along the array where the inplace operations take place. I think the goal here is to eliminate that.
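A sketch of the manual out= bookkeeping described above, which the elision work aims to make unnecessary for the simple cases (array names are illustrative):

import numpy as np

a, b, c = (np.random.random(1_000_000) for _ in range(3))

# naive form: two full-size temporaries
r1 = a + b + c

# hand-optimized form: carry the result array through each ufunc call
r2 = np.add(b, c)
np.add(r2, a, out=r2)

assert np.allclose(r1, r2)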
On Wed, Oct 5, 2016 at 1:11 PM, srean <srean.list@gmail.com> wrote:
Thanks Francesc, Robert for giving me a broader picture of where this fits in. I believe numexpr does not handle slicing, so that might be another thing to look at.
Dereferencing would be relatively simple to add into numexpr, as it would just be some getattr() calls. Personally I will add that at some point because it will clean up my code. Slicing, maybe only for contiguous blocks in memory? I.e. imageStack[0,:,:] would be possible, but imageStack[:, ::2, ::2] would not be trivial (I think...). I seem to remember someone asked David Cooke about slicing and he said something along the lines of, "that's what Numba is for." Perhaps NumPy back-ended by Numba is more what you are looking for, as it hooks into the byte compiler?

The main advantage of numexpr is that a series of numpy functions in <expression> can be enclosed in ne.evaluate("<expression>") and it provides a big acceleration for little programmer effort, but it's not nearly as sophisticated as Numba or PyPy.
On Wed, Oct 5, 2016 at 4:26 PM, Robert McLeod <robbmcleod@gmail.com> wrote:
As Francesc said, Numexpr is going to get most of its power through grouping a series of operations so it can send blocks to the CPU cache and run the entire series of operations on the cache before returning the block to system memory. If it was just used to back-end NumPy, it would only gain from the multi-threading portion inside each function call.
Is that so ?
I thought numexpr also cuts down on number of temporary buffers that get filled (in other words copy operations) if the same expression was written as series of operations. My understanding can be wrong, and would appreciate correction.
The 'out' parameter in ufuncs can eliminate extra temporaries but its not composable. Right now I have to manually carry along the array where the in place operations take place. I think the goal here is to eliminate that.
The numexpr virtual machine does create temporaries where needed when it parses the abstract syntax tree for all the operations it has to do. I believe the main advantage is that the temporaries are created in the CPU cache, and not in system memory. It's certainly true that numexpr doesn't create a lot of OP_COPY operations -- rather it's optimized to minimize them -- so it's probably fewer ops than naive successive calls to numpy within Python, but I'm unsure if there's any difference in operation count between a hand-optimized numpy with out= set and numexpr. Numexpr just does it for you.

This blog post from Tim Hochberg is useful for understanding the performance advantages of blocking versus multithreading:

http://www.bitsofbits.com/2014/09/21/numpy-micro-optimization-and-numexpr/

Robert

--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
robert.mcleod@unibas.ch
robert.mcleod@bsse.ethz.ch
robbmcleod@gmail.com
On Wed, Oct 5, 2016 at 5:36 PM, Robert McLeod <robbmcleod@gmail.com> wrote:
It's certainly true that numexpr doesn't create a lot of OP_COPY operations, rather it's optimized to minimize them, so probably it's fewer ops than naive successive calls to numpy within python, but I'm unsure if there's any difference in operation count between a hand-optimized numpy with out= set and numexpr. Numexpr just does it for you.
That was my understanding as well. If it automatically does what one could achieve by carrying the state along in the 'out' parameter, that's as good as it can get in terms of removing unnecessary ops. There are other speedup opportunities of course, but that's a separate matter.
This blog post from Tim Hochberg is useful for understanding the performance advantages of blocking versus multithreading:
http://www.bitsofbits.com/2014/09/21/numpy-micro-optimization-and-numexpr/
Hadn't come across that one before -- great link, thanks. Using caches and vector registers well trumps threading unless one has a lot of data, and it helps to disable hyper-threading.
hi,

This PR has now been merged. As it has the potential to break stuff, please test your applications against master, in particular ones that use Cython and C extensions.

It will only do something when working on large arrays (> 256 KiB) on platforms providing the backtrace() function, so Linux and Mac, but probably not Windows (though Windows could probably be added if someone volunteers to write the equivalent WinAPI code).

If you compile from source and want to see for sure whether it is doing anything, you can set the NPY_ELIDE_DEBUG define in ./numpy/core/src/multiarray/temp_elide.c to 1 or 2 and it will print out when it is eliding something.

cheers,
Julian
What's the timeline for the next release? I have the perfect use case for this (a Haversine calculation on large arrays that takes up ~33% of one of my processing scripts). However, to test it out, I have a huge dependency mess to wade through first, and there are no resources devoted to that project for at least a few weeks. I want to make sure I get feedback to y'all.

Cheers!
Ben Root
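For illustration, a minimal numpy Haversine sketch of the kind of expression described above -- every trig call and arithmetic step below produces a full-size temporary, which is exactly what the new elision targets (the function name, variable names, and Earth-radius constant are assumptions, not the actual script):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between arrays of points, in kilometres
    R = 6371.0  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * R * np.arcsin(np.sqrt(a))

# e.g. one million random point pairs
n = 1_000_000
lat1, lat2 = np.random.uniform(-90, 90, (2, n))
lon1, lon2 = np.random.uniform(-180, 180, (2, n))
d = haversine_km(lat1, lon1, lat2, lon2)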
On Mon, Feb 27, 2017 at 11:43 AM, Benjamin Root <ben.v.root@gmail.com> wrote:
What's the timeline for the next release? I have the perfect use case for this (a Haversine calculation on large arrays that takes up ~33% of one of my processing scripts). However, to test it out, I have a huge dependency mess to wade through first, and there are no resources devoted to that project for at least a few weeks. I want to make sure I get feedback to y'all.
I'd like to branch 1.13.x at the end of March. The planned features that still need to go in are the `__array_ufunc__` work and the `lapack_lite` update. The first RC should not take much longer. I believe Matthew is building wheels for testing on the fly but I don't know where you can find them. Chuck
Hi, On Mon, Feb 27, 2017 at 11:27 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Feb 27, 2017 at 11:43 AM, Benjamin Root <ben.v.root@gmail.com> wrote:
What's the timeline for the next release? I have the perfect use case for this (a Haversine calculation on large arrays that takes up ~33% of one of my processing scripts). However, to test it out, I have a huge dependency mess to wade through first, and there are no resources devoted to that project for at least a few weeks. I want to make sure I get feedback to y'all.
I'd like to branch 1.13.x at the end of March. The planned features that still need to go in are the `__array_ufunc__` work and the `lapack_lite` update. The first RC should not take much longer. I believe Matthew is building wheels for testing on the fly but I don't know where you can find them.
Latest wheels at:

https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcd...

PRE_URL=https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcd...
pip install -f $PRE_URL --pre numpy

Cheers,
Matthew
Ah, that wheel would be a huge help. Most of the packages I have as dependencies for this project were compiled against v1.10, so I am hoping that there won't be too big of a problem. Are the manylinux wheels still compatible with CentOS5, or has the base image been bumped to CentOS6?

Cheers!
Ben Root
Hi, On Mon, Feb 27, 2017 at 11:46 AM, Benjamin Root <ben.v.root@gmail.com> wrote:
Ah, that wheel would be a huge help. Most of the packages I have as dependencies for this project were compiled against v1.10, so I am hoping that there won't be too big of a problem.
I don't think so - the problems generally arise when you try to import a package compiled against a more recent version of numpy than the one you have installed.
Are the manylinux wheels still compatible with CentOS5, or has the base image been bumped to CentOS6?
Yup, still CentOS5. When it shifts to CentOS6, probably not very soon, you'll likely see a version number bump to manylinux2 (manylinux1 is the current version). Cheers, Matthew
participants (11)

- Benjamin Root
- Charles R Harris
- Chris Barker
- Francesc Alted
- josef.pktd@gmail.com
- Julian Taylor
- Marten van Kerkwijk
- Matthew Brett
- Pauli Virtanen
- Robert McLeod
- srean