Hi Guys,

I'm hoping to get some help and advice on C++ and Cython.

I've written an OpenCL implementation of the prefix-sum algorithm, which I use to generate a compacted lookup table for a sparse binary array (known as stream compaction). The algorithm itself isn't really important right now, but to show what it does, here's an example:

<https://lh3.googleusercontent.com/-RgftPXLsAkg/UaPGuDqDcgI/AAAAAAAAADc/Qtoaoc74VRg/s1600/Screen+Shot+2013-05-27+at+10.46.56+PM.png>

The end result is a nice compact array of the indices that were flagged. I mainly use it to know which tiles to process for grow cut or graph cut on the GPU, like this:

<https://lh5.googleusercontent.com/-WBzEmrGga-M/UaPHDOPFzVI/AAAAAAAAADk/7vkyOcUI6lE/s1600/Screen+Shot+2013-05-27+at+10.47.02+PM.png>

This operation has to happen a lot, so I really need it to be fast. The problem is that when I isolate and measure the execution time of the GPU code, it is much faster than the total time measured through the C++ or Cython wrapper, which I cannot really do without.

So I'm hoping someone can help me squash the additional execution time that comes from the overhead of the wrapper.

Originally I wrote a Python and then a Cython wrapper. Looking at the difference between the execution time of just the GPU code and the total time, I thought the gap must come from Python/Cython overhead. But I've now written a C++ wrapper and it's not a whole lot faster than Python/Cython. I'm still hoping there's a lot that can be done.

Here are two graphs that might help explain. The one below measures the execution time of just the GPU code in the 3 implementations. They should be exactly the same, and they more or less are.

<https://lh6.googleusercontent.com/-CkzsR0Fx5cI/UaPHWDh-evI/AAAAAAAAADs/pnWsvhnZC_g/s1600/Screen+Shot+2013-05-27+at+10.47.14+PM.png>

The problem is this next graph. Besides the difference between the C++ plot and the other two, there's still a large difference between the C++ plot and the plots in the graph above.

<https://lh3.googleusercontent.com/--dfWc8dngo8/UaPH8bCaceI/AAAAAAAAAD4/kJDh-YFdOFk/s1600/Screen+Shot+2013-05-27+at+10.47.23+PM.png>

The code is all on https://github.com/mdeklerk/cl-util

The files of interest are pyPrefixSum.py and PrefixSum.pyx, which can be tested with test_PrefixSum, and PrefixSum.cpp, which just needs to be compiled and run.

If you've gotten this far, thanks for reading. I hope it's clear :) I'll greatly appreciate any help, even just pointing me more or less in the right direction.

Cheers, Marc
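For readers unfamiliar with stream compaction, the idea described above can be sketched on the CPU in plain Python. This is only an illustration of the technique, not the OpenCL code from the repository; the function name `compact_indices` and the sample flags are made up for the example, and the input is assumed to contain only 0s and 1s.

```python
from itertools import accumulate


def compact_indices(flags):
    """Return the indices of the flagged (1) entries using an exclusive prefix sum."""
    # Inclusive scan: inclusive[i] = number of flagged entries in flags[0..i].
    inclusive = list(accumulate(flags))
    # Exclusive scan: exclusive[i] = number of flagged entries before position i.
    exclusive = [0] + inclusive[:-1]
    total = inclusive[-1] if inclusive else 0

    out = [0] * total
    # Scatter step: every flagged element writes its own index into the
    # output slot given by the exclusive scan. On a GPU each element can
    # do this independently, which is why the scan is useful there.
    for i, flag in enumerate(flags):
        if flag:
            out[exclusive[i]] = i
    return out


flags = [0, 1, 0, 0, 1, 1, 0, 1]
print(compact_indices(flags))  # [1, 4, 5, 7]
```

The exclusive scan gives each flagged element a unique, densely packed destination, so the scatter needs no coordination between elements.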
Where can I find clutil.py?

Johannes Schönberger

(replying to Marc de Klerk <deklerkmc@gmail.com>, message of 27.05.2013, 22:55)
-- You received this message because you are subscribed to the Google Groups "scikit-image" group. To unsubscribe from this group and stop receiving emails from it, send an email to scikit-image+unsubscribe@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Sorry, I don't know why I didn't see it …

Johannes Schönberger

(replying to Johannes Schönberger <jschoenberger@demuc.de>, message of 27.05.2013, 23:39)
On Mon, May 27, 2013 at 10:55 PM, Marc de Klerk <deklerkmc@gmail.com> wrote:
This operation has to happen a lot… so I really need it to be fast. The problem I'm having is that when I isolate and measure the execution time of the gpu code it's much faster than that of the c++ or Cython wrapper - which I cannot really do without.

Another option is to call into the NumPy C API to invoke essentially the equivalent of

    np.nonzero(np.diff(np.cumsum(x)))[0] + 1

Stéfan
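A minimal sketch of what that expression computes at the Python level (not through the C API), assuming `x` is a 1-D array of 0/1 flags; the sample values are made up for illustration. One caveat worth knowing: because `np.diff` drops the first element, the expression cannot report index 0 even when `x[0]` is set. `np.flatnonzero` is an existing NumPy function that returns the flagged indices directly and is shown here for comparison.

```python
import numpy as np

# Hypothetical sparse binary array; note that the first element is flagged.
x = np.array([1, 0, 1, 1, 0, 0, 1], dtype=np.int32)

# The expression from the message: the cumulative sum increases exactly
# where x is non-zero, so non-zero differences mark flagged positions.
via_scan = np.nonzero(np.diff(np.cumsum(x)))[0] + 1

# Direct lookup of the non-zero positions, for comparison.
direct = np.flatnonzero(x)

print(via_scan)  # [2 3 6]
print(direct)    # [0 2 3 6]
```

If index 0 matters, either use the direct lookup or prepend a zero to the cumulative sum before differencing.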
On Mon, May 27, 2013 at 10:55 PM, Marc de Klerk <deklerkmc@gmail.com> wrote:
The problem is this next graph…. Besides the difference between the c++ and the other two, there's still a large difference between the c++
Can you just explain again the difference between the two plots above?

Stéfan
Participants (3):
- Johannes Schönberger
- Marc de Klerk
- Stéfan van der Walt