Hi Stéfan, Johannes,

Thanks for having a read through this... I've tried to explain why I shouldn't perform the prefix/cumulative sum on the CPU using NumPy etc. in this diagram: <https://lh5.googleusercontent.com/-AhB5JfV8qQA/UaSk96a9jDI/AAAAAAAAAEY/MKD8iYCFyf8/s1600/scikit.jpg>

In method 1 I have to ship the entire flag array to the host, build the compacted array there, and ship it back (there's a sketch of this after the timing below). I've also tested cumsum with timeit, and I'm certain the CPU prefix-sum algorithms are simply too slow:
timeit.timeit('np.cumsum(x)', setup='import numpy as np; x = np.random.random_integers(0, 10, 1850)')
8.654011964797974
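For concreteness, roughly what method 1 amounts to on the host side (a minimal NumPy sketch, not the actual code; the names are placeholders I've made up, and the device/host copies are only marked in comments):

import numpy as np

# 1. Ship the whole flag array from the GPU to the host (device-to-host copy elided).
flags = np.random.randint(0, 2, 1850).astype(np.int32)  # stand-in for the flag array

# 2. Build the compacted queue of tile indices on the CPU via a prefix sum.
scan = np.cumsum(flags)                      # prefix sum over the flags
queue = np.empty(scan[-1], dtype=np.int32)   # one slot per set flag
queue[scan[flags == 1] - 1] = np.nonzero(flags)[0]

# 3. Ship the compacted queue back to the GPU (host-to-device copy elided).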
Method 2 just requires that the length of the queue be shipped across to the CPU, to know how many threads to execute the GPU method gpu_process_tiles() with.
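Roughly, the host side of method 2 would look something like the sketch below (PyOpenCL-flavoured and purely illustrative; I'm assuming an OpenCL setup, and the kernel body, buffer names and toy data are placeholders, with the queue filled from host data only so the sketch runs end to end):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
cq = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
__kernel void gpu_process_tiles(__global const int *queue,
                                __global float *image)
{
    int gid = get_global_id(0);
    int tile = queue[gid];
    image[tile] += 1.0f;   /* placeholder per-tile work */
}
"""
prg = cl.Program(ctx, src).build()

# Pretend the GPU scan/compaction has already produced these on the device:
# the compacted queue of tile indices and its length.
queue_host = np.array([1, 2, 4], dtype=np.int32)
queue_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=queue_host)
queue_len_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                          hostbuf=np.array([queue_host.size], dtype=np.int32))
image_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR,
                      hostbuf=np.zeros(6, dtype=np.float32))

# Ship only the queue length (a single int) back to the host ...
queue_len = np.empty(1, dtype=np.int32)
cl.enqueue_copy(cq, queue_len, queue_len_buf)

# ... and use it as the global work size: one work-item per queued tile.
# The flag array and the compacted queue themselves never leave the device.
prg.gpu_process_tiles(cq, (int(queue_len[0]),), None, queue_buf, image_buf)
cq.finish()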
On Monday, May 27, 2013 11:54:50 PM UTC+2, Stefan van der Walt wrote:
On Mon, May 27, 2013 at 10:55 PM, Marc de Klerk <dekl...@gmail.com> wrote:
This operation has to happen a lot… so I really need it to be fast. The problem I'm having is that when I isolate and measure the execution time of the GPU code, it's much faster than that of the C++ or Cython wrapper, which I cannot really do without.
Another option is to call into the NumPy C API to invoke essentially the equivalent of
np.nonzero(np.diff(np.cumsum(x)))[0] + 1
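For reference, on a small flag array that expression evaluates to the indices of the set flags (pure NumPy, just to illustrate; the example data is made up):

import numpy as np

x = np.array([0, 1, 1, 0, 1, 0])                  # 1 marks a tile that needs processing
print(np.cumsum(x))                               # [0 1 2 2 3 3]
print(np.nonzero(np.diff(np.cumsum(x)))[0] + 1)   # [1 2 4] -- indices of the set flags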
Stéfan