Hi Stéfan, Johannes,
Thanks for having a read through this...
I've tried to illustrate in this diagram why I shouldn't perform the prefix/cumulative sum on the CPU with NumPy etc...
In method 1 I have to ship the entire flag array to the CPU, build the compacted array there, and ship it back. I've also timed cumsum with timeit, and I'm fairly sure the CPU prefix-sum algorithms are simply too slow:
timeit.timeit('np.cumsum(x)', setup='import numpy as np; x = np.random.random_integers(0, 10, 1850)')
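For reference, the CPU-side compaction that method 1 has to do amounts to something like the following (a minimal NumPy sketch; the array names and contents are illustrative, not from my actual code):

```python
import numpy as np

# flags[i] == 1 marks a tile that survives compaction.
flags = np.array([0, 1, 1, 0, 1, 0, 0, 1])
values = np.arange(len(flags))  # stand-in for the tile data

# Exclusive prefix sum gives each surviving element its output slot.
slots = np.cumsum(flags) - flags      # [0, 0, 1, 2, 2, 3, 3, 3]

# Scatter the survivors into a dense output array.
compacted = np.empty(flags.sum(), dtype=values.dtype)
compacted[slots[flags == 1]] = values[flags == 1]
# compacted is now [1, 2, 4, 7]
```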
Method 2 just requires that the length of the queue be shipped to the CPU, so it knows how many threads to execute the GPU method gpu_process_tiles() with.
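In other words, the only host-side work left in method 2 is reading back a single integer and sizing the kernel launch. A rough sketch of that bookkeeping (the names and block size here are assumptions, not from the actual code):

```python
# Hypothetical launch bookkeeping for method 2: the GPU has already
# compacted the work queue, so only its length crosses the bus.
queue_length = 1850          # single int copied device -> host
threads_per_block = 256      # assumed block size

# Ceil-divide to get enough blocks to cover the whole queue.
blocks = (queue_length + threads_per_block - 1) // threads_per_block
# gpu_process_tiles() would then be launched with
# (blocks, threads_per_block); here blocks == 8.
```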
On Monday, May 27, 2013 11:54:50 PM UTC+2, Stefan van der Walt wrote:
This operation has to happen a lot… so I really need it to be fast. The problem I'm having is that when I isolate and measure the execution time of the GPU code, it's much faster than that of the C++ or Cython wrapper, which I can't really do without.
Another option is to call into the NumPy C API to invoke essentially the equivalent of

np.nonzero(np.diff(np.cumsum(x)))[0] + 1
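As a sanity check of what that expression computes (note the [0] is needed because np.nonzero returns a tuple of index arrays):

```python
import numpy as np

x = np.array([3, 0, 5, 0, 0, 2])

# diff(cumsum(x)) reconstructs x[1:], so the full expression yields the
# indices of the nonzero elements of x, except for a nonzero x[0].
idx = np.nonzero(np.diff(np.cumsum(x)))[0] + 1
# idx -> array([2, 5])
```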