Cython / c++ help - scikit-image

27 May 2013

Hi Guys,

I'm hoping that I could get some help and advice on c++ and Cython.

I've written an OpenCL implementation of the prefix-sum algorithm, which I 
use to generate a compacted lookup table for a sparse binary array (known as 
stream compaction).

The algorithm isn't really important right now, but just to show what it 
does here's an example…

<https://lh3.googleusercontent.com/-RgftPXLsAkg/UaPGuDqDcgI/AAAAAAAAADc/Qtoaoc74VRg/s1600/Screen+Shot+2013-05-27+at+10.46.56+PM.png>

In the end, the result is a nice compact array of the indices that were 
flagged…
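
For readers unfamiliar with the idea, here is a minimal CPU-side sketch of stream compaction via an exclusive prefix sum, in plain Python. It is only an illustration and not the OpenCL code from the repository; the function names `exclusive_prefix_sum` and `compact_indices` and the sample `flags` array are assumptions made up for this example.

```python
from itertools import accumulate


def exclusive_prefix_sum(flags):
    """Exclusive scan: element i holds the sum of flags[0..i-1]."""
    if not flags:
        return []
    # accumulate() yields the inclusive scan; shift it right by one slot.
    inclusive = list(accumulate(flags))
    return [0] + inclusive[:-1]


def compact_indices(flags):
    """Return the indices of all set flags, packed into a dense list."""
    if not flags:
        return []
    offsets = exclusive_prefix_sum(flags)
    # The number of set flags is the last offset plus the last flag.
    total = offsets[-1] + flags[-1]
    out = [0] * total
    # Scatter step: each flagged element writes its own index to its offset.
    # On a GPU every element would do this in parallel.
    for i, flag in enumerate(flags):
        if flag:
            out[offsets[i]] = i
    return out


# Hypothetical sparse binary array (1 = tile needs processing).
flags = [0, 1, 0, 0, 1, 1, 0, 1]
print(exclusive_prefix_sum(flags))  # [0, 0, 1, 1, 1, 2, 3, 3]
print(compact_indices(flags))       # [1, 4, 5, 7]
```

The scan gives every flagged element a unique output slot, which is what lets the scatter step run without any coordination between elements.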

I'm mainly using it to know which tiles to process for grow cut or graph 
cut on the GPU, like this:

<https://lh5.googleusercontent.com/-WBzEmrGga-M/UaPHDOPFzVI/AAAAAAAAADk/7vkyOcUI6lE/s1600/Screen+Shot+2013-05-27+at+10.47.02+PM.png>

This operation has to happen a lot… so I really need it to be fast. The 
problem I'm having is that when I isolate and measure the execution 
time of the GPU code, it's much faster than that of the c++ or Cython 
wrapper - which I cannot really do without.

So I'm kinda hoping someone can help me to really squash the additional 
execution time from the overhead of the wrapper.

Originally I wrote a Python and then a Cython wrapper, and when looking at the 
difference between the execution time of just the GPU code vs the total 
time, I thought it must be from the overhead of Python/Cython. But I've 
just written a c++ wrapper and it's not a whole lot faster than 
Python/Cython. I'm still hoping there's a lot that can be done…
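
To make the comparison concrete, here is a rough Python sketch of how the "kernel only" time and the "total" time of a wrapper call can be separated. Everything in it is a stand-in: `fake_kernel` is a hypothetical placeholder for the GPU call, the `list(flags)` copy is a placeholder for host-side setup such as buffer creation and transfers, and none of it is taken from the repository.

```python
import time


def fake_kernel(data):
    # Hypothetical stand-in for the GPU work being measured.
    return sum(data)


def wrapper(flags, timings):
    """Run one call and accumulate kernel-only and total wall-clock time."""
    t0 = time.perf_counter()
    data = list(flags)            # placeholder for host-side setup / copies
    t1 = time.perf_counter()
    result = fake_kernel(data)    # the part timed as "just the kernel"
    t2 = time.perf_counter()
    timings["kernel"] += t2 - t1
    timings["total"] += t2 - t0
    return result


timings = {"kernel": 0.0, "total": 0.0}
flags = [0, 1] * 1000
for _ in range(100):
    wrapper(flags, timings)

# Whatever is left after subtracting the kernel time is wrapper overhead.
overhead = timings["total"] - timings["kernel"]
print("kernel: %.6f s  total: %.6f s  overhead: %.6f s"
      % (timings["kernel"], timings["total"], overhead))
```

Accumulating over many calls, as above, helps when a single call is too short to time reliably.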

Here are two graphs that might help explain…

The one below measures the execution time of just the GPU code in the 3 
implementations. They should be exactly the same, and they are, more or less.

<https://lh6.googleusercontent.com/-CkzsR0Fx5cI/UaPHWDh-evI/AAAAAAAAADs/pnWsvhnZC_g/s1600/Screen+Shot+2013-05-27+at+10.47.14+PM.png>

The problem is this next graph… Besides the difference between the c++ and 
the other two, there's still a large difference between the c++ plot and the 
plots in the graph above…

<https://lh3.googleusercontent.com/--dfWc8dngo8/UaPH8bCaceI/AAAAAAAAAD4/kJDh-YFdOFk/s1600/Screen+Shot+2013-05-27+at+10.47.23+PM.png>

The code is all on https://github.com/mdeklerk/cl-util

The files of interest are pyPrefixSum.py and PrefixSum.pyx, which can be 
tested with test_PrefixSum, and PrefixSum.cpp, which just needs to be 
compiled and run…

If you've gotten this far, thanks for reading it, I hope it's clear :)

I'll greatly appreciate any help, even pointing me more or less in the 
right direction etc…

Cheers,

Marc
