[Cython] [cython-users] memoryviews & parameter passing

Tue Dec 20 21:18:10 CET 2011

On 20 December 2011 18:57, Dirk Rothe <thecere at gmail.com> wrote:
> Hello Cython-Devs,
>
> I thought I'd check out the memoryview syntax from cython-trunk to
> refactor some tight loops on numpy arrays into smaller functions. But
> either I'm doing something wrong or the call overhead (of dostuff())
> is still very large. Am I missing something?
>
> cimport cython
> import numpy as np
> cimport numpy as np
>
> @cython.boundscheck(False)
> cdef inline int dostuff(np.int_t[:] data, int i, int j) nogil:
>    return data[j] + i + j
>
> @cython.boundscheck(False)
> def test():
>    cdef np.int_t[:, :] data = np.zeros((3000, 20000), dtype=np.int)
>    cdef int i, j
>    with nogil:
>        for i in range(3000):
>            for j in range(20000):
>                # try to be as fast
>                data[i, j] = dostuff(data[i], i, j)
>                # as direct array access
>                #~ data[i, j] = data[i, j] + i + j
>
> thnx, dirk

The performance difference is indeed quite large. There are several
problems with the implementation of slices:

    1) The overhead of PyThread_acquire_lock() is quite large; we
should resort to an atomic approach instead.
    2) Slices support up to 32 dimensions by default (configurable
as a compiler option). This is a lot of memory to copy around all the
time. I think a default of 8 would be more sensible, and the compiler
option should be documented well (who uses 32 dimensions anyway?).
    3) The slicing function takes a generic approach and could be
somewhat faster when the slice is direct and strided.
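
To give a feel for point 2, here is a rough C sketch of a slice descriptor that is passed around by value. The struct layout, the field names and the DEFINE_SLICE macro are assumptions made up for illustration; the struct in the generated code may look different.

```c
#include <stddef.h>

/* Hypothetical by-value slice descriptor: an owner pointer, a data
   pointer, and one shape/strides/suboffsets entry per dimension. */
#define DEFINE_SLICE(NAME, NDIM)        \
    typedef struct {                    \
        void *owner;                    \
        char *data;                     \
        ptrdiff_t shape[NDIM];          \
        ptrdiff_t strides[NDIM];        \
        ptrdiff_t suboffsets[NDIM];     \
    } NAME

DEFINE_SLICE(slice32, 32);  /* current default: 32 dimensions */
DEFINE_SLICE(slice8, 8);    /* proposed default: 8 dimensions */

/* Bytes copied each time such a descriptor is passed by value. */
size_t slice32_bytes(void) { return sizeof(slice32); }
size_t slice8_bytes(void)  { return sizeof(slice8); }
```

Under this layout, and assuming 8-byte pointers and an 8-byte ptrdiff_t, the 32-dimension descriptor is 784 bytes and the 8-dimension one is 208 bytes, so every slice passed to a function copies roughly a quarter as much.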

Addressing these problems by tweaking the generated code brings the
running time down from ~16 seconds to ~2.4 seconds. The direct indexing
approach without a function call takes ~0.35 seconds. Slicing will never
be as fast, so if you really want to write the code that way, move the
data[i] slicing into the outer loop, as in:

for i in range(3000):
    dataslice = data[i]
    for j in range(...): ...
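
At the C level, the hoisting above amounts to roughly the following. This is only a sketch: row_view, take_row and fill_hoisted are hypothetical names, the array is assumed to be C-contiguous, and long stands in for np.int_t.

```c
#include <stddef.h>

/* Hypothetical 1-D view: data pointer, extent, and byte stride. */
typedef struct {
    char *data;
    ptrdiff_t shape;
    ptrdiff_t stride;
} row_view;

/* Build a view of row i of a C-contiguous array with `cols` columns. */
static row_view take_row(long *base, ptrdiff_t cols, ptrdiff_t i)
{
    row_view v;
    v.data = (char *)(base + i * cols);
    v.shape = cols;
    v.stride = (ptrdiff_t)sizeof(long);
    return v;
}

/* Counterpart of dostuff(): read element j of the row, add i and j. */
static long dostuff(row_view row, int i, int j)
{
    return *(long *)(row.data + j * row.stride) + i + j;
}

/* The view is built once per row and reused in the inner loop. */
void fill_hoisted(long *base, int rows, int cols)
{
    for (int i = 0; i < rows; i++) {
        row_view row = take_row(base, cols, i);
        for (int j = 0; j < cols; j++)
            base[(ptrdiff_t)i * cols + j] = dostuff(row, i, j);
    }
}
```

The slicing cost is then paid once per row instead of once per element, which is what moving data[i] into the outer loop buys you.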

Cython could do that optimization itself, since the 'data' slice does
not change in the inner loop, but currently it doesn't. Even so, slicing
should not be more than 10 times slower than direct indexing, so this
will be worked on.

@cython-dev
How should atomic operations be supported? Should this use something
like http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html,
or something like libatomic? Or should we "just" implement a garbage
collector for pure-Cython level stuff (like memoryview slices),
thereby avoiding the need for acquisition counting?
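
For the first option, a minimal sketch of an atomic acquisition count using the GCC builtins from the page linked above might look like this. It assumes a compiler that provides the __sync builtins (GCC or Clang); slice_owner, owner_acquire and owner_release are hypothetical names, not anything that exists in Cython.

```c
/* Hypothetical owner object shared by several slices. */
typedef struct {
    int acquisition_count;
} slice_owner;

/* Atomically take a reference; returns the count before the increment. */
int owner_acquire(slice_owner *o)
{
    return __sync_fetch_and_add(&o->acquisition_count, 1);
}

/* Atomically drop a reference; returns 1 if it was the last one. */
int owner_release(slice_owner *o)
{
    return __sync_sub_and_fetch(&o->acquisition_count, 1) == 0;
}
```

No lock is taken on either path, so this would avoid the PyThread_acquire_lock() overhead, at the price of depending on compiler-specific builtins.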