[Cython] PR on refcounting memoryview buffers

Mon Feb 18 19:32:40 CET 2013

As Stefan suggested, I have posted a PR with a better fix for the issue 
where MinGW for some reason emits the symbol "__sync_fetch_and_add_4" 
instead of generating an atomic opcode for the __sync_fetch_and_add 
builtin.

The PR is here:
https://github.com/cython/cython/pull/185

The discussion probably belongs on this list instead of cython-users:

The problem this addresses is when GCC does not use atomic builtins and 
emits calls to __sync_fetch_and_add_4 and __sync_fetch_and_sub_4 when 
Cython is internally refcounting memoryview buffers. For some reason it 
can even happen on x86 and amd64.

My PR undoes Mark's quick fix that always uses PyThread_acquire_lock on 
MinGW. PyThread_acquire_lock uses a kernel object (semaphore) on Windows 
and is not very efficient. I want slicing memoryviews to be fast, and 
that means PyThread_acquire_lock must go. My PR uses the Windows API 
atomic function InterlockedAdd to implement the semantics of 
__sync_fetch_and_add_4 and __sync_fetch_and_sub_4 instead of using a 
Python lock.
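
A minimal sketch of the idea (not the actual code in the PR): the 
function names demo_fetch_and_add_4 and demo_fetch_and_sub_4 are 
hypothetical. The sketch uses InterlockedExchangeAdd rather than 
InterlockedAdd, because InterlockedExchangeAdd returns the old value, 
which is what fetch-and-add must return. The non-Windows branch is there 
only so the snippet compiles elsewhere.

```c
#include <stdint.h>

#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>
#endif

/* Hypothetical helper: atomically add v to *p and return the value
 * that *p held BEFORE the addition (fetch-and-add semantics). */
static int32_t demo_fetch_and_add_4(volatile int32_t *p, int32_t v)
{
#if defined(_WIN32) || defined(__CYGWIN__)
    /* InterlockedExchangeAdd returns the initial value of *p. */
    return (int32_t)InterlockedExchangeAdd((volatile LONG *)p, (LONG)v);
#else
    /* Stand-in for platforms where the GCC builtin works. */
    return __sync_fetch_and_add(p, v);
#endif
}

/* Subtraction is just addition of the negated amount. */
static int32_t demo_fetch_and_sub_4(volatile int32_t *p, int32_t v)
{
    return demo_fetch_and_add_4(p, -v);
}
```

No kernel object is involved on either branch, which is the point of 
dropping PyThread_acquire_lock.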

Usually MinGW is configured to compile GNU atomic builtins correctly. I 
have yet to see a case where it is not. But obviously one user (JF 
Gallant) has encountered it. I don't think it is a MinGW specific 
problem, but currently it has only been seen on MinGW and the fix is 
MinGW specific (well, it should work on Cygwin too). But whenever MinGW 
does use atomic builtins it just uses them. So it incurs no speed 
penalty on well-behaved MinGW builds.

I took the liberty of using the GNU extensions __inline__ and 
__attribute__((always_inline)). They will make sure the functions always 
behave like macros. The rationale is that this is GCC specific code, so 
we can assume GNU extensions are available. If we take them away the 
code should still work, but we have no guarantee the functions will be 
inlined. I did not use macros because __sync_fetch_and_add is emitted 
by the preprocessor, and thus GCC will presumably emit 
__sync_fetch_and_sub_4 after the preprocessing step, which could 
require __sync_fetch_and_sub_4 to be a function instead of another 
macro. (I have no way of finding it out since I cannot test for it.)
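
For illustration, this is roughly what a forced-inline function looks 
like with those GNU extensions. The names demo_incref and demo_decref 
are hypothetical and not taken from the PR; the bodies just call the 
GCC builtin as a placeholder.

```c
/* GCC-specific sketch: always_inline asks GCC to inline the function
 * at every call site, so it costs the same as a macro but is still a
 * real function with typed arguments. */
static __inline__ __attribute__((always_inline))
int demo_incref(volatile int *count)
{
    /* Returns the count as it was before the increment. */
    return __sync_fetch_and_add(count, 1);
}

static __inline__ __attribute__((always_inline))
int demo_decref(volatile int *count)
{
    /* Returns the count as it was before the decrement. */
    return __sync_fetch_and_sub(count, 1);
}
```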

Regarding Linux and OSX:

Failure of GCC to use atomic builtins could also happen on other GCC 
builds though. I don't think it is a MinGW-only issue. It's probably due 
to how the GCC build was configured. So we should as a safeguard have 
this for other OSes too.

http://developer.apple.com/library/ios/#DOCUMENTATION/System/Conceptual/ManPages_iPhoneOS/man3/OSAtomicAdd32.3.html

We probably just need similar code to what I wrote for MinGW. I can 
write the code, but I don't have a Mac on which to test it.

Also we should use OSAtomic* on clang/LLVM, which is now the platform C 
compiler on OSX. This will avoid PyThread_acquire_lock being the common 
synch mechanism for refcounting memoryview buffers on OSX.
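
A rough, untested-on-Mac sketch of what the OSX variant could look 
like. The name demo_fetch_and_add_osx is hypothetical. 
OSAtomicAdd32Barrier returns the new value, so the amount is subtracted 
again to get fetch-and-add semantics. The non-Apple branch only exists 
so the snippet compiles on other systems.

```c
#include <stdint.h>

#if defined(__APPLE__)
#include <libkern/OSAtomic.h>
#endif

/* Hypothetical helper: atomically add v to *p, return the OLD value. */
static int32_t demo_fetch_and_add_osx(volatile int32_t *p, int32_t v)
{
#if defined(__APPLE__)
    /* OSAtomicAdd32Barrier(amount, ptr) returns the NEW value. */
    return OSAtomicAdd32Barrier(v, p) - v;
#else
    /* Stand-in for non-Apple platforms. */
    return __sync_fetch_and_add(p, v);
#endif
}
```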

On Linux I am not sure what to suggest if GCC fails to use atomic 
builtins. I can handcode inline assembly for x86/amd64. I could also use 
pthreads or Pth locks. But we could also assume that it never 
happens and just let the linker fail on __sync_fetch_and_add_4.
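
To make the two Linux options concrete, here is a sketch under the 
assumption that a 32-bit int counter is enough. The name 
demo_fetch_and_add_linux is hypothetical. On x86/amd64 it uses a 
"lock xadd" instruction; on other architectures it falls back to a 
pthreads mutex.

```c
#if defined(__i386__) || defined(__x86_64__)

/* x86/amd64: "lock xadd" atomically swaps the register with the memory
 * operand and stores their sum in memory, so after the instruction the
 * register holds the OLD value of *p. */
static int demo_fetch_and_add_linux(volatile int *p, int v)
{
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r"(v), "+m"(*p)
                         :
                         : "memory", "cc");
    return v;
}

#else

#include <pthread.h>

/* Other architectures: one global mutex guards every counter update.
 * Correct, but slower than a real atomic instruction. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

static int demo_fetch_and_add_linux(volatile int *p, int v)
{
    int old;
    pthread_mutex_lock(&demo_lock);
    old = *p;
    *p = old + v;
    pthread_mutex_unlock(&demo_lock);
    return old;
}

#endif
```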

Sturla