[Cython] PR on refcounting memoryview buffers
Sturla Molden
sturla at molden.no
Mon Feb 18 19:32:40 CET 2013
As Stefan suggested, I have posted a PR with a better fix for the issue
where MinGW for some reason emits the symbol "__sync_fetch_and_add_4"
instead of generating an atomic opcode for the __sync_fetch_and_add builtin.
The PR is here:
https://github.com/cython/cython/pull/185
The discussion probably belongs on this list instead of the Cython users list:
The problem this addresses is that GCC sometimes does not use the atomic
builtins and instead emits calls to __sync_fetch_and_add_4 and
__sync_fetch_and_sub_4 when Cython is internally refcounting memoryview
buffers. For some reason it can even happen on x86 and amd64.
My PR undoes Mark's quick fix that always uses PyThread_acquire_lock on
MinGW. PyThread_acquire_lock uses a kernel object (semaphore) on Windows
and is not very efficient. I want slicing memoryviews to be fast, and
that means PyThread_acquire_lock must go. My PR uses the Windows API
atomic function InterlockedAdd to implement the semantics of
__sync_fetch_and_add_4 and __sync_fetch_and_sub_4 instead of using a
Python lock.
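To illustrate the idea (this is not the code from the PR), here is a
minimal sketch of fetch-and-add/fetch-and-sub helpers for a 32-bit
refcount. The names sketch_fetch_and_add_4 and sketch_fetch_and_sub_4
are hypothetical. The sketch assumes InterlockedExchangeAdd rather than
InterlockedAdd, because it returns the value before the addition, which
is the same contract as __sync_fetch_and_add. On non-Windows builds it
simply defers to the GCC builtins.

```c
#include <stdint.h>

#if defined(_WIN32)
#include <windows.h>
#endif

/* Hypothetical helper: atomically add `value` to *ptr and return the
   OLD value, like __sync_fetch_and_add on a 32-bit integer. */
static __inline__ __attribute__((always_inline))
int32_t sketch_fetch_and_add_4(volatile int32_t *ptr, int32_t value)
{
#if defined(_WIN32)
    /* InterlockedExchangeAdd returns the value before the addition. */
    return (int32_t)InterlockedExchangeAdd((volatile LONG *)ptr, (LONG)value);
#else
    /* Assumption: elsewhere the GCC builtin is available. */
    return __sync_fetch_and_add(ptr, value);
#endif
}

/* Hypothetical helper: atomically subtract `value` from *ptr and return
   the OLD value, like __sync_fetch_and_sub. */
static __inline__ __attribute__((always_inline))
int32_t sketch_fetch_and_sub_4(volatile int32_t *ptr, int32_t value)
{
#if defined(_WIN32)
    /* Subtraction is an atomic add of the negated amount. */
    return (int32_t)InterlockedExchangeAdd((volatile LONG *)ptr, -(LONG)value);
#else
    return __sync_fetch_and_sub(ptr, value);
#endif
}
```

With InterlockedAdd, which returns the new value, the old value would
have to be recovered by subtracting the amount again.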
Usually MinGW is configured to compile the GNU atomic builtins correctly.
I have yet to see a case where it is not, but obviously one user (JF
Gallant) has encountered it. I don't think it is a MinGW-specific
problem, but so far it has only been seen on MinGW and the fix is
MinGW-specific (well, it should work on Cygwin too). Whenever MinGW
does support the atomic builtins, the code just uses them, so the fix
incurs no speed penalty on well-behaved MinGW builds.
I took the liberty of using the GNU extensions __inline__ and
__attribute__((always_inline)). They make sure the functions always
behave like macros. The rationale is that this is GCC-specific code, so
we can assume GNU extensions are available. If we take them away the
code should still work, but we have no guarantee the functions will be
inlined. I did not use macros because __sync_fetch_and_add is a compiler
builtin, and thus GCC will presumably emit the sized symbols such as
__sync_fetch_and_sub_4 after the preprocessing step, which could
require __sync_fetch_and_sub_4 to be a function instead of another
macro. (I have no way of finding out, since I cannot test for it.)
Regarding Linux and OSX:
Failure of GCC to use atomic builtins could also happen on other GCC
builds though. I don't think it is a MinGW-only issue. It's probably due
to how the GCC build was configured. So as a safeguard we should have
this for other OSes too. On OSX, OSAtomicAdd32 could be used:
http://developer.apple.com/library/ios/#DOCUMENTATION/System/Conceptual/ManPages_iPhoneOS/man3/OSAtomicAdd32.3.html
We probably just need similar code to what I wrote for MinGW. I can
write the code, but I don't have a Mac on which to test it.
Also, we should use OSAtomic* on clang/LLVM, which is now the platform C
compiler on OSX. This will avoid PyThread_acquire_lock being the common
synchronization mechanism for refcounting memoryview buffers on OSX.
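As a rough, untested sketch of what the OSX variant might look like: the
helper name sketch_osx_fetch_and_add_4 is hypothetical. It assumes the
barrier variant OSAtomicAdd32Barrier from <libkern/OSAtomic.h>, chosen
because the __sync builtins imply a full barrier. OSAtomicAdd32Barrier
returns the new value, so the old value is recovered by subtracting the
amount. Outside OSX the sketch falls back to the GCC builtin.

```c
#include <stdint.h>

#if defined(__APPLE__)
#include <libkern/OSAtomic.h>
#endif

/* Hypothetical helper: atomically add `value` to *ptr and return the
   OLD value, like __sync_fetch_and_add on a 32-bit integer. */
static __inline__ __attribute__((always_inline))
int32_t sketch_osx_fetch_and_add_4(volatile int32_t *ptr, int32_t value)
{
#if defined(__APPLE__)
    /* OSAtomicAdd32Barrier returns the NEW value; subtract the amount
       to get the value that was stored before the addition. */
    return OSAtomicAdd32Barrier(value, ptr) - value;
#else
    /* Assumption: elsewhere the GCC builtin is available. */
    return __sync_fetch_and_add(ptr, value);
#endif
}
```

Fetch-and-sub would be the same call with the amount negated.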
On Linux I am not sure what to suggest if GCC fails to use atomic
builtins. I can handcode inline assembly for x86/amd64. I could also use
pthreads and pth thread locks. But we could also assume that it never
happens and just let the linker fail on __sync_fetch_and_add_4.
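For the pthreads option, a minimal sketch under the assumption that one
global mutex guards all refcount updates. The names sketch_refcount_lock
and sketch_locked_fetch_and_add_4 are hypothetical. This is a lock-based
fallback and not a true atomic instruction, so it would be slower than
the builtin.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical global lock shared by all refcount updates. */
static pthread_mutex_t sketch_refcount_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: add `value` to *ptr under the lock and return
   the OLD value, matching the contract of __sync_fetch_and_add. */
static int32_t sketch_locked_fetch_and_add_4(volatile int32_t *ptr,
                                             int32_t value)
{
    int32_t old;

    pthread_mutex_lock(&sketch_refcount_lock);
    old = *ptr;            /* read the current count */
    *ptr = old + value;    /* store the updated count */
    pthread_mutex_unlock(&sketch_refcount_lock);

    return old;
}
```

Every reader and writer of the counter would have to go through the same
lock for this to be safe.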
Sturla