Did you try running the same code with stock Python?
One reason I ask is the IIUC, you are using numpy for the individual vector operations, and numpy already releases the GIL in some circumstances.
I had not run the same code with stock Python (but see below). Also, I only used numpy for two bits: 1. I use numpy arrays filled with random values, and the output array is also a numpy array. The vector multiplication is done in a simple for loop in my vecmul() function. 2. Early on I compared my results with the result of numpy.matmul just to make sure I had things right. That said, I have now run my example code using both PYTHONGIL=0 and PYTHONGIL=1 of Sam's nogil branch as well as the following other Python3 versions: * Conda Python3 (3.9.7) * /usr/bin/python3 (3.9.1 in my case) * 3.9 branch tip (3.9.7+) The results were confusing, so I dredged up a copy of pystone to make sure I wasn't missing anything w.r.t. basic execution performance. I'm still confused, so will keep digging. It would also be fun to see David Beezley’s example from his seminal talk:
Thanks, I'll take a look when I get a chance. Might give me the excuse I need to wake up extra early and tag along with Dave on an early morning bike ride. Skip