Did you try running  the same code with stock Python?

One reason I ask is the IIUC, you are using numpy for the individual  vector operations, and numpy already releases the GIL in some circumstances.

I had not run the same code with stock Python (but see below). Also, I only used numpy for two bits:

1. I use numpy arrays filled with random values, and the output array is also a numpy array. The vector multiplication is done in a simple for loop in my vecmul() function.

2. Early on I compared my results with the result of numpy.matmul just to make sure I had things right.

That said, I have now run my example code using both PYTHONGIL=0 and PYTHONGIL=1 of Sam's nogil branch as well as the following other Python3 versions:

* Conda Python3 (3.9.7)
* /usr/bin/python3 (3.9.1 in my case)
* 3.9 branch tip (3.9.7+)

The results were confusing, so I dredged up a copy of pystone to make sure I wasn't missing anything w.r.t. basic execution performance. I'm still confused, so will keep digging.

It would also be fun to see David Beezley’s example from his seminal talk:

https://youtu.be/ph374fJqFPE

Thanks, I'll take a look when I get a chance. Might give me the excuse I need to wake up extra early and tag along with Dave on an early morning bike ride.

Skip