Skip> 1. I use numpy arrays filled with random values, and the output array is also a numpy array. The vector multiplication is done in a simple for loop in my vecmul() function.

CHB> probably doesn't make a difference for this exercise, but numpy arrays make lousy replacements for a regular list ...

Yeah, I don't think it should matter here. Both versions should be similarly penalized.
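
For anyone who hasn't looked at the script, a loop-based vecmul() of the kind described above might look roughly like this. It is a minimal sketch, not a copy of the script: the function body is an assumption, based only on the "simple for loop" description.

```python
def vecmul(a, b):
    """Dot product of two equal-length sequences, one element at a time.

    Assumed shape of the function; the real script may differ in detail.
    """
    result = 0.0
    for i in range(len(a)):
        # Each a[i] / b[i] on a numpy array creates a fresh numpy scalar,
        # which is why numpy arrays make poor stand-ins for plain lists here.
        result += a[i] * b[i]
    return result


print(vecmul([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

The same function works unchanged on lists and on one-dimensional numpy arrays, so either interpreter pays the same indexing overhead.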

Skip> The results were confusing, so I dredged up a copy of pystone to make sure I wasn't missing anything w.r.t. basic execution performance. I'm still confused, so will keep digging.

CHB> I'll be interested to see what you find out :-)

I'm still scratching my head. I suspected the messaging between the main and worker threads might be the problem, so I tweaked the script to accept 0 as the number of threads. In that case it calls matmul, which calls vecmul directly. The original queue-using versions were simply renamed to matmul_t and vecmul_t.
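
The split between the direct and queue-based paths could be sketched as below. Only the names matmul, vecmul, matmul_t and vecmul_t come from the script; the multiply() dispatcher, the queue layout and the None sentinel are assumptions made for illustration.

```python
import queue
import threading


def vecmul(a, b):
    """Dot product in a plain for loop."""
    result = 0.0
    for i in range(len(a)):
        result += a[i] * b[i]
    return result


def matmul(a, b_cols):
    """Zero-thread path: call vecmul directly, no queues involved."""
    return [[vecmul(row, col) for col in b_cols] for row in a]


def vecmul_t(a, b_cols, tasks, results):
    """Worker: take row indices from a queue until a None sentinel arrives."""
    while True:
        i = tasks.get()
        if i is None:
            break
        results.put((i, [vecmul(a[i], col) for col in b_cols]))


def matmul_t(a, b_cols, nthreads):
    """Threaded path: hand rows to workers and collect results by index."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=vecmul_t, args=(a, b_cols, tasks, results))
        for _ in range(nthreads)
    ]
    for w in workers:
        w.start()
    for i in range(len(a)):
        tasks.put(i)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    out = [None] * len(a)
    for _ in range(len(a)):
        i, row = results.get()
        out[i] = row
    return out


def multiply(a, b, nthreads=0):
    """Hypothetical dispatcher: 0 threads means the direct path."""
    b_cols = [list(col) for col in zip(*b)]  # columns of b as lists
    if nthreads == 0:
        return matmul(a, b_cols)
    return matmul_t(a, b_cols, nthreads)
```

With nthreads=0 nothing touches queue or threading at run time, so any remaining slowdown cannot come from the messaging.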

I am still confused. Here are the pystone numbers, nogil first, then the 3.9 git tip:

(base) nogil_build% ./bin/python3 ~/cmd/
Pystone(1.1.1) time for 50000 passes = 0.137658
This machine benchmarks at 363218 pystones/second

(base) 3.9_build% ./bin/python3 ~/cmd/
Pystone(1.1.1) time for 50000 passes = 0.207102
This machine benchmarks at 241427 pystones/second

That suggests nogil is indeed a definite improvement over vanilla 3.9. However, here's a quick nogil vs. 3.9 timing run of my matrix multiplication (again, nogil first, then 3.9 tip):

(base) nogil_build% time ./bin/python3 ~/tmp/ 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m9.314s
user 0m9.302s
sys 0m0.012s

(base) 3.9_build% time ./bin/python3 ~/tmp/ 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m4.918s
user 0m5.180s
sys 0m0.380s

What's up with that? Suddenly nogil is much slower than 3.9 tip. No threads are in use. I thought perhaps the nogil run somehow didn't use Sam's VM improvements, so I disassembled the two versions of vecmul. I won't bore you with the entire dis.dis output, but suffice it to say that Sam's instruction set appears to be in play:

(base) nogil_build% PYTHONPATH=$HOME/tmp ./bin/python3/python3
Python 3.9.0a4+ (heads/nogil:b0ee2c4740, Oct 30 2021, 16:23:03)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import matmul, dis
>>> dis.dis(matmul.vecmul)
 26           0 FUNC_HEADER             11 (11)

 28           2 LOAD_CONST               2 (0.0)
              4 STORE_FAST               2 (result)

 29           6 LOAD_GLOBAL          3 254 ('len'; 254)
              9 STORE_FAST               8 (.t3)
             11 COPY                   9 0 (.t4 <- a)
             14 CALL_FUNCTION          9 1 (.t4 to .t5)
             18 STORE_FAST               5 (.t0)

So I unboxed the two numpy arrays once and used lists of lists for the actual work. The nogil version still performs worse by about a factor of two:

(base) nogil_build% time ./bin/python3 ~/tmp/ 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m9.537s
user 0m9.525s
sys 0m0.012s

(base) 3.9_build% time ./bin/python3 ~/tmp/ 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m4.836s
user 0m5.109s
sys 0m0.365s
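
The unboxing step mentioned above can be sketched like this. The helper name unbox() is hypothetical; it relies on the tolist() method that numpy arrays provide and falls back to copying row by row for anything else.

```python
def unbox(m):
    """Convert a matrix to a plain list of lists, once, up front.

    numpy.ndarray.tolist() returns nested lists of Python floats, so the
    inner loops afterwards never touch numpy scalars. Objects without a
    tolist() method are copied row by row instead.
    """
    tolist = getattr(m, "tolist", None)
    if callable(tolist):
        return tolist()
    return [list(row) for row in m]


# Usage: convert both operands before the timed loops start.
a = unbox(((1.0, 2.0), (3.0, 4.0)))
print(a)  # [[1.0, 2.0], [3.0, 4.0]]
```

Doing the conversion once keeps its cost out of the inner loop, so the timings compare list indexing and float arithmetic on both interpreters.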

I'm still scratching my head and am open to suggestions about what to try next. If anyone is playing along at home, I've updated my script:

I'm sure there are things I could have done more efficiently, but I would think both Python versions would be similarly penalized by dumb s**t I've done.