Skip> 1. I use numpy arrays filled with random values, and the output
Skip> array is also a numpy array. The vector multiplication is done in
Skip> a simple for loop in my vecmul() function.

CHB> probably doesn't make a difference for this exercise, but numpy
CHB> arrays make lousy replacements for a regular list ...

Yeah, I don't think it should matter here. Both versions should be
similarly penalized.

Skip> The results were confusing, so I dredged up a copy of pystone to
Skip> make sure I wasn't missing anything w.r.t. basic execution
Skip> performance. I'm still confused, so will keep digging.

CHB> I'll be interested to see what you find out :-)

I'm still scratching my head. I was thinking there was something about
the messaging between the main and worker threads, so I tweaked matmul.py
to accept 0 as the number of threads. In that case it calls matmul(),
which calls vecmul() directly, with no queues or worker threads involved.
The original queue-using versions were simply renamed to matmul_t and
vecmul_t. I am still confused.
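For anyone who doesn't want to open the gist just yet, here is roughly the
shape of the no-thread path. This is a simplified sketch, not the actual
code -- the real script also takes a second command-line argument (the
100000 in the timing runs below), which I'm not modeling here, and it has
the queue-based matmul_t/vecmul_t variants, which I've left out:

    import sys
    import numpy as np

    def vecmul(row, col):
        # Simple for-loop "dot product" of one row of a and one column of b.
        result = 0.0
        for i in range(len(row)):
            result += row[i] * col[i]
        return result

    def matmul(a, b):
        # Multiply every row of a by every column of b, single-threaded.
        nrows, ncols = a.shape[0], b.shape[1]
        out = np.zeros((nrows, ncols))
        for i in range(nrows):
            for j in range(ncols):
                out[i, j] = vecmul(a[i, :], b[:, j])
        return out

    if __name__ == "__main__":
        nthreads = int(sys.argv[1]) if len(sys.argv) > 1 else 0
        a = np.random.random((160, 625))
        b = np.random.random((625, 320))
        if nthreads == 0:
            result = matmul(a, b)       # direct call, no queues, no threads
        else:
            # the real script dispatches to the queue-based matmul_t() here
            raise SystemExit("threaded path omitted from this sketch")
        print("a:", a.shape, "b:", b.shape, "result:", result.shape)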
Here are the pystone numbers, nogil first, then the 3.9 git tip:

(base) nogil_build% ./bin/python3 ~/cmd/pystone.py
Pystone(1.1.1) time for 50000 passes = 0.137658
This machine benchmarks at 363218 pystones/second

(base) 3.9_build% ./bin/python3 ~/cmd/pystone.py
Pystone(1.1.1) time for 50000 passes = 0.207102
This machine benchmarks at 241427 pystones/second

That suggests nogil is indeed a definite improvement over vanilla 3.9.
However, here's a quick nogil vs. 3.9 timing run of my matrix
multiplication, again nogil followed by 3.9 tip:

(base) nogil_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real    0m9.314s
user    0m9.302s
sys     0m0.012s

(base) 3.9_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real    0m4.918s
user    0m5.180s
sys     0m0.380s

What's up with that? Suddenly nogil is much slower than 3.9 tip. No
threads are in use. I thought perhaps the nogil run somehow didn't use
Sam's VM improvements, so I disassembled the two versions of vecmul. I
won't bore you with the entire dis.dis output, but suffice it to say that
Sam's instruction set appears to be in play:

(base) nogil_build% PYTHONPATH=$HOME/tmp ./bin/python3
Python 3.9.0a4+ (heads/nogil:b0ee2c4740, Oct 30 2021, 16:23:03)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import matmul, dis
>>> dis.dis(matmul.vecmul)
 26           0 FUNC_HEADER             11 (11)

 28           2 LOAD_CONST               2 (0.0)
              4 STORE_FAST               2 (result)

 29           6 LOAD_GLOBAL              3 254 ('len'; 254)
              9 STORE_FAST               8 (.t3)
             11 COPY                     9 0 (.t4 <- a)
             14 CALL_FUNCTION            9 1 (.t4 to .t5)
             18 STORE_FAST               5 (.t0)
...

So I unboxed the two numpy arrays once and used lists of lists for the
actual work.
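To be concrete about what I mean by "unboxed": the change was along these
lines -- convert each numpy array to plain nested lists once, up front, so
the inner loop works on ordinary Python floats instead of creating a boxed
numpy scalar on every element access. A simplified sketch (the gist does it
slightly differently):

    import numpy as np

    def vecmul(row, col):
        # same simple for-loop dot product as in the sketch above
        result = 0.0
        for i in range(len(row)):
            result += row[i] * col[i]
        return result

    a = np.random.random((160, 625))
    b = np.random.random((625, 320))

    # One-time conversion: tolist() yields nested lists of plain Python
    # floats, so no numpy scalar objects are created inside the loops.
    a_rows = a.tolist()      # 160 rows, each a list of 625 floats
    b_cols = b.T.tolist()    # 320 columns, each a list of 625 floats

    result = [[vecmul(row, col) for col in b_cols] for row in a_rows]
    print(len(result), len(result[0]))    # 160 320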
The nogil version still performs worse by about a factor of two:

(base) nogil_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real    0m9.537s
user    0m9.525s
sys     0m0.012s

(base) 3.9_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real    0m4.836s
user    0m5.109s
sys     0m0.365s

Still scratching my head and am open to suggestions about what to try
next. If anyone is playing along from home, I've updated my script:

https://gist.github.com/smontanaro/80f788a506d2f41156dae779562fd08d

I'm sure there are things I could have done more efficiently, but I would
think both Python versions would be similarly penalized by dumb s**t I've
done.

Skip