Sam> I think the performance difference is because of different versions of NumPy.
Thanks all for the help/input/advice. It never occurred to me that two relatively recent versions of numpy would differ so much for the simple tasks in my script (array creation & transform). I confirmed this by removing 1.21.3 and installing 1.19.4 in my 3.9 build.
I also got a little bit familiar with pyperf, and as a "stretch" goal completely removed random numbers and numpy from my script. (Took me a couple tries to get my array init and transposition correct. Let's just say that it's been awhile. Numpy *was* a nice crutch...) With no trace of numpyleft I now get identical results for single-threaded matrix multiply (a size==10000, b size==20000):
3.9: matmul: Mean +- std dev: 102 ms +- 1 ms nogil: matmul: Mean +- std dev: 103 ms +- 2 ms
and a nice speedup for multi-threaded (a size==30000, b size=60000, nthreads=3):
3.9: matmul_t: Mean +- std dev: 290 ms +- 13 ms nogil: matmul_t: Mean +- std dev: 102 ms +- 3 ms
Sam> I'll update the version of NumPy for "nogil" Python if I have some time this week.
I think it would be sufficient to alert users to the 1.19/1.21 performance differences and recommend they force install 1.19 in non-nogil builds for testing purposes. Hopefully adding a simple note to your README will take less time than porting your changes to numpy 1.21 and adjusting your build configs/scripts.