When you say "an order of magnitude less overhead than the current CPython implementation", do you mean compared with the main branch? We have recently implemented almost everything listed in this paragraph.
I think I wrote that in August when "current CPython" meant something different from today :) I'll update it.
Thanks for the links to the PRs. I'll need to look at them more closely, but I think one remaining difference is that
the "nogil" interpreter stays within the same interpreter loop for many Python function calls, while upstream CPython
recursively calls into _PyEval_EvalFrameDefault.
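To make that distinction concrete, here is a toy sketch in Python. It is not code from CPython or from the "nogil" fork: the opcodes, the `PROGRAMS` table, and both function names are made up for illustration. The first evaluator re-enters itself on every call (analogous to recursing into the eval function), while the second pushes a new frame and keeps running in the same loop.

```python
# Toy stack machine. Opcodes and programs are hypothetical, not CPython bytecode.
PROGRAMS = {
    "main": [("PUSH", 21), ("CALL", "double"), ("RETURN", None)],
    "double": [("PUSH", 2), ("MUL", None), ("RETURN", None)],
}


def run_recursive(name, arg=None):
    """Each CALL re-enters the evaluator, so calls nest on the host stack."""
    stack = [] if arg is None else [arg]
    for op, operand in PROGRAMS[name]:
        if op == "PUSH":
            stack.append(operand)
        elif op == "MUL":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        elif op == "CALL":
            # One host-level recursion per interpreted call.
            stack.append(run_recursive(operand, stack.pop()))
        elif op == "RETURN":
            return stack.pop()


def run_single_loop(name):
    """CALL and RETURN only push and pop frames; the loop is never re-entered."""
    # Each frame is [code, program counter, value stack].
    frames = [[PROGRAMS[name], 0, []]]
    while True:
        frame = frames[-1]
        code, pc, stack = frame
        op, operand = code[pc]
        frame[1] = pc + 1
        if op == "PUSH":
            stack.append(operand)
        elif op == "MUL":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        elif op == "CALL":
            # Pass one argument by seeding the callee's value stack.
            frames.append([PROGRAMS[operand], 0, [stack.pop()]])
        elif op == "RETURN":
            result = stack.pop()
            frames.pop()
            if not frames:
                return result
            frames[-1][2].append(result)
```

Both evaluators compute the same result for `"main"`; the difference is only in whether an interpreted call costs a host-level function call.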
I've been using this mini-benchmark to measure the overhead of Python function calls for various numbers of
arguments and keywords:
For zero-, two-, and four-argument functions, I get:
nogil (nogil/fb6aabed): 10ns, 14ns, 18ns
3.11 (main/b108db63): 47ns, 54ns, 63ns
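For anyone who wants to try a similar measurement, here is a rough sketch using only the standard-library `timeit` module. It is not the mini-benchmark referred to above: the names `f0`, `f2`, `f4`, and `per_call_ns` are assumptions, and the figures it reports include `timeit`'s own loop overhead, so they will not match the numbers above exactly.

```python
import timeit


# Hypothetical empty functions taking zero, two, and four positional arguments.
def f0():
    pass


def f2(a, b):
    pass


def f4(a, b, c, d):
    pass


def per_call_ns(stmt, number=1_000_000, repeat=5):
    """Return the best observed time per execution of stmt, in nanoseconds."""
    timings = timeit.repeat(stmt, globals=globals(), number=number, repeat=repeat)
    # Taking the minimum reduces noise from other activity on the machine.
    return min(timings) / number * 1e9


if __name__ == "__main__":
    for stmt in ("f0()", "f2(1, 2)", "f4(1, 2, 3, 4)"):
        print(f"{stmt}: {per_call_ns(stmt):.1f} ns per call")
```

Running the same script under both interpreters gives a like-for-like comparison, as long as the builds use comparable optimization settings.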