When you say "an order of magnitude less overhead than the current CPython implementation", do you mean compared with the main branch? We recently implemented almost everything listed in this paragraph:
We also pack some similar extra optimizations into this other PR, including stealing the frame arguments in Python-to-Python calls:
This could explain why the performance is closer to the current master branch, as you indicate:
Does this mean that if we remove the GIL and add the 3.11 improvements, we should get some additional speedup?
(Or are those already integrated in the PoC?)