To speed-up function calls, the interpreter uses a linear, resizable stack to store function call frames, an idea taken from LuaJIT. The stack stores the interpreter registers (local variables + space for temporaries) plus some extra information per-function call. This avoids the need for allocating PyFrameObjects for each call. For compatibility, the PyFrameObject type still exists, but they are created lazily as-needed (such as for exception handling and for sys._getframe).
The optimized function calls have about an order of magnitude less overhead than the current CPython implementation.
The change also simplifies the use of deferred reference counting with the data that is stored per-call like the function object. The interpreter can usually avoid incrementing the reference count of the function object during a call. Like other objects on the stack, a borrowed reference to the function is indicated by setting the least-significant-bit.
Congrats Sam! This is incredible work! One quick question after reading the design doc:
When you mean "an order of magnitude less overhead than the current CPython implementation" do you mean compared with the main branch? We recently implemented already almost everything is listed in this paragraph:
We also pack some extra similar optimizations in this other PR, including stealing the frame arguments from python to python calls:
This could explain why the performance is closer to the current master branch as you indicate:
It gets about the same average performance as the “main” branch of CPython 3.11 as of early September 2021.
Cheers from cloudy London,
Pablo Galindo Salgado