As an experiment, I tried moving the thread state (what you get from _PyThreadState_GET()) to thread-local storage (TLS).
It works, passing all the tests, and seems sound.
It is a small patch (< 50 lines) and doesn't increase the overall code size.
My branch is GCC/Clang only, so it will need a bit of extra code for Windows. That should only take a few more lines; I haven't done it because I don't have a Windows machine to test on.
This is a *much* cleaner approach to removing the global variable than adding lots of extra parameters all over the place.
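
To illustrate the idea, here is a minimal sketch. It is not the actual patch, and every name in it (MY_THREAD_LOCAL, MyThreadState, current_tstate, my_tstate_get, my_tstate_swap) is hypothetical. It assumes the compiler keywords `__thread` for GCC/Clang and `__declspec(thread)` for MSVC, which is roughly what the extra Windows code would have to select:

```c
#include <stddef.h>

/* Pick the compiler-specific thread-local storage keyword.
 * The macro name is made up for this sketch. */
#if defined(_MSC_VER)
#  define MY_THREAD_LOCAL __declspec(thread)
#elif defined(__GNUC__) || defined(__clang__)
#  define MY_THREAD_LOCAL __thread
#else
#  error "no thread-local storage keyword known for this compiler"
#endif

/* Hypothetical stand-in for the interpreter's per-thread state. */
typedef struct {
    int id;
} MyThreadState;

/* One pointer per thread, instead of a single process-wide global. */
static MY_THREAD_LOCAL MyThreadState *current_tstate = NULL;

/* Analogue of a "get the current thread state" accessor:
 * no extra parameter is needed, each thread sees its own value. */
static inline MyThreadState *
my_tstate_get(void)
{
    return current_tstate;
}

/* Install a new state for the calling thread and return the old one. */
static inline MyThreadState *
my_tstate_swap(MyThreadState *new_state)
{
    MyThreadState *old = current_tstate;
    current_tstate = new_state;
    return old;
}
```

With something like this, callers keep reading the state through the accessor, so no function signatures have to change.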