
Thanks for your input. I now see how things evolved to the present state. In the context of PEP 451, my proposal would have been to move all default module creation tasks to ModuleType.tp_new (taking an optional spec parameter), making separate create and exec steps unnecessary. Too late, I guess.
Once the question is narrowed down to "How can an extension module fully support subinterpreters and multiple Py_Initialize/Finalize cycles without incurring PEP 3121's performance overhead?" then the short answer becomes "We don't know, but ideas for that are certainly welcome, either here or over on import-sig".
I mentioned a way to avoid the state access overhead in my first post. It is independent of the module loading mechanism:

1) Define a new "calling convention" flag, such as METH_GLOBALS.
2) Store a module reference in PyCFunctionObject.m_module (currently it stores only the module name).
3) Pass the module reference as an extra argument to methods with the METH_GLOBALS flag.
4) PyModule_State, reimplemented as a macro, would amount to one indirection from the passed parameter.

I suspect that most C ABIs allow passing the extra argument unconditionally (this is certainly the case for x86 and x64 on Windows and Linux). That would mean METH_GLOBALS doesn't increase the actual number of possible dispatch targets in PyCFunction_Call and doesn't impact Python-to-C call performance at all.
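
To make the four steps concrete, here is a rough standalone sketch in plain C. It does not use Python.h: Module, ModuleState, CFunction, MODULE_STATE, call_function and bump are all hypothetical stand-ins I made up for illustration, and METH_GLOBALS with its value is the proposed flag, not an existing one. The sketch dispatches with an explicit branch, because calling a function through a pointer type with a different parameter list is undefined behavior in ISO C, even if common ABIs tolerate the extra argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CPython structures; not the real API. */
typedef struct { long counter; } ModuleState;       /* per-module "globals" */
typedef struct { ModuleState *md_state; } Module;   /* stand-in for a module object */

#define METH_VARARGS 0x0001
#define METH_GLOBALS 0x1000   /* step 1: proposed flag, value is an assumption */

/* Step 4: state access is a single indirection from the passed module ref. */
#define MODULE_STATE(mod) ((mod)->md_state)

typedef void (*generic_fn)(void);                                 /* storage type */
typedef long (*plain_fn)(void *self, long arg);                   /* existing style */
typedef long (*globals_fn)(void *self, long arg, Module *mod);    /* with module ref */

typedef struct {
    int flags;
    generic_fn meth;
    Module *m_module;   /* step 2: a module reference rather than a module name */
} CFunction;

/* Step 3: pass the module ref as an extra argument when the flag is set.
   The explicit branch is the portable form of the dispatch. */
static long call_function(const CFunction *f, void *self, long arg)
{
    assert(f != NULL && f->meth != NULL);
    if (f->flags & METH_GLOBALS)
        return ((globals_fn)f->meth)(self, arg, f->m_module);
    return ((plain_fn)f->meth)(self, arg);
}

/* Example method using per-module state without any module lookup. */
static long bump(void *self, long arg, Module *mod)
{
    ModuleState *st = MODULE_STATE(mod);
    (void)self;
    st->counter += arg;
    return st->counter;
}

/* Example method in the existing style, with no access to module state. */
static long twice(void *self, long arg)
{
    (void)self;
    return 2 * arg;
}
```

A caller would fill in a CFunction with METH_GLOBALS, a pointer to bump cast to generic_fn, and the owning Module, then go through call_function. Two modules (or two subinterpreters) holding separate Module instances would each see their own counter.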