Is it also slower when running with PYTHONGIL=1? If single-threaded code could be made to run at the same speed in GIL-enabled mode, that might be an easier intermediate target while still adding value.
Running with PYTHONGIL=1 is a bit less than 1% faster (on pyperformance) than with PYTHONGIL=0. It might be possible to improve PYTHONGIL=1 performance by another 1-2% by adding a runtime check for whether the GIL is enabled and skipping the per-object locking of dicts and lists during mutations when it is. I think further optimizations specific to the PYTHONGIL=1 use case would be tricky.
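
To make the runtime-check idea concrete, here is a rough pure-Python analogy. It is only a sketch: the real change would live in the interpreter's C code, and `GuardedList`, `gil_enabled`, and `lock_acquisitions` are hypothetical names made up for illustration. It assumes `sys._is_gil_enabled()` is available (CPython 3.13+) and falls back to "GIL is on" where it is not.

```python
import sys
import threading


def gil_enabled() -> bool:
    """Report whether the GIL is active in this process."""
    # sys._is_gil_enabled() exists on CPython 3.13+. On interpreters without
    # it, assume a traditional GIL build (an assumption of this sketch).
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


class GuardedList:
    """Hypothetical list wrapper; not how CPython implements list locking."""

    def __init__(self, use_lock=None):
        self._items = []
        self._lock = threading.Lock()
        # Decide once whether mutations need the per-object lock.
        # With the GIL enabled, the lock is redundant and can be skipped.
        self._use_lock = (not gil_enabled()) if use_lock is None else use_lock
        self.lock_acquisitions = 0  # counter kept only for demonstration

    def append(self, value):
        if self._use_lock:
            # No GIL: take the per-object lock around the mutation.
            with self._lock:
                self.lock_acquisitions += 1
                self._items.append(value)
        else:
            # GIL enabled: mutate directly, avoiding the lock overhead.
            self._items.append(value)


if __name__ == "__main__":
    items = GuardedList()
    items.append("x")
    print("GIL enabled:", gil_enabled())
    print("lock acquisitions:", items.lock_acquisitions)
```

The point is that the check itself is cheap, but it only removes the locking cost; other overheads would remain, which fits the estimate of a modest 1-2% gain.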