Improved multi-tasking performance through deterministic GIL hand-off

I don't know how many people will remember the work that David Beazley did about a decade ago on how the GIL impacts multithreading performance. Essentially he instrumented the Python interpreter to log how multiple threads competed for the GIL, and gave several presentations over the space of 2-3 years. A couple of years ago I reached out to him with an idea for significantly improving the way that Python hands off the GIL between threads, but unfortunately he was not interested in pursuing it further. I am raising it here in the hope that someone else will be interested in implementing it.

In essence, my idea is to stop Python handing off the GIL through a competition between the threads that are ready to run, and instead have Python implement a scheduler for the GIL which decides which thread should get the GIL next and hands it over directly.

Background
==========

Here are the links to David Beazley's presentations:

2009: Inside the Python GIL
https://www.youtube.com/watch?v=ph374fJqFPE

2010: Understanding the Python GIL
https://speakerdeck.com/dabeaz/understanding-the-python-gil
https://www.youtube.com/watch?v=Obt-vMVdM8s

2011: Embracing the Global Interpreter Lock
https://speakerdeck.com/dabeaz/embracing-the-global-interpreter-lock
https://www.youtube.com/watch?v=fwzPF2JLoeU

2011: In Search of the Perfect Global Interpreter Lock
https://speakerdeck.com/dabeaz/in-search-of-the-perfect-global-interpreter-l...
https://www.youtube.com/watch?v=5jbG7UKT1l4

The Problem
===========

A currently executing thread can give up the GIL for (essentially) two reasons:

A) It has started a blocking operation (like reading data from disk) and is no longer runnable; or
B) It is using a lot of CPU and has reached the interpreter's switch interval (nothing to do with the O/S scheduler time-slice) but is still runnable.

When the currently executing thread wants to give up the GIL, the current competitive approach essentially signals all runnable threads that the GIL is now available, and the first thread to grab it gets to execute. The problem with this (at least as I understand it) is that every thread except the one that has just given up the GIL is not actually running on a CPU and needs the O/S scheduler to dispatch it before it can run - but the current thread is already running. So if the currently executing thread is still runnable, i.e. it gave up the GIL because it reached the switch interval, it is highly likely to be the first to grab the GIL back again.

This leads to CPU-heavy threads dominating CPU usage and I/O threads being starved out, which is the opposite of what O/S scheduling theory says is optimal: run the I/O threads first to keep the data flowing and everything responsive (they typically use little CPU), and let the CPU-heavy threads mop up the remaining CPU. And even if the currently executing CPU-heavy thread doesn't grab the GIL again, there is no guarantee that the most appropriate of the other threads will grab it instead.

It should also be noted that this competitive approach has a lot of overhead, because every waiting thread wakes up to compete for the GIL.
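To make this concrete, here is a rough, untested Python sketch in the spirit of David's experiments (my illustration, not his actual benchmark code): one CPU-bound thread spins while an "I/O-like" thread measures how long a nominally 1 ms sleep actually takes to return, i.e. how long it waits to win the GIL back after becoming runnable.

    import threading
    import time

    def spin(stop):
        """CPU-bound worker: pure-Python loop that never blocks voluntarily."""
        n = 0
        while not stop.is_set():
            n += 1

    def probe(results, iterations=200):
        """I/O-like worker: sleep briefly (releasing the GIL), then measure
        the round trip - sleep time plus the wait to re-acquire the GIL."""
        for _ in range(iterations):
            t0 = time.perf_counter()
            time.sleep(0.001)
            results.append(time.perf_counter() - t0)

    stop = threading.Event()
    results = []
    cpu = threading.Thread(target=spin, args=(stop,))
    io = threading.Thread(target=probe, args=(results,))
    cpu.start()
    io.start()
    io.join()
    stop.set()
    cpu.join()

    # With prompt hand-off these would all be close to 1 ms.
    print(f"mean wake-up:  {sum(results) / len(results) * 1000:.2f} ms")
    print(f"worst wake-up: {max(results) * 1000:.2f} ms")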
Proposed Solution
=================

My proposed solution is for Python to implement its own (light-weight) scheduling system for the GIL - maintaining a list of threads that are runnable, together with whether they are I/O bound (gave up the GIL because they became non-runnable due to blocking I/O or waiting for a signal) or CPU heavy (gave up the GIL because they reached the switch interval). Ideally we would also add functionality to Python to allow you to specify a thread priority (ideally adjusting the O/S thread priority to match), and we would retain this priority as part of the GIL scheduling.

As far as I can see this proposal is complementary to the current work on sub-interpreters, because there is a big base of existing code which would immediately benefit from these changes without any code changes.

The solution I have in mind is this (a rough sketch follows at the end of this message):

a. A scheduling queue is created with its own mutex. The reason for having a separate mutex is that threads can add themselves to the queue without needing to obtain the GIL.

b. The queue consists of a tree with the following branches:
   i) A list of thread priorities, in decreasing order;
   ii) Within each priority, two subqueues: I/O bound and CPU bound;
   iii) Each subqueue is a FIFO list of runnable threads.

When a thread is created, it is assumed to be I/O bound and is added to the back of the appropriate queue.

When the GIL is given up, if the currently executing thread is still runnable (it reached the switch interval) it adds itself to the back of the appropriate queue. It then takes the first runnable thread in the queue (highest priority; I/O bound if that subqueue is non-empty, otherwise CPU bound) and signals that thread to take the GIL and start executing.

As a thread becomes runnable (because it receives a signal it is waiting for, or because a blocking I/O completes), it adds itself to the back of the appropriate queue.

This is a deterministic solution with low overhead that should ensure that the most appropriate thread gets the GIL. And I believe that it would be compatible with the intended sub-interpreter implementation.

A couple of caveats:

A. I think there may be some other rare cases where a thread that has the GIL is forced to give it up, and the above algorithm might need some tweaks to accommodate them.

B. Although this should be deterministic, I can imagine the possibility that the signalled thread does not wake up and take the GIL (due to bugs or other reasons not imagined). It might therefore be worthwhile having a watchdog timeout, started immediately before sending the GIL transfer signal and cancelled immediately after the signalled thread takes the GIL, which is triggered if for any reason the signalled thread does not take the GIL within a short period.

Because the code for handing off the GIL is quite localised, I believe that the implementation of this might be reasonably straightforward.
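Here is a rough Python model of the queue structure described above (illustrative only - a real implementation would live in C alongside the existing GIL code, and all the names here are made up):

    import threading
    from collections import defaultdict, deque

    IO_BOUND, CPU_BOUND = 0, 1

    class GILScheduler:
        """Model of the proposed queue: priority -> (I/O FIFO, CPU FIFO)."""

        def __init__(self):
            # Separate mutex so threads can enqueue without holding the GIL.
            self._mutex = threading.Lock()
            self._queues = defaultdict(lambda: (deque(), deque()))

        def make_runnable(self, thread_id, kind, priority=0):
            # Called when a thread becomes runnable (blocking I/O completed,
            # signal received), or by the GIL holder for itself when it is
            # preempted at the switch interval.
            with self._mutex:
                self._queues[priority][kind].append(thread_id)

        def pick_next(self):
            # Called by the thread releasing the GIL: highest priority first,
            # and within each priority the I/O-bound subqueue before the
            # CPU-bound one. Each subqueue is strict FIFO.
            with self._mutex:
                for priority in sorted(self._queues, reverse=True):
                    io_q, cpu_q = self._queues[priority]
                    if io_q:
                        return io_q.popleft()
                    if cpu_q:
                        return cpu_q.popleft()
            return None  # nobody is waiting for the GIL

On release, the holder would call make_runnable() for itself (as CPU_BOUND) if it is still runnable, then call pick_next() and signal that one thread's private condition variable directly, rather than broadcasting to every waiter.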

Given this is very old information, I think the first thing needed is to reproduce David's experiments and see if the 3.10 implementation has the same issues. Have you done this already? If you turn those slides into benchmark code, that would make it easier to experiment with. Benchmarks will need running on macOS, Windows and Linux at least.

Barry

It looks like the GIL code has not changed in a long time. But as of 3.7, FORCE_SWITCHING is always defined, which changes the GIL behaviour. This comment in Python/ceval_gil.h explains what it does:

- When a thread releases the GIL and gil_drop_request is set, that thread ensures that another GIL-awaiting thread gets scheduled. It does so by waiting on a condition variable (switch_cond) until the value of last_holder is changed to something else than its own thread state pointer, indicating that another thread was able to take the GIL.

This is meant to prohibit the latency-adverse behaviour on multi-core machines where one thread would speculatively release the GIL, but still run and end up being the first to re-acquire it, making the "timeslices" much longer than expected. (Note: this mechanism is enabled with FORCE_SWITCHING above)

Barry
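A toy Python model of the hand-off that comment describes (illustrative only, not CPython's actual C code): the releasing thread blocks until last_holder records a different thread, guaranteeing a real switch.

    import threading

    class ForcedSwitchGIL:
        """Toy model of the FORCE_SWITCHING behaviour, not the real thing."""

        def __init__(self):
            self._gil = threading.Lock()
            self._switch_cond = threading.Condition()
            self.last_holder = None    # ident of the last thread to take it
            self.drop_request = False  # stands in for gil_drop_request

        def acquire(self):
            self._gil.acquire()
            with self._switch_cond:
                self.last_holder = threading.get_ident()
                self._switch_cond.notify_all()  # wake the waiting releaser

        def release(self):
            me = threading.get_ident()
            self._gil.release()
            if self.drop_request:
                # Don't race to re-acquire: wait until some other thread has
                # become the holder, so the "timeslice" really ends here.
                with self._switch_cond:
                    while self.last_holder == me:
                        self._switch_cond.wait()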

I'm currently learning about the GIL and how it can be removed, so I confess that I understood only half of your words and I may be wrong. As far as I understood from reading your long passage, the problem is that other threads don't get enough of a chance to run and the CPU-bound Python thread will re-acquire the GIL, right? If that's what you're trying to say, Antoine Pitrou already addressed this by implementing the "new GIL", which makes sure that other threads have a chance to run before the releasing thread re-acquires the GIL. However, as I confessed earlier, I may be wrong and have misunderstood what you're trying to say. Please correct me if that's the case.

participants (3): Barry Scott, Shreyan Avigyan, Sophist