Improved multi-tasking performance through deterministic GIL hand-off

I don't know how many people will remember the work that David Beazley did about a decade ago on how the GIL impacts multithreading performance. Essentially he instrumented the Python interpreter to log how multiple threads competed for the GIL, and gave several presentations over the space of 2-3 years. A couple of years ago I reached out to him with an idea for significantly improving the way that Python hands off the GIL between threads, but unfortunately he was not interested in pursuing it further. I am raising it here in the hope that someone else will be interested in implementing it.

In essence, my idea is to stop Python handing off the GIL through a competition between the threads that are ready to run, and instead have Python implement a scheduler for the GIL which decides which thread should get the GIL next and hands it over directly.

Background
==========

Here are the links to David Beazley's presentations:

2009: Inside the Python GIL
https://www.youtube.com/watch?v=ph374fJqFPE

2010: Understanding the Python GIL
https://speakerdeck.com/dabeaz/understanding-the-python-gil
https://www.youtube.com/watch?v=Obt-vMVdM8s

2011: Embracing the Global Interpreter Lock
https://speakerdeck.com/dabeaz/embracing-the-global-interpreter-lock
https://www.youtube.com/watch?v=fwzPF2JLoeU

2011: In Search of the Perfect Global Interpreter Lock
https://speakerdeck.com/dabeaz/in-search-of-the-perfect-global-interpreter-l...
https://www.youtube.com/watch?v=5jbG7UKT1l4

The Problem
===========

A currently executing thread can give up the GIL for (essentially) two reasons:

A) It has started a blocking operation (like reading data from disk) and is no longer runnable; or
B) It is using a lot of CPU and has reached the interpreter's switch interval (nothing to do with the O/S scheduler time-slice) but is still runnable.

When the currently executing thread wants to give up the GIL, the current competitive approach essentially signals all runnable threads that the GIL is now available, and the first thread to grab it gets to execute. The problem with this (at least as I understand it) is that every thread except the one that has just given up the GIL is not actually running on a CPU and needs the O/S scheduler to dispatch it before it can run - but the current thread is already running. So if the currently executing thread is still runnable, i.e. it gave up the GIL because it reached the switch interval, it is highly likely to be the first to grab the GIL back again.

This leads to CPU-heavy threads dominating CPU usage and I/O threads being starved out, which is the opposite of what O/S scheduling theory says is optimal: run the I/O threads first to keep the data flowing and everything responsive (they typically use little CPU), and let the CPU-heavy threads mop up the remaining CPU. And even if the currently executing CPU-heavy thread doesn't grab the GIL again, there is no guarantee that the most appropriate of the other threads will grab it instead.

It should also be noted that this competitive approach has a lot of overhead, because every waiting thread wakes up to compete for the GIL.
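To make this concrete, here is a rough, untested Python sketch in the spirit of David's experiments (my illustration, not his actual benchmark code): one CPU-bound thread spins while an "I/O-like" thread measures how long a nominally 1 ms sleep actually takes to return, i.e. how long it waits to win the GIL back after becoming runnable.

    import threading
    import time

    def spin(stop):
        """CPU-bound worker: pure-Python loop that never blocks voluntarily."""
        n = 0
        while not stop.is_set():
            n += 1

    def probe(results, iterations=200):
        """I/O-like worker: sleep briefly (releasing the GIL), then measure
        the round trip - sleep time plus the wait to re-acquire the GIL."""
        for _ in range(iterations):
            t0 = time.perf_counter()
            time.sleep(0.001)
            results.append(time.perf_counter() - t0)

    stop = threading.Event()
    results = []
    cpu = threading.Thread(target=spin, args=(stop,))
    io = threading.Thread(target=probe, args=(results,))
    cpu.start()
    io.start()
    io.join()
    stop.set()
    cpu.join()

    # With prompt hand-off these would all be close to 1 ms.
    print(f"mean wake-up:  {sum(results) / len(results) * 1000:.2f} ms")
    print(f"worst wake-up: {max(results) * 1000:.2f} ms")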
Proposed Solution
=================

My proposed solution is for Python to implement its own (light-weight) scheduling system for the GIL - maintaining a list of threads that are runnable, together with whether they are I/O bound (gave up the GIL because they became non-runnable due to blocking I/O or waiting for a signal) or CPU heavy (gave up the GIL because they reached the switch interval). Ideally we would also add functionality to Python to allow you to specify a thread priority (ideally adjusting the O/S thread priority to match), and we would retain this priority as part of the GIL scheduling.

As far as I can see this proposal is complementary to the current work on sub-interpreters, because there is a big base of existing code which would immediately benefit from these changes without any code changes.

The solution I have in mind is this (a rough sketch follows at the end of this message):

a. A scheduling queue is created with its own mutex. The reason for having a separate mutex is that threads can add themselves to the queue without needing to obtain the GIL.

b. The queue consists of a tree with the following branches:
   i) A list of thread priorities, in decreasing order;
   ii) Within each priority, two subqueues: I/O bound and CPU bound;
   iii) Each subqueue is a FIFO list of runnable threads.

When a thread is created, it is assumed to be I/O bound and is added to the back of the appropriate queue.

When the GIL is given up, if the currently executing thread is still runnable (it reached the switch interval) it adds itself to the back of the appropriate queue. It then takes the first runnable thread in the queue (highest priority; I/O bound if that subqueue is non-empty, otherwise CPU bound) and signals that thread to take the GIL and start executing.

As a thread becomes runnable (because it receives a signal it is waiting for, or because a blocking I/O completes), it adds itself to the back of the appropriate queue.

This is a deterministic solution with low overhead that should ensure that the most appropriate thread gets the GIL. And I believe that it would be compatible with the intended sub-interpreter implementation.

A couple of caveats:

A. I think there may be some other rare cases where a thread that has the GIL is forced to give it up, and the above algorithm might need some tweaks to accommodate them.

B. Although this should be deterministic, I can imagine the possibility that the signalled thread does not wake up and take the GIL (due to bugs or other reasons not imagined). It might therefore be worthwhile having a watchdog timeout, started immediately before sending the GIL transfer signal and cancelled immediately after the signalled thread takes the GIL, which is triggered if for any reason the signalled thread does not take the GIL within a short period.

Because the code for handing off the GIL is quite localised, I believe that the implementation of this might be reasonably straightforward.
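Here is a rough Python model of the queue structure described above (illustrative only - a real implementation would live in C alongside the existing GIL code, and all the names here are made up):

    import threading
    from collections import defaultdict, deque

    IO_BOUND, CPU_BOUND = 0, 1

    class GILScheduler:
        """Model of the proposed queue: priority -> (I/O FIFO, CPU FIFO)."""

        def __init__(self):
            # Separate mutex so threads can enqueue without holding the GIL.
            self._mutex = threading.Lock()
            self._queues = defaultdict(lambda: (deque(), deque()))

        def make_runnable(self, thread_id, kind, priority=0):
            # Called when a thread becomes runnable (blocking I/O completed,
            # signal received), or by the GIL holder for itself when it is
            # preempted at the switch interval.
            with self._mutex:
                self._queues[priority][kind].append(thread_id)

        def pick_next(self):
            # Called by the thread releasing the GIL: highest priority first,
            # and within each priority the I/O-bound subqueue before the
            # CPU-bound one. Each subqueue is strict FIFO.
            with self._mutex:
                for priority in sorted(self._queues, reverse=True):
                    io_q, cpu_q = self._queues[priority]
                    if io_q:
                        return io_q.popleft()
                    if cpu_q:
                        return cpu_q.popleft()
            return None  # nobody is waiting for the GIL

On release, the holder would call make_runnable() for itself (as CPU_BOUND) if it is still runnable, then call pick_next() and signal that one thread's private condition variable directly, rather than broadcasting to every waiter.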

Given this is very old information, I think the first thing needed is to reproduce David's experiments and see if the 3.10 implementation has the same issues. Have you done this already? If you turn those slides into benchmark code, that would make it easier to experiment with. Benchmarks will need running on macOS, Windows and Linux at least.

Barry

It looks like the GIL code has not changed in a long time. But as of 3.7, FORCE_SWITCHING is always defined, which changes the GIL behaviour. This comment in Python/ceval_gil.h explains what it does:

- When a thread releases the GIL and gil_drop_request is set, that thread ensures that another GIL-awaiting thread gets scheduled. It does so by waiting on a condition variable (switch_cond) until the value of last_holder is changed to something else than its own thread state pointer, indicating that another thread was able to take the GIL.

This is meant to prohibit the latency-adverse behaviour on multi-core machines where one thread would speculatively release the GIL, but still run and end up being the first to re-acquire it, making the "timeslices" much longer than expected. (Note: this mechanism is enabled with FORCE_SWITCHING above)

Barry
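A toy Python model of the hand-off that comment describes (illustrative only, not CPython's actual C code): the releasing thread blocks until last_holder records a different thread, guaranteeing a real switch.

    import threading

    class ForcedSwitchGIL:
        """Toy model of the FORCE_SWITCHING behaviour, not the real thing."""

        def __init__(self):
            self._gil = threading.Lock()
            self._switch_cond = threading.Condition()
            self.last_holder = None    # ident of the last thread to take it
            self.drop_request = False  # stands in for gil_drop_request

        def acquire(self):
            self._gil.acquire()
            with self._switch_cond:
                self.last_holder = threading.get_ident()
                self._switch_cond.notify_all()  # wake the waiting releaser

        def release(self):
            me = threading.get_ident()
            self._gil.release()
            if self.drop_request:
                # Don't race to re-acquire: wait until some other thread has
                # become the holder, so the "timeslice" really ends here.
                with self._switch_cond:
                    while self.last_holder == me:
                        self._switch_cond.wait()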

I'm currently learning about the GIL and how it can be removed, so I confess that I understood only half of your words and I may be wrong. As far as I understood from reading your long passage, the problem is that other threads don't get enough of a chance to run and the CPU-bound Python thread will re-acquire the GIL, right? If that's what you're trying to say, Antoine Pitrou already addressed this by implementing the "new GIL", which makes sure that other threads have a chance to run before the releasing thread re-acquires the GIL. However, as I confessed earlier, I may be wrong and have misunderstood what you're trying to say. Please correct me if that's the case.

participants (3): Barry Scott, Shreyan Avigyan, Sophist