[New-bugs-announce] [issue45274] Race condition in Thread._wait_for_tstate_lock()

STINNER Victor report at bugs.python.org
Thu Sep 23 16:13:09 EDT 2021


New submission from STINNER Victor <vstinner at python.org>:

Bernát Gábor found an interesting bug on Windows. Sometimes, when a process is interrupted on Windows with CTRL+C, its parent process hangs in thread.join():
https://twitter.com/gjbernat/status/1440949682759426050

Reproducer:

* Install https://github.com/gaborbernat/tox/tree/2159
* Make an empty folder and put the above tox.ini in it
* Invoke python -m tox and once that command is sleeping press CTRL+C (the app should lock there indefinitely).

tox.ini:
---

[testenv]
skip_install = true
commands = python -c 'import os; print(f"start {os.getpid()}"); import time; time.sleep(100); print("end");'
---

Source: https://gist.github.com/gaborbernat/f1e1aff0f2ee514b50a92a4d019d4d13

I tried to attach the Python process in Python: there is a single thread, the main thread which is blocked in thread.join(). You can also see it in the faulthandler traceback.

I did a long analysis of the _tstate_lock and I checked that thread really completed. Raw debug traces:

== thread 6200 exit ==
 
thread_run[pid=3984, thread_id=6200]: clear
PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete()
release_sentinel[pid=3984, thread_id=6200]: enter
release_sentinel[pid=3984, thread_id=6200]: release(obj=000001C1122669C0, lock=000001C110BBDA00)
release_sentinel[pid=3984, thread_id=6200]: exit
PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete()--
 
== main thread is calling join() but gets a KeyboardInterrupt ==
 
[pid=3984, thread_id=8000] Lock<obj=000001C1122669C0>.acquire() -> ACQUIRED
Current thread 0x00001f40 (most recent call first):
  File "C:\vstinner\python\3.10\lib\threading.py", line 1118 in _wait_for_tstate_lock
  File "C:\vstinner\python\3.10\lib\threading.py", line 1092 in join
  File "C:\vstinner\env\lib\site-packages\tox\session\cmthread_run[pid=3984, thread_id=6200]: exit
d\run\common.py", line 203 in execute
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20 in run_sequential
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104 in legacy
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49 in main
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23 in run
  File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4 in <module>
  File "C:\vstinner\python\3.10\lib\runpy.py", line 86 in _run_code
  File "C:\vstinner\python\3.10\lib\runpy.py", line 196 in _run_module_as_main
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: EXC: KeyboardInterrupt(); acquired? None
 
== main thread calls repr(thread) ==
 
ROOT: [3984] KeyboardInterrupt - teardown started
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=False, timeout=-1): lock obj= 0x1c1122669c0
  File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4, in <module>
    run()
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run
    result = main(sys.argv[1:] if args is None else args)
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main
    result = handler(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy
    return run_sequential(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential
    return execute(state, max_workers=1, has_spinner=False, live=True)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 217, in execute
    print(f'join {thread}')
  File "C:\vstinner\python\3.10\lib\threading.py", line 901, in __repr__
    self.is_alive() # easy way to get ._is_stopped set when appropriate
  File "C:\vstinner\python\3.10\lib\threading.py", line 1181, in is_alive
    self._wait_for_tstate_lock(False)
  File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock
    traceback.print_stack()
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: failed to acquire 0x1c1122669c0
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: exit 0x1c1122669c0
join <Thread(tox-interrupt, started 6200)>
 
== main thread calls thread.join()... which hangs ==
 
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=True, timeout=-1): lock obj= 0x1c1122669c0
  File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4, in <module>
    run()
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run
    result = main(sys.argv[1:] if args is None else args)
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main
    result = handler(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy
    return run_sequential(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential
    return execute(state, max_workers=1, has_spinner=False, live=True)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 218, in execute
    thread.join()
  File "C:\vstinner\python\3.10\lib\threading.py", line 1092, in join
    self._wait_for_tstate_lock()
  File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock
    traceback.print_stack()


Explanation:

* Context: The main thread is waiting on thread.join()
* The thread 6200 completes: release_sentinel() is called to to release _tstate_lock
* The main thread succeed to acquire _tstate_lock (of the thread 6200) since it was just release
* Ooops oops oops, the main thread gets KeyboardInterrupt in Thread._wait_for_tstate_lock() before being able to release the lock. As if the function if interrupted here:

    def _wait_for_tstate_lock(self, block=True, timeout=-1):
        lock = self._tstate_lock
        if lock is None:
            assert self._is_stopped
        elif lock.acquire(block, timeout):
            # -- got KeyboardInterrupt here ---
            lock.release()
            self._stop()

* (tox does something in the main thread)
* (there are only one remaining thread: the main thread)
* tox catchs KeyboardInterrupt and calls thread.join() again
* thread.join() hangs because the _tstate_lock was already acquire, so lock.acquire() hangs forever => NOT GOOD


You can reproduce the issue on Linux with attached patch and script:

* Apply threading_bug.patch on Python
* Run threading_bug.py
* See that the script hangs forever

Example:
---
$ git apply threading_bug.patch
$ ./python threading_bug.py 
join...
join failed with: KeyboardInterrupt()
join again...
---


I'm now working on a PR to fix the race condition.

----------
components: Library (Lib)
files: threading_bug.patch
keywords: patch
messages: 402523
nosy: vstinner
priority: normal
severity: normal
status: open
title: Race condition in Thread._wait_for_tstate_lock()
versions: Python 3.11
Added file: https://bugs.python.org/file50299/threading_bug.patch

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45274>
_______________________________________


More information about the New-bugs-announce mailing list