scikit-learn 1 - pytest - multiprocessing Pool - hangs?
Dear all,

I am trying to track down a strange behaviour in one of our (Fujitsu) libraries, which we are planning to open source. In preparation for that, I am trying to bring it into a state where it works with scikit-learn >= 1.

However, some of our tests fail when running in parallel mode, and they only fail when running under pytest, NOT when running under plain python.

The library code contains

    def fit(self, X, y=None):
        ...
        p = multiprocessing.Pool()
        ret = _reduce( p.map(....))

What happens is that with scikit-learn 1(.0.1), the code hangs forever. I also tried moving the pool definition out of the fit function into the __init__ function and saving it in self, but that didn't help either.

When interrupted, pytest gives:

    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    /home/norbert/.pyenv/versions/3.9.6/lib/python3.9/threading.py:312: KeyboardInterrupt
    (to show a full traceback on KeyboardInterrupt use --full-trace)
    ================================================ 1 passed, 2 warnings in 273.84s (0:04:33) =================================================
    Exception ignored in: <function Pool.__del__ at 0x7ff72f31b9d0>
    Traceback (most recent call last):
      File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
        self._change_notifier.put(None)
      File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/queues.py", line 378, in put
        self._writer.send_bytes(obj)
      File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
        self._send(header + buf)
      File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 373, in _send
        n = write(self._handle, buf)

When running under

    python testfile.py

all goes well.

I have tested the following combinations:

* scikit-learn 0.23.*, python 3.8 and python 3.9 => works
* scikit-learn 0.24.*, python 3.8 and python 3.9 => works
* scikit-learn 1.0.1, python 3.8 and python 3.9 => fails

I don't really understand where scikit-learn comes into play here, so I wanted to ask whether someone here has an idea.

Thanks for any suggestion

Norbert

--
PREINING Norbert https://www.preining.info
Fujitsu Research + IFMGA Guide + TU Wien + TeX Live + Debian Dev
GPG: 0x860CDC13 fp: F7D8 A928 26E3 16A1 9FA0 ACF0 6CAC A448 860C DC13
Maybe you can try to use faulthandler.dump_traceback_later

https://docs.python.org/3/library/faulthandler.html#faulthandler.dump_traceb...

to get a traceback of all the threads of the main process.

But the fact that you are using the default `p = multiprocessing.Pool()` makes me think that it might be related to the lack of fork-safety of the OpenMP runtime library of GCC (libgomp) [1].

There are several ways to check this:

- print the output of threadpoolctl.threadpool_info() before calling the code that freezes, to confirm (or not) that the libgomp runtime has been loaded before creating the MP Pool.
- use a multiprocessing Pool with a forkserver context instead of the default fork context: multiprocessing.get_context("forkserver").Pool()
- alternatively, use loky.get_reusable_executor() instead of multiprocessing.Pool() (with a slightly different API).
- alternatively, use joblib, which uses loky internally, with an even more different API.
- alternatively, recompile scikit-learn from source with clang instead of gcc so as to link scikit-learn to llvm-openmp instead of gcc's libgomp runtime. llvm-openmp is fork-safe.
- alternatively, install scikit-learn from conda-forge (conda install -c conda-forge scikit-learn), as the conda-forge distribution relinks all OpenMP compiled extensions of its packaged libraries to llvm-openmp transparently at install time, even if they were built with GCC (maybe we should do that for our linux wheels).

[1] https://gcc.gnu.org/legacy-ml/gcc-patches/2014-02/msg00979.html

If that does not work or you need more help, please feel free to open an issue with a minimal reproducer and ping me on gitter or discord.

On Thu, 9 Dec 2021 at 05:59, Norbert Preining <norbert@preining.info> wrote:
Dear all,
[...]
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-- Olivier
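The first two diagnostic steps suggested in this reply could look roughly like the sketch below. It is only an illustration under assumptions: the helper names (arm_hang_watchdog, disarm_hang_watchdog, loaded_openmp_runtimes) are hypothetical and not part of any library, the 60 second timeout is arbitrary, and threadpoolctl is treated as optional because it is a separate package. The watchdog writes to sys.__stderr__ by default so that the dump has a good chance of reaching the terminal even when pytest captures sys.stderr.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=60.0, file=None):
    """Ask faulthandler to dump the tracebacks of all threads if the
    process is still running after timeout_s seconds (hypothetical helper)."""
    target = file if file is not None else sys.__stderr__
    # exit=False: only report where the threads are stuck, do not kill the run.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump once the suspicious code has returned."""
    faulthandler.cancel_dump_traceback_later()


def loaded_openmp_runtimes():
    """Return (prefix, filepath) pairs for the OpenMP runtimes currently
    loaded in this process, or None if threadpoolctl is not installed."""
    try:
        from threadpoolctl import threadpool_info
    except ImportError:
        return None
    return [
        (info.get("prefix"), info.get("filepath"))
        for info in threadpool_info()
        if info.get("user_api") == "openmp"
    ]
```

In a test, one would call arm_hang_watchdog() and print loaded_openmp_runtimes() just before the fit call that freezes, then call disarm_hang_watchdog() afterwards. An entry whose prefix is libgomp would be consistent with the fork-safety hypothesis, although it does not prove it.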
Hi Olivier,

thanks a lot, I will try the various options and see what I can do. If and when I understand more, I will report back.

Thanks again for the detailed explanation and hints, much appreciated.

Best

Norbert

On Thu, 09 Dec 2021, Olivier Grisel wrote:
Maybe you can try to use faulthandler.dump_traceback_later https://docs.python.org/3/library/faulthandler.html#faulthandler.dump_traceb... to get a traceback of all the threads of the main process. [...]