Since the multiprocessing "spawn" start method was enabled for pytest-xdist as in https://github.com/gentoo/portage/pull/1265, it triggers pytest-xdist worker crashes like these that I observed while working on https://github.com/gentoo/portage/pull/1263, and they are resolved simply by disabling the pytest-xdist plugin: https://github.com/gentoo/portage/actions/runs/7881445624/job/21505078453 > 2024-02-13T03:58:08.0646108Z =================================== FAILURES =================================== > 2024-02-13T03:58:08.0646351Z _________________ lib/portage/tests/util/futures/test_retry.py _________________ > 2024-02-13T03:58:08.0646825Z [gw3] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python > 2024-02-13T03:58:08.0647599Z worker 'gw3' crashed while running 'lib/portage/tests/util/futures/test_retry.py::RetryForkExecutorTestCase::testOverallTimeoutWithTimeoutError' > 2024-02-13T03:58:08.0647839Z _________________ lib/portage/tests/ebuild/test_ipc_daemon.py __________________ > 2024-02-13T03:58:08.0648146Z [gw0] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python > 2024-02-13T03:58:08.0648683Z worker 'gw0' crashed while running 'lib/portage/tests/ebuild/test_ipc_daemon.py::IpcDaemonTestCase::testIpcDaemon' https://github.com/gentoo/portage/actions/runs/7881695867/job/21505735793 > 2024-02-13T04:33:58.8748178Z =================================== FAILURES =================================== > 2024-02-13T04:33:58.8748721Z ______________ lib/portage/tests/process/test_spawn_returnproc.py ______________ > 2024-02-13T04:33:58.8749432Z [gw1] linux -- Python 3.12.1 /opt/hostedtoolcache/Python/3.12.1/x64/bin/python > 2024-02-13T04:33:58.8750541Z worker 'gw1' crashed while running 'lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcWait' https://github.com/gentoo/portage/actions/runs/7881816146/job/21506044613 > 2024-02-13T04:51:38.8635997Z =================================== FAILURES =================================== > 2024-02-13T04:51:38.8636255Z ______________ lib/portage/tests/locks/test_asynchronous_lock.py _______________ > 2024-02-13T04:51:38.8636576Z [gw1] linux -- Python 3.12.1 /opt/hostedtoolcache/Python/3.12.1/x64/bin/python > 2024-02-13T04:51:38.8637323Z worker 'gw1' crashed while running 'lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill'
The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=86f92dae52382fd6b7fac8ced1d4e5e6456ce68b commit 86f92dae52382fd6b7fac8ced1d4e5e6456ce68b Author: Zac Medico <zmedico@gentoo.org> AuthorDate: 2024-02-13 09:04:40 +0000 Commit: Zac Medico <zmedico@gentoo.org> CommitDate: 2024-02-13 09:42:36 +0000 actions: disable pytest-xdist for spawn start-method (workers crash) Bug: https://bugs.gentoo.org/924416 Signed-off-by: Zac Medico <zmedico@gentoo.org> .github/workflows/ci.yml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
In the past I've attempted to use a multiprocessing context to avoid a global multiprocessing.set_start_method("spawn", force=True) call, but it did not work for me and I instead ended up using a temporary set_start_method call like in this commit: https://gitweb.gentoo.org/proj/portage.git/commit/?id=3b1234ba69a31709cd5aec1ae070901e3a28bb7c We could try again to use a multiprocessing context to avoid the global multiprocessing.set_start_method("spawn", force=True) call, to see if we can avoid crashing the pytest-xdist workers that way. Alternatively, we could do something similar to the above commit, and try restoring the original multiprocessing start method in the unittest.TestCase.tearDown method: https://docs.python.org/3/library/unittest.html#unittest.TestCase.tearDown
(In reply to Zac Medico from comment #2) > Alternatively, we could do > something similar to the above commit, and try restoring the original > multiprocessing start method in the unittest.TestCase.tearDown method: > > https://docs.python.org/3/library/unittest.html#unittest.TestCase.tearDown The tearDown approach didn't solve the problem: https://github.com/gentoo/portage/actions/runs/7891065409/job/21534605783 > =================================== FAILURES =================================== > _________________ lib/portage/tests/ebuild/test_ipc_daemon.py __________________ > [gw0] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python > worker 'gw0' crashed while running 'lib/portage/tests/ebuild/test_ipc_daemon.py::IpcDaemonTestCase::testIpcDaemon' > ______________ lib/portage/tests/locks/test_asynchronous_lock.py _______________ > [gw2] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python > worker 'gw2' crashed while running 'lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKillHardlink'
(In reply to Zac Medico from comment #2) > In the past I've attempted to use a multiprocessing context to avoid a > global multiprocessing.set_start_method("spawn", force=True) call, but it > did not work for me and I instead ended up using a temporary > set_start_method call like in this commit: > > https://gitweb.gentoo.org/proj/portage.git/commit/ > ?id=3b1234ba69a31709cd5aec1ae070901e3a28bb7c > > We could try again to use a multiprocessing context to avoid the global > multiprocessing.set_start_method("spawn", force=True) call, to see if we can > avoid crashing the pytest-xdist workers that way. We might use a portage.multiprocessing context as a layer of indirection for multiprocessing throughout portage.
Reopening since using a portage.multiprocessing context as a layer of indirection for multiprocessing throughout portage seems promising. When I originally attempted to use a multiprocessing context as noted in comment #2, I tried to mock the global multiprocessing module which was doomed to fail.
With the current state of https://github.com/gentoo/portage/pull/1270, pytest-xdist worker crashes are much less frequent. At this point, it might be viable to retry tests for crashed workers, since pytest-rerunfailures says it can recover from "hard crashes" when combined with pytest-xdist -n: https://github.com/pytest-dev/pytest-rerunfailures?tab=readme-ov-file#recover-from-hard-crashes
I closed https://github.com/gentoo/portage/pull/1270 since it seems like the private multiprocessing context does not really solve the pytest-xdist worker crashes. Opened https://github.com/gentoo/portage/pull/1273 to just handle it with pytest-rerunfailures.
The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=ffcb1eb3ded0dbd244836ab3020fdc5244407298 commit ffcb1eb3ded0dbd244836ab3020fdc5244407298 Author: Zac Medico <zmedico@gentoo.org> AuthorDate: 2024-02-14 03:37:08 +0000 Commit: Zac Medico <zmedico@gentoo.org> CommitDate: 2024-02-14 19:07:57 +0000 actions: Use pytest-rerunfailures for pytest-xdist worker crash Since pytest-xdist workers crash intermittently for the multiprocessing spawn start method, use pytest-rerunfailures to detect and handle this case. Only use pytest-rerunfailures for the spawn start-method since that is the only case where we've observed intermittent pytest-xdist worker crashes, and use --only-rerun 'worker .* crashed while running' to ensure that rerun only triggers for worker crashes. Bug: https://bugs.gentoo.org/924416 See: https://github.com/pytest-dev/execnet/issues/96 Signed-off-by: Zac Medico <zmedico@gentoo.org> .github/workflows/ci.yml | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
I've sent https://github.com/pytest-dev/execnet/pull/243 as a possible fix for https://github.com/pytest-dev/execnet/issues/96, but when I tested the patch locally I still observed a pytest-xdist worker crash with it. Anyway, I hope they merge this fix or something like it since portage tests need to run in the main thread (corresponding pytest-xdist issue is https://github.com/pytest-dev/pytest-xdist/issues/620).
I've been running the tests using the main_thread_only support in https://github.com/pytest-dev/pytest-xdist/issues/620, and they don't usually crash, but testSpawnReturnProcTerminate seems to crash it more than anything else. The crash seems to be linked to the state of a long-lived execnet gateway since this does not seem to trigger if you only run lib/portage/tests/process/test_spawn_returnproc.py by itself: > lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcTerminate > [gw1] node down: Not properly terminated > [gw1] RERUN lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcTerminate > > replacing crashed worker gw1
There may be some correlation between signal usage and the crash since I just saw testAsynchronousLockWaitKill crash, and it seems like all the tests mentioned in comment #0 were also using SIGTERM: > lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill > [gw1] node down: Not properly terminated > [gw1] RERUN > lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill > > replacing crashed worker gw1