Summary: | sys-apps/portage: multiprocessing "spawn" start method triggers pytest-xdist worker crash | ||
---|---|---|---|
Product: | Portage Development | Reporter: | Zac Medico <zmedico> |
Component: | Core | Assignee: | Portage team <dev-portage> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | zmedico |
Priority: | Normal | Keywords: | InVCS |
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
See Also: |
https://github.com/gentoo/portage/pull/1268 https://github.com/gentoo/portage/pull/1269 https://bugs.gentoo.org/show_bug.cgi?id=923847 https://github.com/gentoo/portage/pull/1273 https://github.com/pytest-dev/execnet/issues/96 https://github.com/pytest-dev/pytest-xdist/issues/620 https://github.com/pytest-dev/execnet/issues/150 https://bugs.gentoo.org/show_bug.cgi?id=924369 |
||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Bug Depends on: | |||
Bug Blocks: | 914876 |
Description
Zac Medico
2024-02-13 08:56:22 UTC
The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=86f92dae52382fd6b7fac8ced1d4e5e6456ce68b commit 86f92dae52382fd6b7fac8ced1d4e5e6456ce68b Author: Zac Medico <zmedico@gentoo.org> AuthorDate: 2024-02-13 09:04:40 +0000 Commit: Zac Medico <zmedico@gentoo.org> CommitDate: 2024-02-13 09:42:36 +0000 actions: disable pytest-xdist for spawn start-method (workers crash) Bug: https://bugs.gentoo.org/924416 Signed-off-by: Zac Medico <zmedico@gentoo.org> .github/workflows/ci.yml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) In the past I've attempted to use a multiprocessing context to avoid a global multiprocessing.set_start_method("spawn", force=True) call, but it did not work for me and I instead ended up using a temporary set_start_method call like in this commit: https://gitweb.gentoo.org/proj/portage.git/commit/?id=3b1234ba69a31709cd5aec1ae070901e3a28bb7c We could try again to use a multiprocessing context to avoid the global multiprocessing.set_start_method("spawn", force=True) call, to see if we can avoid crashing the pytest-xdist workers that way. Alternatively, we could do something similar to the above commit, and try restoring the original multiprocessing start method in the unittest.TestCase.tearDown method: https://docs.python.org/3/library/unittest.html#unittest.TestCase.tearDown (In reply to Zac Medico from comment #2) > Alternatively, we could do > something similar to the above commit, and try restoring the original > multiprocessing start method in the unittest.TestCase.tearDown method: > > https://docs.python.org/3/library/unittest.html#unittest.TestCase.tearDown The tearDown approach didn't solve the problem: https://github.com/gentoo/portage/actions/runs/7891065409/job/21534605783 > =================================== FAILURES =================================== > _________________ lib/portage/tests/ebuild/test_ipc_daemon.py __________________ > [gw0] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python > worker 'gw0' crashed while running 'lib/portage/tests/ebuild/test_ipc_daemon.py::IpcDaemonTestCase::testIpcDaemon' > ______________ lib/portage/tests/locks/test_asynchronous_lock.py _______________ > [gw2] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python > worker 'gw2' crashed while running 'lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKillHardlink' (In reply to Zac Medico from comment #2) > In the past I've attempted to use a multiprocessing context to avoid a > global multiprocessing.set_start_method("spawn", force=True) call, but it > did not work for me and I instead ended up using a temporary > set_start_method call like in this commit: > > https://gitweb.gentoo.org/proj/portage.git/commit/ > ?id=3b1234ba69a31709cd5aec1ae070901e3a28bb7c > > We could try again to use a multiprocessing context to avoid the global > multiprocessing.set_start_method("spawn", force=True) call, to see if we can > avoid crashing the pytest-xdist workers that way. We might use a portage.multiprocessing context as a layer of indirection for multiprocessing throughout portage. Reopening since using a portage.multiprocessing context as a layer of indirection for multiprocessing throughout portage seems promising. When I originally attempted to use a multiprocessing context as noted in comment #2, I tried to mock the global multiprocessing module which was doomed to fail. With the current state of https://github.com/gentoo/portage/pull/1270, pytest-xdist worker crashes are much less frequent. At this point, it might be viable to retry tests for crashed workers, since pytest-rerunfailures says it can recover from "hard crashes" when combined with pytest-xdist -n: https://github.com/pytest-dev/pytest-rerunfailures?tab=readme-ov-file#recover-from-hard-crashes I closed https://github.com/gentoo/portage/pull/1270 since it seems like the private multiprocessing context does not really solve the pytest-xdist worker crashes. Opened https://github.com/gentoo/portage/pull/1273 to just handle it with pytest-rerunfailures. The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=ffcb1eb3ded0dbd244836ab3020fdc5244407298 commit ffcb1eb3ded0dbd244836ab3020fdc5244407298 Author: Zac Medico <zmedico@gentoo.org> AuthorDate: 2024-02-14 03:37:08 +0000 Commit: Zac Medico <zmedico@gentoo.org> CommitDate: 2024-02-14 19:07:57 +0000 actions: Use pytest-rerunfailures for pytest-xdist worker crash Since pytest-xdist workers crash intermittently for the multiprocessing spawn start method, use pytest-rerunfailures to detect and handle this case. Only use pytest-rerunfailures for the spawn start-method since that is the only case where we've observed intermittent pytest-xdist worker crashes, and use --only-rerun 'worker .* crashed while running' to ensure that rerun only triggers for worker crashes. Bug: https://bugs.gentoo.org/924416 See: https://github.com/pytest-dev/execnet/issues/96 Signed-off-by: Zac Medico <zmedico@gentoo.org> .github/workflows/ci.yml | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) I've sent https://github.com/pytest-dev/execnet/pull/243 as a possible fix for https://github.com/pytest-dev/execnet/issues/96, but when I tested the patch locally I still observed a pytest-xdist worker crash with it. Anyway, I hope they merge this fix or something like it since portage tests need to run in the main thread (corresponding pytest-xdist issue is https://github.com/pytest-dev/pytest-xdist/issues/620). I've been running the tests using the main_thread_only support in https://github.com/pytest-dev/pytest-xdist/issues/620, and they don't usually crash, but testSpawnReturnProcTerminate seems to crash it more than anything else. The crash seems to be linked to the state of a long-lived execnet gateway since this does not seem to trigger if you only run lib/portage/tests/process/test_spawn_returnproc.py by itself: > lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcTerminate > [gw1] node down: Not properly terminated > [gw1] RERUN lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcTerminate > > replacing crashed worker gw1 There may be some correlation between signal usage and the crash since I just saw testAsynchronousLockWaitKill crash, and it seems like all the tests mentioned in comment #0 were also using SIGTERM: > lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill > [gw1] node down: Not properly terminated > [gw1] RERUN > lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill > > replacing crashed worker gw1 |