Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 924416 - sys-apps/portage: multiprocessing "spawn" start method triggers pytest-xdist worker crash
Summary: sys-apps/portage: multiprocessing "spawn" start method triggers pytest-xdist ...
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core (show other bugs)
Hardware: All Linux
: Normal normal (vote)
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS
Depends on:
Blocks: 914876
  Show dependency tree
 
Reported: 2024-02-13 08:56 UTC by Zac Medico
Modified: 2024-02-29 04:55 UTC (History)
1 user (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Zac Medico gentoo-dev 2024-02-13 08:56:22 UTC
Since the multiprocessing "spawn" start method was enabled for pytest-xdist as in https://github.com/gentoo/portage/pull/1265, it triggers pytest-xdist worker crashes like these that I observed while working on https://github.com/gentoo/portage/pull/1263, and they are resolved simply by disabling the pytest-xdist plugin:

https://github.com/gentoo/portage/actions/runs/7881445624/job/21505078453

> 2024-02-13T03:58:08.0646108Z =================================== FAILURES ===================================
> 2024-02-13T03:58:08.0646351Z _________________ lib/portage/tests/util/futures/test_retry.py _________________
> 2024-02-13T03:58:08.0646825Z [gw3] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python
> 2024-02-13T03:58:08.0647599Z worker 'gw3' crashed while running 'lib/portage/tests/util/futures/test_retry.py::RetryForkExecutorTestCase::testOverallTimeoutWithTimeoutError'
> 2024-02-13T03:58:08.0647839Z _________________ lib/portage/tests/ebuild/test_ipc_daemon.py __________________
> 2024-02-13T03:58:08.0648146Z [gw0] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python
> 2024-02-13T03:58:08.0648683Z worker 'gw0' crashed while running 'lib/portage/tests/ebuild/test_ipc_daemon.py::IpcDaemonTestCase::testIpcDaemon'

https://github.com/gentoo/portage/actions/runs/7881695867/job/21505735793

> 2024-02-13T04:33:58.8748178Z =================================== FAILURES ===================================
> 2024-02-13T04:33:58.8748721Z ______________ lib/portage/tests/process/test_spawn_returnproc.py ______________
> 2024-02-13T04:33:58.8749432Z [gw1] linux -- Python 3.12.1 /opt/hostedtoolcache/Python/3.12.1/x64/bin/python
> 2024-02-13T04:33:58.8750541Z worker 'gw1' crashed while running 'lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcWait'

https://github.com/gentoo/portage/actions/runs/7881816146/job/21506044613

> 2024-02-13T04:51:38.8635997Z =================================== FAILURES ===================================
> 2024-02-13T04:51:38.8636255Z ______________ lib/portage/tests/locks/test_asynchronous_lock.py _______________
> 2024-02-13T04:51:38.8636576Z [gw1] linux -- Python 3.12.1 /opt/hostedtoolcache/Python/3.12.1/x64/bin/python
> 2024-02-13T04:51:38.8637323Z worker 'gw1' crashed while running 'lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill'
Comment 1 Larry the Git Cow gentoo-dev 2024-02-13 10:01:51 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=86f92dae52382fd6b7fac8ced1d4e5e6456ce68b

commit 86f92dae52382fd6b7fac8ced1d4e5e6456ce68b
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2024-02-13 09:04:40 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2024-02-13 09:42:36 +0000

    actions: disable pytest-xdist for spawn start-method (workers crash)
    
    Bug: https://bugs.gentoo.org/924416
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 .github/workflows/ci.yml | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
Comment 2 Zac Medico gentoo-dev 2024-02-13 17:21:14 UTC
In the past I've attempted to use a multiprocessing context to avoid a global multiprocessing.set_start_method("spawn", force=True) call, but it did not work for me and I instead ended up using a temporary set_start_method call like in this commit:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=3b1234ba69a31709cd5aec1ae070901e3a28bb7c

We could try again to use a multiprocessing context to avoid the global multiprocessing.set_start_method("spawn", force=True) call, to see if we can avoid crashing the pytest-xdist workers that way. Alternatively, we could do something similar to the above commit, and try restoring the original multiprocessing start method in the unittest.TestCase.tearDown method:

https://docs.python.org/3/library/unittest.html#unittest.TestCase.tearDown
Comment 3 Zac Medico gentoo-dev 2024-02-13 18:40:41 UTC
(In reply to Zac Medico from comment #2)
> Alternatively, we could do
> something similar to the above commit, and try restoring the original
> multiprocessing start method in the unittest.TestCase.tearDown method:
> 
> https://docs.python.org/3/library/unittest.html#unittest.TestCase.tearDown

The tearDown approach didn't solve the problem:

https://github.com/gentoo/portage/actions/runs/7891065409/job/21534605783

>  =================================== FAILURES ===================================
>  _________________ lib/portage/tests/ebuild/test_ipc_daemon.py __________________
>  [gw0] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python
>  worker 'gw0' crashed while running 'lib/portage/tests/ebuild/test_ipc_daemon.py::IpcDaemonTestCase::testIpcDaemon'
>  ______________ lib/portage/tests/locks/test_asynchronous_lock.py _______________
>  [gw2] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python
>  worker 'gw2' crashed while running 'lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKillHardlink'
Comment 4 Zac Medico gentoo-dev 2024-02-13 19:09:40 UTC
(In reply to Zac Medico from comment #2)
> In the past I've attempted to use a multiprocessing context to avoid a
> global multiprocessing.set_start_method("spawn", force=True) call, but it
> did not work for me and I instead ended up using a temporary
> set_start_method call like in this commit:
> 
> https://gitweb.gentoo.org/proj/portage.git/commit/
> ?id=3b1234ba69a31709cd5aec1ae070901e3a28bb7c
> 
> We could try again to use a multiprocessing context to avoid the global
> multiprocessing.set_start_method("spawn", force=True) call, to see if we can
> avoid crashing the pytest-xdist workers that way.

We might use a portage.multiprocessing context as a layer of indirection for multiprocessing throughout portage.
Comment 5 Zac Medico gentoo-dev 2024-02-13 19:36:47 UTC
Reopening since using a portage.multiprocessing context as a layer of indirection for multiprocessing throughout portage seems promising. When I originally attempted to use a multiprocessing context as noted in comment #2, I tried to mock the global multiprocessing module which was doomed to fail.
Comment 6 Zac Medico gentoo-dev 2024-02-14 06:58:54 UTC
With the current state of https://github.com/gentoo/portage/pull/1270, pytest-xdist worker crashes are much less frequent. At this point, it might be viable to retry tests for crashed workers, since pytest-rerunfailures says it can recover from "hard crashes" when combined with pytest-xdist -n:

https://github.com/pytest-dev/pytest-rerunfailures?tab=readme-ov-file#recover-from-hard-crashes
Comment 7 Zac Medico gentoo-dev 2024-02-14 18:49:01 UTC
I closed https://github.com/gentoo/portage/pull/1270 since it seems like the private multiprocessing context does not really solve the pytest-xdist worker crashes. Opened https://github.com/gentoo/portage/pull/1273 to just handle it with pytest-rerunfailures.
Comment 8 Larry the Git Cow gentoo-dev 2024-02-14 19:09:31 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=ffcb1eb3ded0dbd244836ab3020fdc5244407298

commit ffcb1eb3ded0dbd244836ab3020fdc5244407298
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2024-02-14 03:37:08 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2024-02-14 19:07:57 +0000

    actions: Use pytest-rerunfailures for pytest-xdist worker crash
    
    Since pytest-xdist workers crash intermittently for the multiprocessing
    spawn start method, use pytest-rerunfailures to detect and handle this
    case. Only use pytest-rerunfailures for the spawn start-method since
    that is the only case where we've observed intermittent pytest-xdist
    worker crashes, and use --only-rerun 'worker .* crashed while running'
    to ensure that rerun only triggers for worker crashes.
    
    Bug: https://bugs.gentoo.org/924416
    See: https://github.com/pytest-dev/execnet/issues/96
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 .github/workflows/ci.yml | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
Comment 9 Zac Medico gentoo-dev 2024-02-15 07:38:48 UTC
I've sent https://github.com/pytest-dev/execnet/pull/243 as a possible fix for https://github.com/pytest-dev/execnet/issues/96, but when I tested the patch locally I still observed a pytest-xdist worker crash with it. Anyway, I hope they merge this fix or something like it since portage tests need to run in the main thread (corresponding pytest-xdist issue is https://github.com/pytest-dev/pytest-xdist/issues/620).
Comment 10 Zac Medico gentoo-dev 2024-02-28 07:25:10 UTC
I've been running the tests using the main_thread_only support in https://github.com/pytest-dev/pytest-xdist/issues/620, and they don't usually crash, but testSpawnReturnProcTerminate seems to crash it more than anything else. The crash seems to be linked to the state of a long-lived execnet gateway since this does not seem to trigger if you only run lib/portage/tests/process/test_spawn_returnproc.py by itself:

> lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcTerminate 
> [gw1] node down: Not properly terminated
> [gw1] RERUN lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcTerminate 
> 
> replacing crashed worker gw1
Comment 11 Zac Medico gentoo-dev 2024-02-29 04:55:51 UTC
There may be some correlation between signal usage and the crash since I just saw testAsynchronousLockWaitKill crash, and it seems like all the tests mentioned in comment #0 were also using SIGTERM:

> lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill 
> [gw1] node down: Not properly terminated
> [gw1] RERUN > lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill 
> 
> replacing crashed worker gw1