Running current HEAD portage 3.0.65-10+g20cd76664, I noticed this issue only because I was tailing /var/log/emerge-fetch.log when emerge --keep-going restarted, and I saw that the thunderbird-bin_x86_64-115.13.0.tar.bz2 fetch job just kept going through the --keep-going dependency calculation. The log showed it finish the download, check the sizes of two other files, rename the file to thunderbird-bin_x86_64-115.13.0.tar.bz2._checksum_failure_.z0bnlc4d, and then successfully retry: > 2024-07-18 20:18:03 (821 KB/s) - ‘/var/portage/distfiles/thunderbird-bin_x86_64-115.13.0.tar.bz2.__download__’ saved [80720623/80720623] > > * mtools-4.0.44.tar.lz size ;-) ... [ ok ] > * OpenJDK21U-jdk_x64_linux_hotspot_21.0.3_9.tar.gz size ;-) ... [ ok ] > Refetching... File renamed to '/var/portage/distfiles/thunderbird-bin_x86_64-115.13.0.tar.bz2._checksum_failure_.z0bnlc4d' > > >>> Downloading 'http://distfiles.gentoo.org/distfiles/29/thunderbird-bin_x86_64-115.13.0.tar.bz2' Cancellation is complicated by the fact that EbuildFetcher calls the fetch function in a subprocess, so the actual FETCHCOMMAND process is a child of a subprocess.
It looks like it created a second thunderbird-bin prefetcher in the second calculation, but the second prefetcher mush have exited silently because the file was still locked by the first prefetcher.
It seems the ForkProcess cancel method was ineffective, so a good place to start would be to add a basic cancellation test case which might reveal a bug. The subprocess has default signal handlers installed by ForkProcess like this: signal.signal(signal.SIGINT, signal.SIG_DFL) signal.signal(signal.SIGTERM, signal.SIG_DFL) The "Refetching... File renamed" log message in comment #0 shows that either the multiprocessing.Process terminate method was not called or it was ineffective for some reason.
It may be that the multiprocessing.Process terminate call killed the child python process but not the FETCHCOMMAND process which inherited stdout, which explains why the fetch output continued through the dependency calculation. Then the "Refetching... File renamed" message would have come from the second thunderbird-bin prefetcher (my thought about it exiting silently in comment #1 was incorrect).
We can install a SIGTERM signal handler in the subprocess, and have it raise an exception which the spawn function can handle by killing the subprocess (FETCHCOMMAND) it's waiting for and then re-raising the exception.
We can re-use this SignalInterrupt exception used in bin/emerge and similar scripts: # Inherit from KeyboardInterrupt to avoid a traceback from asyncio. class SignalInterrupt(KeyboardInterrupt): def __init__(self, signum): self.signum = signum signal.signal(signal.SIGTERM, signal_interrupt)
(In reply to Zac Medico from comment #5) > We can re-use this SignalInterrupt exception used in bin/emerge and similar > scripts: > > # Inherit from KeyboardInterrupt to avoid a traceback from asyncio. > class SignalInterrupt(KeyboardInterrupt): > def __init__(self, signum): > self.signum = signum > > signal.signal(signal.SIGTERM, signal_interrupt) I've tested this approach a bit, and it seems too fragile because there are certain times when this signal handler will cause a misbehavior. For example, with multiprocessing start method spawn, it misbehaves if the signal arrives while the subprocess attempting to start a thread to send file descriptors to the FETCHCOMMAND process: > File "lib/portage/util/_async/ForkProcess.py", line 180, in _main > await self.scheduler.run_in_executor( > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > File "/usr/lib/python3.12/asyncio/base_events.py", line 858, in run_in_executor > executor = concurrent.futures.ThreadPoolExecutor( > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > File "/usr/lib/python3.12/concurrent/futures/__init__.py", line 49, in __getattr__ > from .thread import ThreadPoolExecutor as te > File "/usr/lib/python3.12/concurrent/futures/thread.py", line 10, in <module> > import queue > File "/usr/lib/python3.12/queue.py", line 9, in <module> > from _queue import SimpleQueue > File "<frozen importlib._bootstrap>", line 1360, in _find_and_load > File "<frozen importlib._bootstrap>", line 1322, in _find_and_load_unlocked > File "<frozen importlib._bootstrap>", line 1262, in _find_spec > File "<frozen importlib._bootstrap_external>", line 1528, in find_spec > File "<frozen importlib._bootstrap_external>", line 1502, in _get_spec > File "<frozen importlib._bootstrap_external>", line 1601, in find_spec > File "<frozen importlib._bootstrap_external>", line 147, in _path_stat > File "lib/portage/exception.py", line 127, in handler > raise SignalInterrupt(signum) > portage.exception.SignalInterrupt: SignalInterrupt(signum=15)
Since it's error-prone to raise exceptions from signal handlers, the signal handler needs to call a function that triggers killing of child processes. Generally, it's more practical for this killing to occur asynchronously, in case the main thread happens to be in some kind of critical section when the signal arrives (these critical sections are also the reason why it's error-prone to raise an exception from a signal handler).
If we slightly refactor the fetch function into a generator function that yields processes to be managed by the caller. This sort of pattern is used elsewhere in the code base, like in emirrordist's FetchIterator, egencache's ManifestScheduler, and the iter_completed function which operates on a future generator.
Since the fetch function is called recursively via the get_mirror_url, it's appealing to convert both fetch and get_mirror_url to async functions, and use asyncio.CancelledError handlers to terminate running processes. The SIGTERM handler can schedule graceful termination via call_soon_threadsafe.
For asyncio.CancelledError handlers, there's a race, where the asyncio.CancelledError could arrive after a child has forked but before we have a reference to the process to work with in the asyncio.CancelledError handler. We should be able to clean up after this race in the EbuildFetcher process by using the multiprocessing.active_children() method to detect any children missed by the asyncio.CancelledError handler race.
The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=8b5b5186965c47605ba004d317e8fd58e70e97cd commit 8b5b5186965c47605ba004d317e8fd58e70e97cd Author: Zac Medico <zmedico@gentoo.org> AuthorDate: 2024-08-03 21:33:41 +0000 Commit: Zac Medico <zmedico@gentoo.org> CommitDate: 2024-08-03 21:33:41 +0000 _EbuildFetcherProcess: Handle SIGTERM Fix _EbuildFetcherProcess to handle SIGTERM, so that FETCHCOMMAND processes will not be left running in the background: * Convert the fetch function to an async_fetch coroutine function so that it can use asyncio.CancelledError handlers to terminate running processes. * Use multiprocessing.active_children() to detect and terminate any processes that asyncio.CancelledError handlers did not have an opportunity to terminate because the exception arrived too soon after fork/spawn. * Add unit test to verify that a child process is correctly killed when EbuildFetcher is cancelled, with short timeout in case it takes some time for the process to disappear. Bug: https://bugs.gentoo.org/936273 Signed-off-by: Zac Medico <zmedico@gentoo.org> NEWS | 3 + lib/_emerge/EbuildFetcher.py | 68 ++++++++++++++++++---- lib/portage/package/ebuild/fetch.py | 102 ++++++++++++++++++++++++++++----- lib/portage/tests/ebuild/test_fetch.py | 100 +++++++++++++++++++++++++++++++- lib/portage/tests/util/test_socks5.py | 16 ++++-- 5 files changed, 257 insertions(+), 32 deletions(-)
The bug has been closed via the following commit(s): https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=02d0e00a1ba811b39140d10e17488f7fc3916534 commit 02d0e00a1ba811b39140d10e17488f7fc3916534 Author: Sam James <sam@gentoo.org> AuthorDate: 2024-09-11 01:30:10 +0000 Commit: Sam James <sam@gentoo.org> CommitDate: 2024-09-11 01:30:30 +0000 sys-apps/portage: add 3.0.66 Closes: https://bugs.gentoo.org/435066 Closes: https://bugs.gentoo.org/907061 Closes: https://bugs.gentoo.org/910560 Closes: https://bugs.gentoo.org/933433 Closes: https://bugs.gentoo.org/934220 Closes: https://bugs.gentoo.org/934514 Closes: https://bugs.gentoo.org/934784 Closes: https://bugs.gentoo.org/935830 Closes: https://bugs.gentoo.org/936273 Closes: https://bugs.gentoo.org/937384 Closes: https://bugs.gentoo.org/937485 Closes: https://bugs.gentoo.org/937740 Closes: https://bugs.gentoo.org/937888 Closes: https://bugs.gentoo.org/937891 Closes: https://bugs.gentoo.org/938127 Closes: https://bugs.gentoo.org/933499 Signed-off-by: Sam James <sam@gentoo.org> sys-apps/portage/Manifest | 1 + sys-apps/portage/portage-3.0.66.ebuild | 227 +++++++++++++++++++++++++++++++++ 2 files changed, 228 insertions(+)