Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 936273 - sys-apps/portage: parallel-fetch EbuildFetcher jobs do not cancel correctly when emerge recalculates dependencies for emerge --keep-going
Summary: sys-apps/portage: parallel-fetch EbuildFetcher jobs do not cancel correctly w...
Status: IN_PROGRESS
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core - Interface (emerge) (show other bugs)
Hardware: All All
: Normal normal (vote)
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS
Depends on: 933499
Blocks: 373807 377365 910332
  Show dependency tree
 
Reported: 2024-07-19 04:04 UTC by Zac Medico
Modified: 2024-08-13 23:28 UTC (History)
2 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Zac Medico gentoo-dev 2024-07-19 04:04:03 UTC
Running current HEAD portage 3.0.65-10+g20cd76664, I noticed this issue only because I was tailing /var/log/emerge-fetch.log when emerge --keep-going restarted, and I saw that the thunderbird-bin_x86_64-115.13.0.tar.bz2 fetch job just kept going through the --keep-going dependency calculation.

The log showed it finish the download, check the sizes of two other files, rename the file to thunderbird-bin_x86_64-115.13.0.tar.bz2._checksum_failure_.z0bnlc4d, and then successfully retry:

> 2024-07-18 20:18:03 (821 KB/s) - ‘/var/portage/distfiles/thunderbird-bin_x86_64-115.13.0.tar.bz2.__download__’ saved [80720623/80720623]
> 
>  * mtools-4.0.44.tar.lz size ;-) ...                                     [ ok ]
>  * OpenJDK21U-jdk_x64_linux_hotspot_21.0.3_9.tar.gz size ;-) ...         [ ok ]
> Refetching... File renamed to '/var/portage/distfiles/thunderbird-bin_x86_64-115.13.0.tar.bz2._checksum_failure_.z0bnlc4d'
> 
> >>> Downloading 'http://distfiles.gentoo.org/distfiles/29/thunderbird-bin_x86_64-115.13.0.tar.bz2'

Cancellation is complicated by the fact that EbuildFetcher calls the fetch function in a subprocess, so the actual FETCHCOMMAND process is a child of a subprocess.
Comment 1 Zac Medico gentoo-dev 2024-07-19 04:21:37 UTC
It looks like it created a second thunderbird-bin prefetcher in the second calculation, but the second prefetcher mush have exited silently because the file was still locked by the first prefetcher.
Comment 2 Zac Medico gentoo-dev 2024-07-20 05:02:44 UTC
It seems the ForkProcess cancel method was ineffective, so a good place to start would be to add a basic cancellation test case which might reveal a bug. The subprocess has default signal handlers installed by ForkProcess like this:

        signal.signal(signal.SIGINT, signal.SIG_DFL)
        signal.signal(signal.SIGTERM, signal.SIG_DFL)

The "Refetching... File renamed" log message in comment #0 shows that either the multiprocessing.Process terminate method was not called or it was ineffective for some reason.
Comment 3 Zac Medico gentoo-dev 2024-07-20 05:14:05 UTC
It may be that the multiprocessing.Process terminate call killed the child python process but not the FETCHCOMMAND process which inherited stdout, which explains why the fetch output continued through the dependency calculation.

Then the "Refetching... File renamed" message would have come from the second thunderbird-bin prefetcher (my thought about it exiting silently in comment #1 was incorrect).
Comment 4 Zac Medico gentoo-dev 2024-07-20 05:36:23 UTC
We can install a SIGTERM signal handler in the subprocess, and have it raise an exception which the spawn function can handle by killing the subprocess (FETCHCOMMAND) it's waiting for and then re-raising the exception.
Comment 5 Zac Medico gentoo-dev 2024-07-20 05:47:33 UTC
We can re-use this SignalInterrupt exception used in bin/emerge and similar scripts: 

# Inherit from KeyboardInterrupt to avoid a traceback from asyncio.
class SignalInterrupt(KeyboardInterrupt):
    def __init__(self, signum):
        self.signum = signum

signal.signal(signal.SIGTERM, signal_interrupt)
Comment 6 Zac Medico gentoo-dev 2024-07-21 03:52:54 UTC
(In reply to Zac Medico from comment #5)
> We can re-use this SignalInterrupt exception used in bin/emerge and similar
> scripts: 
> 
> # Inherit from KeyboardInterrupt to avoid a traceback from asyncio.
> class SignalInterrupt(KeyboardInterrupt):
>     def __init__(self, signum):
>         self.signum = signum
> 
> signal.signal(signal.SIGTERM, signal_interrupt)

I've tested this approach a bit, and it seems too fragile because there are certain times when this signal handler will cause a misbehavior. For example, with multiprocessing start method spawn, it misbehaves if the signal arrives while the subprocess attempting to start a thread to send file descriptors to the FETCHCOMMAND process:

>   File "lib/portage/util/_async/ForkProcess.py", line 180, in _main
>     await self.scheduler.run_in_executor(
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/lib/python3.12/asyncio/base_events.py", line 858, in run_in_executor
>     executor = concurrent.futures.ThreadPoolExecutor(
>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/lib/python3.12/concurrent/futures/__init__.py", line 49, in __getattr__
>     from .thread import ThreadPoolExecutor as te
>   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 10, in <module>
>     import queue
>   File "/usr/lib/python3.12/queue.py", line 9, in <module>
>     from _queue import SimpleQueue
>   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
>   File "<frozen importlib._bootstrap>", line 1322, in _find_and_load_unlocked
>   File "<frozen importlib._bootstrap>", line 1262, in _find_spec
>   File "<frozen importlib._bootstrap_external>", line 1528, in find_spec
>   File "<frozen importlib._bootstrap_external>", line 1502, in _get_spec
>   File "<frozen importlib._bootstrap_external>", line 1601, in find_spec
>   File "<frozen importlib._bootstrap_external>", line 147, in _path_stat
>   File "lib/portage/exception.py", line 127, in handler
>     raise SignalInterrupt(signum)
> portage.exception.SignalInterrupt: SignalInterrupt(signum=15)
Comment 7 Zac Medico gentoo-dev 2024-07-21 05:22:16 UTC
Since it's error-prone to raise exceptions from signal handlers, the signal handler needs to call a function that triggers killing of child processes. Generally, it's more practical for this killing to occur asynchronously, in case the main thread happens to be in some kind of critical section when the signal arrives (these critical sections are also the reason why it's error-prone to raise an exception from a signal handler).
Comment 8 Zac Medico gentoo-dev 2024-07-21 05:47:46 UTC
If we slightly refactor the fetch function into a generator function that yields processes to be managed by the caller. This sort of pattern is used elsewhere in the code base, like in emirrordist's FetchIterator, egencache's ManifestScheduler, and the iter_completed function which operates on a future generator.
Comment 9 Zac Medico gentoo-dev 2024-07-21 07:28:19 UTC
Since the fetch function is called recursively via the get_mirror_url, it's appealing to convert both fetch and get_mirror_url to async functions, and use asyncio.CancelledError handlers to terminate running processes. The SIGTERM handler can schedule graceful termination via call_soon_threadsafe.
Comment 10 Zac Medico gentoo-dev 2024-07-21 19:43:03 UTC
For asyncio.CancelledError handlers, there's a race, where the asyncio.CancelledError could arrive after a child has forked but before we have a reference to the process to work with in the asyncio.CancelledError handler.

We should be able to clean up after this race in the EbuildFetcher process by using the multiprocessing.active_children() method to detect any children missed by the asyncio.CancelledError handler race.
Comment 11 Larry the Git Cow gentoo-dev 2024-08-03 21:36:07 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=8b5b5186965c47605ba004d317e8fd58e70e97cd

commit 8b5b5186965c47605ba004d317e8fd58e70e97cd
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2024-08-03 21:33:41 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2024-08-03 21:33:41 +0000

    _EbuildFetcherProcess: Handle SIGTERM
    
    Fix _EbuildFetcherProcess to handle SIGTERM, so that FETCHCOMMAND
    processes will not be left running in the background:
    
    * Convert the fetch function to an async_fetch coroutine function
      so that it can use asyncio.CancelledError handlers to terminate
      running processes.
    
    * Use multiprocessing.active_children() to detect and terminate
      any processes that asyncio.CancelledError handlers did not have
      an opportunity to terminate because the exception arrived too
      soon after fork/spawn.
    
    * Add unit test to verify that a child process is correctly
      killed when EbuildFetcher is cancelled, with short timeout in
      case it takes some time for the process to disappear.
    
    Bug: https://bugs.gentoo.org/936273
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 NEWS                                   |   3 +
 lib/_emerge/EbuildFetcher.py           |  68 ++++++++++++++++++----
 lib/portage/package/ebuild/fetch.py    | 102 ++++++++++++++++++++++++++++-----
 lib/portage/tests/ebuild/test_fetch.py | 100 +++++++++++++++++++++++++++++++-
 lib/portage/tests/util/test_socks5.py  |  16 ++++--
 5 files changed, 257 insertions(+), 32 deletions(-)