Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 919072 - emerge hangs for parallel installation including binpkgs and disabled IPC (exposed by dev-util/meson-1.3.0)
Summary: emerge hangs for parallel installation including binpkgs and disabled IPC (ex...
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Unclassified (show other bugs)
Hardware: All Linux
: Normal normal (vote)
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS
Depends on: 918929
Blocks:
  Show dependency tree
 
Reported: 2023-12-02 23:49 UTC by Petr Vaněk
Modified: 2024-02-14 15:45 UTC (History)
5 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Petr Vaněk gentoo-dev 2023-12-02 23:49:25 UTC
For few days I am experiencing an issue, where emerge just hangs after issuing following command. It is ~amd64 machine and everything is update to to latest available version.

  $ emerge -qv1k -j8 net-im/synapse

I see only few py3.12 zombie processes all portage processes are sleeping and the console looks like this:

>>> Verifying ebuild manifests
>>> Running pre-merge checks for acct-group/synapse-0-r1
>>> Running pre-merge checks for acct-group/postgres-0-r2
>>> Running pre-merge checks for acct-user/synapse-0-r1
>>> Running pre-merge checks for acct-user/postgres-0-r2
>>> Running pre-merge checks for dev-lang/rust-bin-1.73.0
>>> Emerging (1 of 91) acct-group/synapse-0-r1::gentoo
>>> Emerging (2 of 91) acct-group/postgres-0-r2::gentoo
>>> Emerging binary (3 of 91) media-libs/libjpeg-turbo-3.0.1::gentoo
>>> Emerging binary (4 of 91) dev-libs/icu-74.1::gentoo
>>> Emerging binary (5 of 91) dev-libs/libyaml-0.2.5::gentoo
>>> Emerging binary (6 of 91) dev-libs/yajl-2.1.0-r4::gentoo
>>> Emerging binary (7 of 91) media-libs/libpng-1.6.40-r1::gentoo
>>> Emerging binary (8 of 91) app-eselect/eselect-postgresql-2.4-r1::gentoo
>>> Installing (3 of 91) media-libs/libjpeg-turbo-3.0.1::gentoo
>>> Emerging binary (9 of 91) app-eselect/eselect-rust-20210703::gentoo

I tried to track down which update is responsible for this issue and I located commit 6475fd300200 ("dev-util/meson: add 1.3.0"). meson-1.3.0 contains this change https://github.com/mesonbuild/meson/commit/b91244c3b7f77578d601502628bceb9e92183387 which changes how replacement works for boolean type. It affects https://github.com/gentoo/portage/blob/84e1358fbd35910a778502f00d939b5cbb77c548/lib/_emerge/AbstractEbuildProcess.py#L56 where meson-1.3.0 substitutes @IPC@ with `1` instead of `True` as it used to with <meson-1.3, effectively disabling IPC. Everything works as expected with enabled IPC. It affects portage-3.0.{55,56}, I  didn't try older versions. I can reproduce it always.
Comment 1 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-12-03 00:04:33 UTC
So, there's two bugs:
1) Either a bug in Meson or what we're expecting of Meson in Portage wrt the substitution behaviour;

2) The deadlock with IPC disabled
Comment 2 Eli Schwartz 2023-12-03 00:37:09 UTC
The linked change is highly specific to cmake... or should be. ;) it is NOT supposed to change how configure_file works when in the default meson replacement mode.

The linked change is badly tested, which also indicates that we don't have robust tests for configure_file in general...

Can you check if https://github.com/mesonbuild/meson/pull/12532 fixes it?
Comment 3 Larry the Git Cow gentoo-dev 2023-12-03 01:09:28 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=de9e76a42d645a57f908c3fdb2c887f656118918

commit de9e76a42d645a57f908c3fdb2c887f656118918
Author:     Eli Schwartz <eschwartz93@gmail.com>
AuthorDate: 2023-12-03 00:59:06 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-12-03 01:05:45 +0000

    package.mask: mask broken meson version for anyone that had it installed
    
    It breaks portage so it is quite important that users not use it for any
    reason.
    
    Bug: https://bugs.gentoo.org/919072
    Signed-off-by: Eli Schwartz <eschwartz93@gmail.com>
    Closes: https://github.com/gentoo/gentoo/pull/34099
    Signed-off-by: Sam James <sam@gentoo.org>

 profiles/package.mask | 5 +++++
 1 file changed, 5 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=2556ee17fb54aea6c0b6d766f239db009f7b57fc

commit 2556ee17fb54aea6c0b6d766f239db009f7b57fc
Author:     Eli Schwartz <eschwartz93@gmail.com>
AuthorDate: 2023-12-03 00:53:47 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-12-03 01:05:44 +0000

    dev-util/meson: backport fix for broken configure_file output
    
    This broke portage itself.
    
    Bug: https://github.com/mesonbuild/meson/pull/12532
    Bug: https://bugs.gentoo.org/919072
    Signed-off-by: Eli Schwartz <eschwartz93@gmail.com>
    Signed-off-by: Sam James <sam@gentoo.org>

 ...-convert-boolean-values-for-cmake-formats.patch | 31 ++++++++++++++++++++++
 .../{meson-1.3.0.ebuild => meson-1.3.0-r1.ebuild}  |  3 +++
 2 files changed, 34 insertions(+)
Comment 4 Larry the Git Cow gentoo-dev 2023-12-03 08:02:41 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=33701a3a34c89639c79e3d3f719358a1a01bc57d

commit 33701a3a34c89639c79e3d3f719358a1a01bc57d
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2023-12-03 07:59:07 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-12-03 08:02:12 +0000

    sys-apps/portage: don't allow broken meson for build
    
    =dev-util/meson-1.3.0 could lead to a miscompiled Portage with IPC disabled.
    
    Bug: https://bugs.gentoo.org/919072
    Signed-off-by: Sam James <sam@gentoo.org>

 sys-apps/portage/{portage-3.0.56.ebuild => portage-3.0.56-r1.ebuild} | 4 ++++
 1 file changed, 4 insertions(+)
Comment 5 Petr Vaněk gentoo-dev 2023-12-03 09:11:12 UTC
I have minimal reproducer for the deadlock with IPC disabled.

1. create two empty packages dev-libs/a dev-libs/b with no dependencies
2. create binary packages for them
   $ emerge -b dev-libs/a dev-libs/b
3. (re)install them from binary packages
   $ emerge -k -j2 dev-libs/a dev-libs/b
4. observe the deadlock

The important aspect seems to be the independence of the packages being installed, which appears to be the trigger for the deadlock when IPC is disabled.
Comment 6 Petr Vaněk gentoo-dev 2023-12-03 12:58:24 UTC
The deadlock issue was introduced in portage-2.3.102 in commit dd69ce742c62 ("Support PORTAGE_LOG_FILTER_FILE_CMD (bug 709746)") https://gitweb.gentoo.org/proj/portage.git/commit/?id=dd69ce742c62b9515cf7ae37e46bcf7f178777db
Comment 7 Zac Medico gentoo-dev 2023-12-04 17:58:42 UTC
(In reply to Petr Vaněk from comment #6)
> The deadlock issue was introduced in portage-2.3.102 in commit dd69ce742c62
> ("Support PORTAGE_LOG_FILTER_FILE_CMD (bug 709746)")
> https://gitweb.gentoo.org/proj/portage.git/commit/
> ?id=dd69ce742c62b9515cf7ae37e46bcf7f178777db

Do you have PORTAGE_LOG_FILTER_FILE_CMD set or unset? Can you check the portage process tree with `ps axf` during the deadlock, and also try to identify any lock files that are open by portage processes with lsof? The lock files typically have portage_lockfile in the name.
Comment 8 Zac Medico gentoo-dev 2023-12-04 18:14:11 UTC
It might be interesting to see if PipeLoggerTestCase deadlocks, since that's the most obvious type of deadlock related to dd69ce742c62b9515cf7ae37e46bcf7f178777db. The test case was added in the commit just before:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=72ac22e722549833c1ee7e7ad1b585db55f7dafc

You can run it from a portage checkout like this:

pytest lib/portage/tests/process/test_PipeLogger.py
Comment 9 Petr Vaněk gentoo-dev 2023-12-04 21:22:35 UTC
(In reply to Zac Medico from comment #7)
> Do you have PORTAGE_LOG_FILTER_FILE_CMD set or unset?

It is unset.

> Can you check the portage process tree with `ps axf` during the deadlock, and
> also try to identify any lock files that are open by portage processes with
> lsof? The lock files typically have portage_lockfile in the name.

$ ps axf
23752 pts/0    S+     0:01 strace -s 256 -fftto /tmp/test-1/emerge.strace emerge -k -j2 dev-libs/a dev-libs/b
23755 pts/0    S+     0:01  \_ /usr/bin/python3.12 /usr/lib/python-exec/python3.12/emerge -k -j2 dev-libs/a dev-libs/b
23838 pts/0    Z+     0:00      \_ [python3.12] <defunct>
23848 pts/0    S+     0:00      \_ /usr/bin/python3.12 /usr/lib/python-exec/python3.12/emerge -k -j2 dev-libs/a dev-libs/b
23913 pts/0    S+     0:00          \_ /usr/bin/python3.12 /usr/lib/portage/python3.12/lock-helper.py /var/db/pkg-ebuild

$ lsof -p 23755,23838,23848,23913 | grep lock
emerge    23755 root   7uW     REG               0,37        0   2589534 /var/tmp/portage/dev-libs/.b-1.portage_lockfile
emerge    23755 root   8uW     REG               0,37        0   2799267 /var/tmp/portage/dev-libs/.a-1.portage_lockfile
emerge    23755 root  12uW     REG               0,39        0 151863886 /var/db/.pkg-ebuild.portage_lockfile
emerge    23848 root   7uW     REG               0,39        0 214816387 /var/db/pkg/dev-libs/.a:0.portage_lockfile
python3.1 23913 root   3u      REG               0,39        0 151863886 /var/db/.pkg-ebuild.portage_lockfile

Process 23838 executes pid-ns-init which exits successfully. I see from the
strace, that 23755 receives the SIGCHLD signal which is not handled or the wait
syscall is missing and I think this might be the reason for deadlock.

some snippets from 23755 strace:

20:48:32.708962 openat(AT_FDCWD, "/dev/ptmx", O_RDWR) = 14
...
20:48:32.710491 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f93d8392a10) = 23838
...
20:48:32.712065 pipe2([15, 16], O_CLOEXEC) = 0
...
20:48:32.814741 epoll_wait(3, [{events=EPOLLHUP, data={u32=14, u64=140269336920078}}], 8, 1884) = 1
20:48:33.238554 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23838, si_uid=0, si_status=0, si_utime=3 /* 0.03 s */, si_stime=1 /* 0.01 s */} ---
20:48:33.238642 epoll_wait(3, [{events=EPOLLHUP, data={u32=14, u64=140269336920078}}], 8, 0) = 1
20:48:33.238701 epoll_ctl(3, EPOLL_CTL_DEL, 14, 0x7ffd46e7683c) = 0
20:48:33.238736 read(14, 0x55f83c6d0c20, 4096) = -1 EIO (Input/output error)
20:48:33.238787 epoll_wait(3, [], 7, 0) = 0
20:48:33.238853 close(14)               = 0
20:48:33.238898 close(16)               = 0
20:48:33.238953 epoll_wait(3, [], 7, 0) = 0
20:48:33.239002 epoll_wait(3, [], 7, 0) = 0
20:48:33.239067 epoll_wait(3, [], 7, 1459) = 0
20:48:34.699731 epoll_wait(3, [], 7, 3000) = 0

After this, I see only periodic ">>> Jobs:" statistic writes in strace.
Comment 10 Petr Vaněk gentoo-dev 2023-12-04 21:30:02 UTC
(In reply to Zac Medico from comment #8)
> It might be interesting to see if PipeLoggerTestCase deadlocks, since that's
> the most obvious type of deadlock related to
> dd69ce742c62b9515cf7ae37e46bcf7f178777db. The test case was added in the
> commit just before:
> 
> https://gitweb.gentoo.org/proj/portage.git/commit/
> ?id=72ac22e722549833c1ee7e7ad1b585db55f7dafc
> 
> You can run it from a portage checkout like this:
> 
> pytest lib/portage/tests/process/test_PipeLogger.py

This test passes for me.
Comment 11 Zac Medico gentoo-dev 2023-12-04 22:53:23 UTC
The main pid (23755) is holding the /var/db/.pkg-ebuild.portage_lockfile lock, and would probably release it after it detects exit of the defunct 23838 pid, allowing   pid 23913 to acquire the lock.

The interaction with IPC could be due to the fact that process exit handling is different with IPC, since there is an ebuild-ipc exit callback.

The interaction with commit dd69ce742c62b9515cf7ae37e46bcf7f178777db could be due to the fact that the logging file descriptor is used to monitor for process exit.

I haven't been able to reproduce it, but I can review dd69ce742c62b9515cf7ae37e46bcf7f178777db in order to try and understand why that might prevent the process exit from being detected. A simple solution would be to stop using the logging file descriptor to detect process exit, and to instead simply rely on efficient exit detection via PidfdChildWatcher which is used since bug 914873. It's kind of nice to use the logging file descriptor for this though, since that naturally ensures that we do not prematurely close it and lose log output.
Comment 12 Zac Medico gentoo-dev 2023-12-04 23:08:33 UTC
The whole think kind of suggests that you have a deamonized process forking into the background an holding open the stdio file descriptor(s), as in bug 278895.
Comment 13 Zac Medico gentoo-dev 2023-12-04 23:19:07 UTC
FEATURES=ipc-sandbox is supposed to be an alternative solution for bug 278895, so do you have that enabled?  It won't help in pkg_* phases aside from pkg_setup, since we disabled it there for bug 673794.
Comment 14 Zac Medico gentoo-dev 2023-12-04 23:20:36 UTC
I mean FEATURES=pid-sandox.
Comment 15 Zac Medico gentoo-dev 2023-12-05 03:46:02 UTC
(In reply to Petr Vaněk from comment #9)
> 20:48:33.238736 read(14, 0x55f83c6d0c20, 4096) = -1 EIO (Input/output error)

It looks like the EOF is seen here, indicating process exit.

> 20:48:33.238787 epoll_wait(3, [], 7, 0) = 0
> 20:48:33.238853 close(14)               = 0

Here it closes the pty device because of the EOF.

> 20:48:33.238898 close(16)               = 0

Here it closes write end of the BuildLogger pipe, which is supposed to trigger EOF on the other end (fd 15) which is supposed to be handled by the BuildLogger's internal PipeLogger, but we never see the PipeLogger close fd 15.
Comment 16 Zac Medico gentoo-dev 2023-12-05 03:56:18 UTC
It looks like pid 23848 inherited fd 16 and held it open, preventing the EOF from being observed on fd 15 in the main process. So, the solution is to close fd 15 in every process where we fork (without exec). We have something similar in the portage.locks._close_fds() function which ForkProcess calls after fork.

We could avoid the issue by using multiprocessing.set_start_method("spawn"), which would prevent unwanted file descriptor inheritance in ForkProcess.
Comment 17 Petr Vaněk gentoo-dev 2023-12-05 09:28:18 UTC
(In reply to Zac Medico from comment #14)
> I mean FEATURES=pid-sandox.

Yes, pid-sandbox is enabled.

FEATURES="assume-digests binpkg-docompress binpkg-dostrip binpkg-logs buildpkg-live config-protect-if-modified distlocks ebuild-locks fixlafiles ipc-sandbox merge-sync multilib-strict network-sandbox news parallel-fetch parallel-install pid-sandbox pkgdir-index-trusted preserve-libs protect-owned qa-unresolved-soname-deps sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"

I think except of parallel-install I use default FEATURES in this scenario. But I completely forgot to mention that and it is also important piece to the puzzle, without parallel-install it works.
Comment 18 Petr Vaněk gentoo-dev 2023-12-05 09:33:05 UTC
(In reply to Zac Medico from comment #16)
> It looks like pid 23848 inherited fd 16 and held it open, preventing the EOF
> from being observed on fd 15 in the main process. So, the solution is to
> close fd 15 in every process where we fork (without exec). We have something
> similar in the portage.locks._close_fds() function which ForkProcess calls
> after fork.

Yes, you are right, 23848 inherits both fds, 15 and 16, after fork/clone.
Comment 19 Larry the Git Cow gentoo-dev 2023-12-06 16:25:01 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=df212738bbb209356472911cda79902f0e25918e

commit df212738bbb209356472911cda79902f0e25918e
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2023-12-05 04:23:56 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2023-12-06 16:22:56 +0000

    BuildLogger: Close self._stdin after fork
    
    In order to ensure that we can observe EOF on the read end of the pipe,
    close self._stdin after fork. Since portage.locks._close_fds() already
    does something similar for _lock_manager instances which have a close()
    method, it will also work with these _file_close_wrapper objects.
    
    The portage.locks._close_fds() function calls close after fork, in
    the ForkProcess._bootstrap method. For more general fork coverage,
    we could move the _close_fds() call to the _ForkWatcher.hook method
    in portage/__init__.py, but I've reserved that for a later change
    since _close_fds() has been working fine for us where we call it now.
    
    Bug: https://bugs.gentoo.org/919072
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/util/_async/BuildLogger.py | 34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)
Comment 20 Larry the Git Cow gentoo-dev 2023-12-10 22:35:33 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=0d365d80099d206e49b592abb30030642f8f09f9

commit 0d365d80099d206e49b592abb30030642f8f09f9
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2023-12-10 22:34:47 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-12-10 22:34:52 +0000

    sys-apps/portage: add 3.0.57
    
    Closes: https://bugs.gentoo.org/918929
    Closes: https://bugs.gentoo.org/913628
    Closes: https://bugs.gentoo.org/915474
    Closes: https://bugs.gentoo.org/918597
    Closes: https://bugs.gentoo.org/919072
    Closes: https://bugs.gentoo.org/919105
    Closes: https://bugs.gentoo.org/919174
    Closes: https://bugs.gentoo.org/919311
    Closes: https://bugs.gentoo.org/919419
    Closes: https://bugs.gentoo.org/919668
    Signed-off-by: Sam James <sam@gentoo.org>

 sys-apps/portage/Manifest              |   1 +
 sys-apps/portage/portage-3.0.57.ebuild | 242 +++++++++++++++++++++++++++++++++
 2 files changed, 243 insertions(+)
Comment 21 Larry the Git Cow gentoo-dev 2024-02-14 15:45:56 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=038ad1029ea574b106906380e47479db1041bee2

commit 038ad1029ea574b106906380e47479db1041bee2
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2024-02-14 05:55:31 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2024-02-14 06:04:22 +0000

    BuildLogger: Fix portage.locks._open_fds memory leak
    
    The _file_close_wrapper __getattribute__ method needs to
    be overridden to expose its close method, otherwise the
    underlying file's close method is called and the closed
    file object remains as a memory leak in the global
    portage.locks._open_fds dict. For reference, see similar
    classes like portage.util.atomic_ofstream which overrides
    methods in the same way.
    
    Bug: https://bugs.gentoo.org/919072
    Fixes: df212738bbb2 ("BuildLogger: Close self._stdin after fork")
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/util/_async/BuildLogger.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)