systemd-networkd blocks forever at boot in initgroups(3) because rpcbind.socket is allowed to start before network.target, but rpcbind.service is not allowed to start until after network.target. The result is a deadlock.

I was also able to reproduce this at boot time in a systemd debug shell with a simple program that only calls initgroups().

Here is a graphical depiction of the relevant before/after dependencies. (a -> b means that when both unit a and unit b are in a transaction, unit b is not activated until after unit a is activated.)

rpcbind.socket -> rpcbind.service
systemd-networkd.service -> network.target -> rpcbind.service

Here is what happens, in order, at boot:

1. systemd activates rpcbind.socket. This means systemd itself binds a socket to (among others) TCP port 111. At this point systemd has performed a listen(2) on the socket. I am not sure whether it also performs an accept(2), but even if it does, it does not read from the resulting connection socket.

2. systemd activates systemd-networkd.service, which starts the systemd-networkd process. Note that systemd-networkd is a notify service.

3. The systemd-networkd process calls initgroups("systemd-network", 982). This ends up trying to use NIS to get the group names, using code in libnss-nis. That code successfully connects to the TCP server socket and sends a message, then waits for a response. The response never comes, even though the socket is connected and will never be disconnected.

The way this is supposed to work is that once systemd is allowed to activate rpcbind.service, rpcbind gets the socket passed to it, starts reading from it, and responds to requests. However, this will never happen, because systemd-networkd is apparently blocked in initgroups() before it considers itself to be activated. So it will never notify activation, systemd will never activate network.target, and therefore systemd will never activate rpcbind.service.
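The client-side behaviour described in step 3 can be illustrated outside of systemd. The following is only a minimal sketch, not the reporter's reproducer: it assumes a Linux TCP stack, uses an ephemeral loopback port instead of port 111, and the function name `probe_unaccepted_listener` is made up for illustration. It shows that a client can connect to and write to a socket that is merely listening, and then waits for a reply that never arrives (here cut short by a timeout instead of blocking forever).

```python
import socket


def probe_unaccepted_listener(timeout=0.5):
    """Connect to a socket that listens but never accepts or reads.

    Returns a dict describing what the client observed.
    """
    # Stand-in for the socket held open on behalf of rpcbind.socket:
    # bound and listening, but nothing ever calls accept() or recv() on it.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # ephemeral port (assumption; not port 111)
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(timeout)  # the real client in the report has no such timeout
    result = {"connected": False, "sent": 0, "reply": None, "timed_out": False}
    try:
        # The kernel completes the TCP handshake for a listening socket,
        # so connect() succeeds even though nobody has accepted yet.
        client.connect(("127.0.0.1", port))
        result["connected"] = True
        # The request is queued in the server-side receive buffer.
        result["sent"] = client.send(b"request")
        try:
            result["reply"] = client.recv(1024)
        except socket.timeout:
            # No process is reading the request, so no reply is ever written.
            result["timed_out"] = True
    finally:
        client.close()
        listener.close()
    return result


if __name__ == "__main__":
    print(probe_unaccepted_listener())
```

Without the timeout, the `recv()` call would wait indefinitely, which matches the state systemd-networkd is stuck in inside initgroups().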
I know this is what's happening from using gdb to look at the stack traces of systemd-networkd while in this state, and also by stepping through my initgroups program. Using netstat during this time, I can also see that the receive queue of the connected socket on the server side is nonzero, indicating that systemd is not reading from the socket.

I was able to work around this bug by removing the ( network.target -> rpcbind.service ) dependency as shown below:

harvell@wolfhound system$ diff -u /lib/systemd/system/rpcbind.service /etc/systemd/system/rpcbind.service
--- /lib/systemd/system/rpcbind.service 2018-03-05 11:31:17.369211800 -0700
+++ /etc/systemd/system/rpcbind.service 2018-03-08 19:37:50.341803695 -0700
@@ -1,6 +1,6 @@
 [Unit]
 Description=RPC Bind
-After=network.target
+#After=network.target
 Wants=rpcbind.target
 Before=rpcbind.target

I think the correct solution to this problem is for net-nds/rpcbind to make the same change in /lib/systemd/system/rpcbind.service.
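Why dropping the After=network.target line is enough can be sketched as a wait-for graph. This is a simplified model under stated assumptions, not how systemd computes transactions: each edge means "cannot finish activating until the other is active", the edge from systemd-networkd.service to rpcbind.service stands for the runtime wait inside initgroups() rather than any unit-file setting, and the helper names `build_waits_on` and `find_cycle` are hypothetical.

```python
def build_waits_on(rpcbind_after_network_target):
    """Build a 'waits on' graph for the units discussed in this report."""
    rpcbind_deps = ["rpcbind.socket"]
    if rpcbind_after_network_target:
        # The After=network.target line in the installed rpcbind.service.
        rpcbind_deps.insert(0, "network.target")
    return {
        # network.target is ordered after systemd-networkd.service.
        "network.target": ["systemd-networkd.service"],
        "rpcbind.service": rpcbind_deps,
        # Runtime wait (not a unit dependency): initgroups() blocks until
        # something reads the request queued on the rpcbind socket.
        "systemd-networkd.service": ["rpcbind.service"],
        "rpcbind.socket": [],
    }


def find_cycle(waits_on):
    """Return a list of nodes forming a cycle, or None if there is none."""
    visiting, done, path = set(), set(), []

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: the cycle is the current path from 'node' onward.
            return path[path.index(node):] + [node]
        visiting.add(node)
        path.append(node)
        for nxt in waits_on.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for start in list(waits_on):
        cycle = visit(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print("with After=network.target:   ", find_cycle(build_waits_on(True)))
    print("without After=network.target:", find_cycle(build_waits_on(False)))
```

With the After= line present the model contains a cycle through systemd-networkd.service, rpcbind.service and network.target; with the line commented out, rpcbind.service only waits on rpcbind.socket and the cycle disappears.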
Created attachment 523098: output of emerge --info
jharvell@wolfhound system$ eix net-nds/rpcbind
[I] net-nds/rpcbind
     Available versions:  0.2.4-r1 **9999 {debug selinux systemd tcpd warmstarts}
     Installed versions:  0.2.4-r1(11:31:18 05/03/2018)(systemd tcpd -debug -selinux -warmstarts)
     Homepage:            https://sourceforge.net/projects/rpcbind/
     Description:         portmap replacement which supports RPC over various protocols

jharvell@wolfhound system$ eix sys-libs/glibc
[I] sys-libs/glibc
     Available versions:  (2.2) [M]2.17^s[1] [M](~)2.18-r1^s [M](~)2.18-r1^s[1] [M]2.19-r1^s [M]2.19-r1^s[1] [M]**2.19-r2^s [M]2.20-r2^s [M]2.20-r2^s[1] [M]2.21-r2^s [M]2.21-r2^s[1] [M]2.22-r4^s [M]2.22-r4^s[1] [M]2.23-r3^s[1] [M]2.23-r4^s [M]2.23-r4^s[1] [M](~)2.24-r3^s[1] [M](~)2.24-r4^s [M]**2.25-r2^s[1] 2.25-r9^s 2.25-r10^s **2.25-r11^s (~)2.26-r5^s (~)2.26-r6^s **2.27-r1^s **9999^s **9999^s[1]
       {audit caps compile-locales crosscompile_opts_headers-only debug doc gd hardened headers-only multilib nscd profile +rpc selinux suid systemtap vanilla}
     Installed versions:  2.26-r6(2.2)^s(19:30:32 08/03/2018)(caps multilib suid -audit -debug -doc -gd -hardened -headers-only -nscd -profile -selinux -systemtap -vanilla)
     Homepage:            https://www.gnu.org/software/libc/libc.html
     Description:         GNU libc6 (also called glibc2) C library

[1] "wolfhound" /opt/portage

jharvell@wolfhound system$ eix sys-auth/libnss-nis
[I] sys-auth/libnss-nis
     Available versions:  (~)1.4 {ABI_MIPS="n32 n64 o32" ABI_PPC="32 64" ABI_S390="32 64" ABI_X86="32 64 x32"}
     Installed versions:  1.4(19:34:42 08/03/2018)(ABI_MIPS="-n32 -n64 -o32" ABI_PPC="-32 -64" ABI_S390="-32 -64" ABI_X86="64 -32 -x32")
     Homepage:            https://github.com/thkukuk/libnss_nis
     Description:         NSS module to provide NIS support
Contents of nsswitch.conf. Note that the bug exists regardless of whether I use the uncommented or the commented line for group.

jharvell@wolfhound system$ cat /etc/nsswitch.conf
# /etc/nsswitch.conf:
# $Header: /var/cvsroot/gentoo/src/patchsets/glibc/extra/etc/nsswitch.conf,v 1.1 2006/09/29 23:52:23 vapier Exp $

#passwd: compat
#shadow: compat
#group: compat
passwd: files nis
shadow: files nis
group: files [success=merge] nis
#group: files nis

hosts: files dns
networks: files dns

services: db files
protocols: db files
rpc: db files
ethers: db files
netmasks: files
netgroup: files nis
bootparams: files

automount: files nis
aliases: files
Created attachment 523100: stack trace of systemd-networkd while problem is manifesting
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=fbaf911f4355d5c9992694288b586dcbc5f154cc

commit fbaf911f4355d5c9992694288b586dcbc5f154cc
Author: Mike Gilbert <floppym@gentoo.org>
AuthorDate: 2018-03-10 14:09:43 +0000
Commit: Mike Gilbert <floppym@gentoo.org>
CommitDate: 2018-03-10 14:09:43 +0000

    net-nds/rpcbind: use upstream rpcbind.service

    Closes: https://bugs.gentoo.org/650030
    Package-Manager: Portage-2.3.24, Repoman-2.3.6_p81

 net-nds/rpcbind/files/rpcbind.service | 13 -------------
 .../{rpcbind-0.2.4-r1.ebuild => rpcbind-0.2.4-r2.ebuild} | 4 +---
 net-nds/rpcbind/rpcbind-9999.ebuild | 2 --
 3 files changed, 1 insertion(+), 18 deletions(-)
Good analysis. It turns out we were installing the wrong file by accident here.
This fix is causing issues if you are using systemd and have built rpcbind without warmstarts. The upstream systemd unit passes -w, but Gentoo by default builds rpcbind without support for warmstarts, so it just throws a usage error and never starts up due to not knowing the -w option.
(In reply to Timo Rothenpieler from comment #7) Ah. Do you see any problem with enabling warm starts unconditionally at build time? It looks like it is only enabled at runtime when the -w flag is passed anyway.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=238eaeb1245f965ce01b4a9a7519bc135b7a410a

commit 238eaeb1245f965ce01b4a9a7519bc135b7a410a
Author: Mike Gilbert <floppym@gentoo.org>
AuthorDate: 2018-03-12 17:27:46 +0000
Commit: Mike Gilbert <floppym@gentoo.org>
CommitDate: 2018-03-12 17:28:58 +0000

    profiles: systemd: enable warmstarts by default for net-nds/rpcbind

    Bug: https://bugs.gentoo.org/650030#c7

 profiles/targets/systemd/package.use | 6 ++++++
 1 file changed, 6 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=0bffded2ba7ff5c3c5660c19c829a6ffeedea353

commit 0bffded2ba7ff5c3c5660c19c829a6ffeedea353
Author: Mike Gilbert <floppym@gentoo.org>
AuthorDate: 2018-03-12 17:24:07 +0000
Commit: Mike Gilbert <floppym@gentoo.org>
CommitDate: 2018-03-12 17:28:58 +0000

    net-nds/rpcbind: require warmstarts for systemd

    Bug: https://bugs.gentoo.org/650030#c7
    Package-Manager: Portage-2.3.24, Repoman-2.3.6_p81

 net-nds/rpcbind/{rpcbind-0.2.4-r2.ebuild => rpcbind-0.2.4-r3.ebuild} | 1 +
 net-nds/rpcbind/rpcbind-9999.ebuild | 1 +
 2 files changed, 2 insertions(+)