Emerging net-dns/c-ares (as a dependency of sys-auth/sssd) on one machine, I receive the message "ebuild-ipc timed out during write after 15 seconds, retrying..." ad infinitum (with the number of seconds varying, of course). I have not seen this issue with any other package. Puzzlingly, this happens only on one of two Gentoo servers - the other emerges sssd without issue. The one which emerges successfully runs a completely default kernel config generated by genkernel for sys-kernel/hardened-sources. The one which hangs runs a very custom config for sys-kernel/gentoo-sources. I will be attaching both kernels' .config. This bears a passing resemblance to #337764, but the workaround listed in that bug (rm -rf /var/tmp/portage/*, more or less) is not effective. The ipc use flag is forced on for portage in the profile the failing server uses, so I don't think disabling it is an option. Frankly, I don't know enough about Portage to dig into this more and see why it breaks. I am, however, very happy to take instruction to help. I am attaching a full log of a build attempt (it is too long to paste here).

Reproducible: Always

Steps to Reproduce:
1. (Unknown prerequisites)
2. Attempt to emerge net-dns/c-ares-1.9.1
3. Watch portage hang with the message "ebuild-ipc timed out during write after 15 seconds, retrying..."
emerge --info:

Portage 2.2.8-r2 (default/linux/amd64/13.0, gcc-4.7.3, glibc-2.19-r1, 3.14.14-gentoo x86_64)
=================================================================
System uname: Linux-3.14.14-gentoo-x86_64-Intel-R-_Atom-TM-_CPU_D525_@_1.80GHz-with-gentoo-2.2
KiB Mem:    4041400 total,   3540348 free
KiB Swap:   4194288 total,   4194288 free
Timestamp of tree: Thu, 02 Oct 2014 06:15:01 +0000
ld GNU ld (Gentoo 2.23.2 p1.0) 2.23.2
app-shells/bash:          4.2_p50
dev-lang/python:          2.7.7, 3.3.5-r1
dev-util/pkgconfig:       0.28-r1
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.12.4
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.69
sys-devel/automake:       1.12.6, 1.13.4
sys-devel/binutils:       2.23.2
sys-devel/gcc:            4.7.3-r1
sys-devel/gcc-config:     1.7.3
sys-devel/libtool:        2.4.2-r1
sys-devel/make:           3.82-r4
sys-kernel/linux-headers: 3.13 (virtual/os-headers)
sys-libs/glibc:           2.19-r1
Repositories: gentoo spresser-generic
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://lug.mtu.edu/gentoo/ rsync://mirrors.rit.edu/gentoo/ ftp://mirrors.rit.edu/gentoo/ http://mirrors.rit.edu/gentoo/"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"
USE="acl afs amd64 bash-completion berkdb bindist bzip2 cli cracklib crypt cxx dri fortran gdbm iconv ipv6 kerberos mmx modules multilib ncurses nls nptl openmp pam pcre readline session sse sse2 sse3 sse4_1 ssl ssse3 symlink tcpd unicode zlib"
ABI_X86="64"
ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci"
APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias"
CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author"
CAMERAS="ptp2"
COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog"
ELIBC="glibc"
GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ublox ubx"
INPUT_DEVICES="keyboard mouse evdev"
KERNEL="linux"
LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text"
LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer"
OFFICE_IMPLEMENTATION="libreoffice"
PHP_TARGETS="php5-5"
PYTHON_SINGLE_TARGET="python2_7"
PYTHON_TARGETS="python2_7 python3_3"
RUBY_TARGETS="ruby19 ruby20"
USERLAND="GNU"
VIDEO_CARDS="fbdev glint intel mach64 mga nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l"
XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, MAKEOPTS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON
Created attachment 385966 [details]
Log of emerge invocation which hangs

Created attachment 385968 [details]
Kernel config of the affected machine

Created attachment 385970 [details]
Kernel config of the UNaffected machine
While ebuild-ipc is timing out, run `lsof | grep /var/tmp/portage` to see what's going on with the ipc pipes.
steve@hawking ~ $ sudo lsof | grep /var/tmp/portage
Password:
misc-func  1407 root cwd DIR  253,1 4096 151035 /var/tmp/portage/net-dns/c-ares-1.9.1/image
python2.7  1681 root 3uW REG  253,1    0 187388 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1681 root 6r  FIFO 253,1  0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7  1684 root 3u  REG  253,1    0 187388 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1684 root 6r  FIFO 253,1  0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7  2479 root 3u  REG  253,1    0  20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7  7955 root 3u  REG  253,1    0 200193 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7  7955 root 6r  FIFO 253,1  0t0 198881 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
python2.7  9244 root 3u  REG  253,1    0  20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7 12360 root 3u  REG  253,1    0 151351 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 12360 root 6r  FIFO 253,1  0t0 131111 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
emerge    13313 root 5uW REG  253,1    0 151655 /var/tmp/portage/app-shells/.bash-4.2_p51.portage_lockfile
emerge    13313 root 9r  FIFO 253,1  0t0 155594 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_in
python2.7 14348 root 3u  REG  253,1    0  20295 /var/tmp/portage/.app-shells.portage_lockfile
emerge    15981 root 4r  FIFO 253,1  0t0 131112 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
emerge    15981 root 8uW REG  253,1    0 131100 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
python2.7 22354 root 3u  REG  253,1    0 200189 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 22354 root 6r  FIFO 253,1  0t0 198868 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
misc-func 25166 root cwd DIR  253,1 4096 187152 /var/tmp/portage/app-shells/bash-4.2_p51/image
python2.7 25494 root 3uW REG  253,1    0 187305 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_lock
python2.7 25494 root 6r  FIFO 253,1  0t0 155595 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_out
python2.7 25497 root 3u  REG  253,1    0 187305 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_lock
python2.7 25497 root 6r  FIFO 253,1  0t0 155595 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_out
python2.7 26652 root 3u  REG  253,1    0 151348 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 26652 root 6r  FIFO 253,1  0t0 131109 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
emerge    29980 root 4uW REG  253,1    0  20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7 29986 root 3u  REG  253,1    0 151655 /var/tmp/portage/app-shells/.bash-4.2_p51.portage_lockfile
(In reply to Steven Presser from comment #5)
> 200193 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 7955 root 6r FIFO 253,1 0t0
> 198881 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> python2.7 9244 root 3u REG 253,1 0
> 151351 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 12360 root 6r FIFO 253,1 0t0
> 131111 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> emerge 15981 root 4r FIFO 253,1 0t0
> 131112 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
> emerge 15981 root 8uW REG 253,1 0
> 131100 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
> python2.7 22354 root 3u REG 253,1 0
> 200189 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 22354 root 6r FIFO 253,1 0t0
> 198868 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> python2.7 26652 root 3u REG 253,1 0
> 151348 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 26652 root 6r FIFO 253,1 0t0
> 131109 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)

This clearly shows that something has deleted the .ipc_out and .ipc_lock files after they were opened, which explains the timeout. Maybe you have some script in /etc/portage/bashrc or /etc/portage/env that could have deleted those named pipes? Usually these kinds of issues are due to bashrc doing something bad.
Neither of those paths exists on my install. It's a brand-new install; no modifications were made to anything other than package.accept_keywords and package.use.

root@hawking ~ # ls -R /etc/portage/
/etc/portage/:
bin  make.conf  make.conf.catalyst  make.profile  package.accept_keywords  package.use  postsync.d  profile  savedconfig

/etc/portage/bin:
post_sync

/etc/portage/postsync.d:
demerge-record  q-reinitialize

/etc/portage/profile:

/etc/portage/savedconfig:
sys-apps

/etc/portage/savedconfig/sys-apps:
busybox-1.21.0
Is net-dns/c-ares the only package that you experience this problem with? After you've killed emerge, does lsof still report processes with open files in /var/tmp/portage? Maybe those stale processes interfere somehow. You should kill them all.
(In reply to Zac Medico from comment #8)
> Is net-dns/c-ares the only package that you experience this problem with?
>
> After you've killed emerge, does lsof still report processes with open files
> in /var/tmp/portage? Maybe those stale processes interfere somehow. You
> should kill them all.

So far, yes, c-ares is the only package I have seen this issue with. For kicks I re-emerged grub (I made a use flag update) and it worked.

Depending on how quickly I run lsof, I get different results. However, I think they all go to zero within a minute.
(In reply to Steven Presser from comment #9)
> Depending on how quickly I run lsof, I get different results. However, I
> think they all go to zero within a minute.

Well, looking again at comment #5, you can see that /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out is deleted in some cases and not in others. This means that you have some stale processes that have probably been hanging around for much longer than a minute. Please make sure you kill all those processes, and then try to re-merge c-ares.
(In reply to Zac Medico from comment #10)
> Well, looking again at comment #5, you can see that
> /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out is deleted in some cases and
> not in others. This means that you have some stale processes that have
> probably been hanging around for much longer than a minute. Please make sure
> you kill all those processes, and then try to re-merge c-ares.

And if ebuild-ipc still times out, let's look at the lsof output again, to see if there are still any ipc files that have been mysteriously deleted.
Nothing relevant running:

hawking ~ # ps ax
  PID TTY   STAT   TIME COMMAND
    1 ?     Ss     0:05 init [3]
    2 ?     S      0:00 [kthreadd]
    3 ?     S      1:01 [ksoftirqd/0]
    5 ?     S<     0:00 [kworker/0:0H]
    7 ?     S      0:26 [rcu_sched]
    8 ?     S      0:00 [rcu_bh]
    9 ?     S      0:00 [migration/0]
   10 ?     S      0:00 [migration/1]
   11 ?     S      1:12 [ksoftirqd/1]
   13 ?     S<     0:00 [kworker/1:0H]
   14 ?     S      0:02 [migration/2]
   15 ?     S      0:51 [ksoftirqd/2]
   17 ?     S<     0:00 [kworker/2:0H]
   18 ?     S      0:02 [migration/3]
   19 ?     S      1:02 [ksoftirqd/3]
   21 ?     S<     0:00 [kworker/3:0H]
   22 ?     S<     0:00 [khelper]
   23 ?     S      0:00 [kdevtmpfs]
   24 ?     S<     0:00 [netns]
  269 ?     S<     0:00 [writeback]
  272 ?     S<     0:00 [bioset]
  273 ?     S<     0:00 [crypto]
  275 ?     S<     0:00 [kblockd]
  376 ?     S<     0:00 [ata_sff]
  387 ?     S      0:00 [khubd]
  395 ?     S<     0:00 [md]
  532 ?     S      0:11 [kswapd0]
  601 ?     S      0:00 [fsnotify_mark]
  719 ?     S      0:00 [scsi_eh_0]
  720 ?     S<     0:00 [scsi_tmf_0]
  723 ?     S      0:00 [scsi_eh_1]
  724 ?     S<     0:00 [scsi_tmf_1]
  727 ?     S      0:00 [scsi_eh_2]
  728 ?     S<     0:00 [scsi_tmf_2]
  731 ?     S      0:00 [scsi_eh_3]
  732 ?     S<     0:00 [scsi_tmf_3]
  745 ?     S      0:00 [scsi_eh_4]
  746 ?     S<     0:00 [scsi_tmf_4]
  749 ?     S      0:00 [scsi_eh_5]
  750 ?     S<     0:00 [scsi_tmf_5]
  805 ?     S<     0:00 [raid5wq]
  810 ?     S<     0:00 [dm_bufio_cache]
  811 ?     S<     0:00 [kmpathd]
  812 ?     S<     0:00 [kmpath_handlerd]
  867 ?     S<     0:00 [ipv6_addrconf]
  887 ?     S<     0:00 [deferwq]
 2191 ?     Ss     0:00 sshd: steve [priv]
 2197 ?     Sl     0:01 sshd: steve@pts/2
 2202 pts/2 Ss+    0:00 -bash
 2438 ?     S<     3:29 [kworker/3:1H]
 2439 ?     S<     4:07 [kworker/1:1H]
 2445 ?     S<     0:00 [bioset]
 2447 ?     S    192:33 [md127_raid5]
 2454 ?     S<     0:00 [bioset]
 2455 ?     S      1:53 [md126_raid1]
 2457 ?     S<     2:49 [kworker/2:1H]
 2458 ?     S<     3:21 [kworker/0:1H]
 2467 ?     S<     0:00 [kdmflush]
 2468 ?     S<     0:00 [bioset]
 2472 ?     S<     0:00 [kdmflush]
 2473 ?     S<     0:00 [bioset]
 2476 ?     S<     0:00 [kdmflush]
 2477 ?     S<     0:00 [bioset]
 2481 ?     S<     0:00 [kdmflush]
 2482 ?     S<     0:00 [ksnaphd]
 2483 ?     S<     0:00 [kcopyd]
 2484 ?     S<     0:00 [bioset]
 2485 ?     S<     0:00 [bioset]
 2486 ?     S<     0:00 [bioset]
 2491 ?     S<     0:00 [kdmflush]
 2492 ?     S<     0:00 [bioset]
 2496 ?     S<     0:00 [kdmflush]
 2497 ?     S<     0:00 [bioset]
 2501 ?     S<     0:00 [kdmflush]
 2502 ?     S<     0:00 [bioset]
 2506 ?     S<     0:00 [kdmflush]
 2507 ?     S<     0:00 [bioset]
 2511 ?     S<     0:00 [kdmflush]
 2512 ?     S<     0:00 [bioset]
 2516 ?     S<     0:00 [kdmflush]
 2517 ?     S<     0:00 [bioset]
 2521 ?     S<     0:00 [kdmflush]
 2522 ?     S<     0:00 [bioset]
 2529 ?     S      0:07 [jbd2/dm-1-8]
 2530 ?     S<     0:00 [ext4-rsv-conver]
 2709 ?     Ss     0:00 /lib/systemd/systemd-udevd --daemon
 3007 ?     S      0:00 [jbd2/md126-8]
 3008 ?     S<     0:00 [ext4-rsv-conver]
 3009 ?     S      0:02 [jbd2/dm-6-8]
 3010 ?     S<     0:00 [ext4-rsv-conver]
 3011 ?     S      0:02 [jbd2/dm-9-8]
 3012 ?     S<     0:00 [ext4-rsv-conver]
 3013 ?     S      0:00 [jbd2/dm-10-8]
 3014 ?     S<     0:00 [ext4-rsv-conver]
 3116 ?     S      0:00 [kworker/2:0]
 3194 ?     S      0:00 [kworker/0:2]
 3395 ?     Ss     0:00 /sbin/lvmetad -p /run/lvmetad.pid
 3399 ?     S      0:00 [kworker/1:2]
 3421 ?     S      0:00 supervising syslog-ng
 3422 ?     Ss     0:42 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng
 3602 ?     Ss     0:01 dhcpcd -t 60 -L -m 2 enp3s0
 3648 ?     Ss     0:01 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3688 ?     Ss     0:03 /usr/sbin/crond
 3712 ?     Sl     7:40 /usr/bin/python2.7 /usr/bin/fail2ban-server -b -s /ru
 3747 ?     Ss     0:03 SCREEN -D -m -S irc irssi
 3764 ?     S      0:00 [kworker/2:1]
 3775 ?     Ss     0:00 mdadm --monitor --scan --daemonise --pid-file /var/ru
 3784 ?     S      0:00 [kworker/1:1]
 3817 pts/0 Ssl+   1:49 irssi
 3820 ?     Ss     0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/n
 3821 ?     S      0:00 nginx: worker process
 3847 ?     Ss     0:00 /usr/bin/k5start -b -p /var/run/nss-k5start.pid -f /e
 3866 ?     Ss     0:00 /usr/sbin/bosserver -nofork
 3871 ?     S      0:00 [kworker/3:2]
 3885 ?     S      0:05 /usr/libexec/openafs/buserver
 3886 ?     S      0:05 /usr/libexec/openafs/ptserver
 3887 ?     S      0:05 /usr/libexec/openafs/vlserver
 3888 ?     S<l    0:00 /usr/libexec/openafs/fileserver
 3889 ?     Sl     0:00 /usr/libexec/openafs/volserver
 3890 ?     S      0:00 /usr/libexec/openafs/upserver -crypt /etc/openafs/ser
 3921 ?     S      0:12 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3961 ?     Ss     0:00 sshd: steve [priv]
 3966 ?     Sl     0:01 sshd: steve@pts/3
 3971 pts/3 Ss     0:00 -bash
 3991 ?     S      0:00 /usr/sbin/smartd -p /run/smartd.pid
 4013 ?     Ss     0:01 /usr/sbin/sshd
 4033 pts/1 S+     0:00 screen -dr irc
 4035 tty1  Ss+    0:00 /sbin/agetty 38400 tty1 linux
 4036 tty2  Ss+    0:00 /sbin/agetty 38400 tty2 linux
 4037 tty3  Ss+    0:00 /sbin/agetty 38400 tty3 linux
 4038 tty4  Ss+    0:00 /sbin/agetty 38400 tty4 linux
 4039 tty5  Ss+    0:00 /sbin/agetty 38400 tty5 linux
 4040 tty6  Ss+    0:00 /sbin/agetty 38400 tty6 linux
 4071 pts/3 S      0:00 sudo su -
 4072 pts/3 S      0:00 su -
 4073 pts/3 S      0:00 -su
 4308 ?     Ssl    0:29 /usr/lib64/openldap/slapd -u ldap -g ldap -f /etc/ope
 4446 ?     Ss     0:08 /usr/sbin/krb5kdc
 5060 ?     S      0:03 [kworker/3:1]
 8657 ?     Ss     0:00 sshd: steve [priv]
 8661 ?     Sl     0:00 sshd: steve@pts/5
 8667 pts/5 Ss+    0:00 -bash
12200 ?     Ss     0:00 sshd: steve [priv]
12205 ?     Sl     0:00 sshd: steve@pts/1
12210 pts/1 Ss     0:00 -bash
12225 ?     S      0:00 [kworker/0:1]
27220 ?     S      0:00 [kworker/u8:1]
30113 ?     S      0:00 [kworker/u8:0]
30156 pts/3 R+     0:00 ps ax

Old file tree removed:

hawking ~ # rm -rf /var/tmp/portage/net-dns/

No stale files:

hawking ~ # lsof | grep /var/tmp/portage/
hawking ~ #

And it still hangs with no stale files after it is killed (~10 sec delay between the kill and the lsof completing). If you want, I'll reboot the machine (it's a server; I'd rather not if I can avoid it), but I did that in my initial debugging and it didn't help.
Let's see the `lsof | grep /var/tmp/portage/` output while it's hung again, but this time without the stale processes showing up. This will allow us to verify whether or not the expected ipc pipes are properly opened by the expected processes.
(In reply to Zac Medico from comment #13)
> Lets see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> but this time without the stale processes showing up. This will allow us to
> verify whether or not the expected ipc pipes are properly opened by the
> expected processes.

Sure:

hawking ~ # lsof | grep /var/tmp/portage/
misc-func   972 root cwd DIR  253,1 4096 151036 /var/tmp/portage/net-dns/c-ares-1.9.1/image
python2.7  1246 root 3uW REG  253,1    0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1246 root 6r  FIFO 253,1  0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7  1249 root 3u  REG  253,1    0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1249 root 6r  FIFO 253,1  0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
emerge    15821 root 5uW REG  253,1    0 131103 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
emerge    15821 root 9r  FIFO 253,1  0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
hawking ~ #

And ps ax:

hawking ~ # ps ax
  PID TTY   STAT   TIME COMMAND
    1 ?     Ss     0:05 init [3]
    2 ?     S      0:00 [kthreadd]
    3 ?     S      1:01 [ksoftirqd/0]
    5 ?     S<     0:00 [kworker/0:0H]
    7 ?     S      0:29 [rcu_sched]
    8 ?     S      0:00 [rcu_bh]
    9 ?     S      0:00 [migration/0]
   10 ?     S      0:00 [migration/1]
   11 ?     S      1:12 [ksoftirqd/1]
   13 ?     S<     0:00 [kworker/1:0H]
   14 ?     S      0:02 [migration/2]
   15 ?     S      0:51 [ksoftirqd/2]
   17 ?     S<     0:00 [kworker/2:0H]
   18 ?     S      0:02 [migration/3]
   19 ?     S      1:02 [ksoftirqd/3]
   21 ?     S<     0:00 [kworker/3:0H]
   22 ?     S<     0:00 [khelper]
   23 ?     S      0:00 [kdevtmpfs]
   24 ?     S<     0:00 [netns]
  269 ?     S<     0:00 [writeback]
  272 ?     S<     0:00 [bioset]
  273 ?     S<     0:00 [crypto]
  275 ?     S<     0:00 [kblockd]
  376 ?     S<     0:00 [ata_sff]
  387 ?     S      0:00 [khubd]
  395 ?     S<     0:00 [md]
  532 ?     S      0:11 [kswapd0]
  601 ?     S      0:00 [fsnotify_mark]
  719 ?     S      0:00 [scsi_eh_0]
  720 ?     S<     0:00 [scsi_tmf_0]
  723 ?     S      0:00 [scsi_eh_1]
  724 ?     S<     0:00 [scsi_tmf_1]
  727 ?     S      0:00 [scsi_eh_2]
  728 ?     S<     0:00 [scsi_tmf_2]
  731 ?     S      0:00 [scsi_eh_3]
  732 ?     S<     0:00 [scsi_tmf_3]
  745 ?     S      0:00 [scsi_eh_4]
  746 ?     S<     0:00 [scsi_tmf_4]
  749 ?     S      0:00 [scsi_eh_5]
  750 ?     S<     0:00 [scsi_tmf_5]
  805 ?     S<     0:00 [raid5wq]
  810 ?     S<     0:00 [dm_bufio_cache]
  811 ?     S<     0:00 [kmpathd]
  812 ?     S<     0:00 [kmpath_handlerd]
  867 ?     S<     0:00 [ipv6_addrconf]
  887 ?     S<     0:00 [deferwq]
  971 pts/3 S+     0:00 [net-dns/c-ares-1.9.1] sandbox /usr/lib64/portage/bin
  972 pts/3 S+     0:00 /bin/bash /usr/lib64/portage/bin/misc-functions.sh in
 1246 pts/3 S+     0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.
 1249 pts/3 S+     0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.
 1267 pts/5 R+     0:00 ps ax
 2191 ?     Ss     0:00 sshd: steve [priv]
 2197 ?     Sl     0:01 sshd: steve@pts/2
 2202 pts/2 Ss+    0:00 -bash
 2438 ?     S<     3:29 [kworker/3:1H]
 2439 ?     S<     4:07 [kworker/1:1H]
 2445 ?     S<     0:00 [bioset]
 2447 ?     S    192:34 [md127_raid5]
 2454 ?     S<     0:00 [bioset]
 2455 ?     S      1:53 [md126_raid1]
 2457 ?     S<     2:49 [kworker/2:1H]
 2458 ?     S<     3:21 [kworker/0:1H]
 2467 ?     S<     0:00 [kdmflush]
 2468 ?     S<     0:00 [bioset]
 2472 ?     S<     0:00 [kdmflush]
 2473 ?     S<     0:00 [bioset]
 2476 ?     S<     0:00 [kdmflush]
 2477 ?     S<     0:00 [bioset]
 2481 ?     S<     0:00 [kdmflush]
 2482 ?     S<     0:00 [ksnaphd]
 2483 ?     S<     0:00 [kcopyd]
 2484 ?     S<     0:00 [bioset]
 2485 ?     S<     0:00 [bioset]
 2486 ?     S<     0:00 [bioset]
 2491 ?     S<     0:00 [kdmflush]
 2492 ?     S<     0:00 [bioset]
 2496 ?     S<     0:00 [kdmflush]
 2497 ?     S<     0:00 [bioset]
 2501 ?     S<     0:00 [kdmflush]
 2502 ?     S<     0:00 [bioset]
 2506 ?     S<     0:00 [kdmflush]
 2507 ?     S<     0:00 [bioset]
 2511 ?     S<     0:00 [kdmflush]
 2512 ?     S<     0:00 [bioset]
 2516 ?     S<     0:00 [kdmflush]
 2517 ?     S<     0:00 [bioset]
 2521 ?     S<     0:00 [kdmflush]
 2522 ?     S<     0:00 [bioset]
 2529 ?     S      0:07 [jbd2/dm-1-8]
 2530 ?     S<     0:00 [ext4-rsv-conver]
 2709 ?     Ss     0:00 /lib/systemd/systemd-udevd --daemon
 3007 ?     S      0:00 [jbd2/md126-8]
 3008 ?     S<     0:00 [ext4-rsv-conver]
 3009 ?     S      0:02 [jbd2/dm-6-8]
 3010 ?     S<     0:00 [ext4-rsv-conver]
 3011 ?     S      0:02 [jbd2/dm-9-8]
 3012 ?     S<     0:00 [ext4-rsv-conver]
 3013 ?     S      0:00 [jbd2/dm-10-8]
 3014 ?     S<     0:00 [ext4-rsv-conver]
 3116 ?     S      0:00 [kworker/2:0]
 3395 ?     Ss     0:00 /sbin/lvmetad -p /run/lvmetad.pid
 3399 ?     S      0:00 [kworker/1:2]
 3421 ?     S      0:00 supervising syslog-ng
 3422 ?     Ss     0:43 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng
 3602 ?     Ss     0:01 dhcpcd -t 60 -L -m 2 enp3s0
 3648 ?     Ss     0:01 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3688 ?     Ss     0:03 /usr/sbin/crond
 3712 ?     Sl     7:43 /usr/bin/python2.7 /usr/bin/fail2ban-server -b -s /ru
 3747 ?     Ss     0:03 SCREEN -D -m -S irc irssi
 3775 ?     Ss     0:00 mdadm --monitor --scan --daemonise --pid-file /var/ru
 3784 ?     S      0:00 [kworker/1:1]
 3817 pts/0 Ssl+   1:49 irssi
 3820 ?     Ss     0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/n
 3821 ?     S      0:00 nginx: worker process
 3847 ?     Ss     0:00 /usr/bin/k5start -b -p /var/run/nss-k5start.pid -f /e
 3866 ?     Ss     0:00 /usr/sbin/bosserver -nofork
 3885 ?     S      0:05 /usr/libexec/openafs/buserver
 3886 ?     S      0:05 /usr/libexec/openafs/ptserver
 3887 ?     S      0:05 /usr/libexec/openafs/vlserver
 3888 ?     S<l    0:00 /usr/libexec/openafs/fileserver
 3889 ?     Sl     0:00 /usr/libexec/openafs/volserver
 3890 ?     S      0:00 /usr/libexec/openafs/upserver -crypt /etc/openafs/ser
 3921 ?     S      0:12 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3961 ?     Ss     0:00 sshd: steve [priv]
 3966 ?     Sl     0:01 sshd: steve@pts/3
 3971 pts/3 Ss     0:00 -bash
 3991 ?     S      0:00 /usr/sbin/smartd -p /run/smartd.pid
 4013 ?     Ss     0:01 /usr/sbin/sshd
 4033 pts/1 S+     0:00 screen -dr irc
 4035 tty1  Ss+    0:00 /sbin/agetty 38400 tty1 linux
 4036 tty2  Ss+    0:00 /sbin/agetty 38400 tty2 linux
 4037 tty3  Ss+    0:00 /sbin/agetty 38400 tty3 linux
 4038 tty4  Ss+    0:00 /sbin/agetty 38400 tty4 linux
 4039 tty5  Ss+    0:00 /sbin/agetty 38400 tty5 linux
 4040 tty6  Ss+    0:00 /sbin/agetty 38400 tty6 linux
 4071 pts/3 S      0:00 sudo su -
 4072 pts/3 S      0:00 su -
 4073 pts/3 S      0:00 -su
 4308 ?     Ssl    0:29 /usr/lib64/openldap/slapd -u ldap -g ldap -f /etc/ope
 4446 ?     Ss     0:08 /usr/sbin/krb5kdc
 5060 ?     S      0:03 [kworker/3:1]
 8657 ?     Ss     0:00 sshd: steve [priv]
 8661 ?     Sl     0:00 sshd: steve@pts/5
 8667 pts/5 Ss     0:00 -bash
12200 ?     Ss     0:00 sshd: steve [priv]
12205 ?     Sl     0:00 sshd: steve@pts/1
12210 pts/1 Ss     0:00 -bash
12225 ?     S      0:00 [kworker/0:1]
15760 ?     S      0:00 [kworker/0:0]
15813 ?     S      0:00 [kworker/2:1]
15814 ?     S      0:00 [kworker/u8:0]
15821 pts/3 S+     0:23 /usr/bin/python2.7 /usr/bin/emerge sssd
16249 pts/5 S      0:00 sudo su -
16444 pts/5 S      0:00 su -
16445 pts/5 S      0:00 -su
27220 ?     S      0:00 [kworker/u8:1]
30171 ?     S      0:00 [kworker/3:0]
32039 ?     S      0:00 [kworker/2:2]
(In reply to Zac Medico from comment #13)
> Lets see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> but this time without the stale processes showing up. This will allow us to
> verify whether or not the expected ipc pipes are properly opened by the
> expected processes.

FYI, I maintain c-ares. I looked through the ebuild and build system. I can't see anything unusual about it. The build system uses autotools, and configure.ac is pretty complex with lots of (too many?) custom checks. The Makefile.am is pretty straightforward, as is the ebuild. I've tested it with make -jN and I haven't hit a race.

So it's probably a bug in portage ipc, but why c-ares triggers it is beyond me.
Created attachment 386198 [details]
All packages installed + use flags

No idea if this is relevant, but it probably can't hurt... The OpenAFS package is from my personal overlay, but is based on the one in mainline portage (I changed the version number and the URIs). I can disable this overlay if that would be helpful.
(In reply to Anthony Basile from comment #15)
> FYI, I maintain c-ares. I looked through the ebuild and build system. I
> can't see anything unusual about it. The build system uses autotools and
> configure.ac is pretty complex with lots of (too many?) custom checks. The
> Makefile.am is pretty straight forward as is the ebuild. I've tested it
> with make -jN and I haven't hit a race.
>
> So its probably a bug in portage ipc, but why c-ares triggers it is beyond
> me.

Yeah, it's really puzzling. The ipc code is pretty simple, and appears to work perfectly on the vast majority of Gentoo systems out there. It will be interesting to find out what's causing this particular hiccup on this one user's server.

(In reply to Steven Presser from comment #14)
> (In reply to Zac Medico from comment #13)
> > Lets see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> > but this time without the stale processes showing up. This will allow us to
> > verify whether or not the expected ipc pipes are properly opened by the
> > expected processes.
>
> Sure:
> hawking ~ # lsof | grep /var/tmp/portage/
> misc-func   972 root cwd DIR  253,1 4096 151036 /var/tmp/portage/net-dns/c-ares-1.9.1/image
> python2.7  1246 root 3uW REG  253,1    0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
> python2.7  1246 root 6r  FIFO 253,1  0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
> python2.7  1249 root 3u  REG  253,1    0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
> python2.7  1249 root 6r  FIFO 253,1  0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
> emerge    15821 root 5uW REG  253,1    0 131103 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
> emerge    15821 root 9r  FIFO 253,1  0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in

The lsof output looks exactly as it should here, except that .ipc_in is not opened by the ebuild-ipc (python2.7) processes. This indicates that the ebuild-ipc open(write) call on .ipc_in is blocking.
Normally, it should not block, since the emerge process has already opened the read end of .ipc_in (confirmed in the lsof output). This looks like a kernel bug to me, since there's no valid reason for the open(write) call to block indefinitely here.
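[Editorial note] The reasoning above relies on POSIX fifo open() semantics: opening the write end blocks only while no process holds the read end open. A short self-contained sketch (throwaway temp paths, not portage's ipc files) demonstrates the expected behavior:

```python
import os
import tempfile

# A throwaway fifo for the demo -- nothing here touches portage.
path = os.path.join(tempfile.mkdtemp(), "demo_fifo")
os.mkfifo(path)

# POSIX guarantees O_RDONLY | O_NONBLOCK succeeds immediately, even
# before any writer exists.  This plays the role of emerge's read end.
rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)

# With a reader open, a plain (blocking) O_WRONLY open must return at
# once -- this is the call that hung forever in this bug.
wfd = os.open(path, os.O_WRONLY)
os.write(wfd, b"ping")
print(os.read(rfd, 4).decode())  # ping

os.close(wfd)
os.close(rfd)
os.unlink(path)
```

If the kernel honored these semantics, the writer-side open in ebuild-ipc could not have blocked in the lsof state shown above.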
@Steve: Can you use `strace -p <pid>` to verify that the ebuild-ipc (python2.7) process is blocking on the open call? I'm sure that's where it's blocking, but it would be nice to prove it. Note that there are 2 ebuild-ipc processes, due to a fork, so you want to strace the process with the higher pid (assuming that the kernel's pid counter hasn't wrapped around just before the fork).
(In reply to Zac Medico from comment #18)
> @Steve: Can you use `strace -p <pid>` to verify that the ebuild-ipc
> (python2.7) process is blocking on the open call? I'm sure that's where it's
> blocking, but it would be nice to prove it. Note that there are 2 ebuild-ipc
> processes, due to a fork, so you want to strace the process with the higher
> pid (assuming that the kernel's pid counter hasn't wrapped around just
> before the fork).

I hope this was what you meant:

hawking ~ # ps ax | grep ebuild-ipc
20518 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
20521 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
20523 pts/5 S+ 0:00 grep --colour=auto ebuild-ipc
hawking ~ # strace -p 20521
Process 20521 attached
futex(0x7fde7a4aa340, FUTEX_WAIT_PRIVATE, 2, NULL

Also, as I mentioned in my original report, this is a heavily stripped-down and modified kernel configuration. My suspicion is that I stripped something that ebuild-ipc depends on. I understand that this may make support harder, but it seems the system ought not to let me disable required kernel options.
(In reply to Steven Presser from comment #19)
> I hope this was what you meant:

Yes, this is perfect.

> hawking ~ # ps ax | grep ebuild-ipc
> 20518 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
> 20521 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
> 20523 pts/5 S+ 0:00 grep --colour=auto ebuild-ipc
> hawking ~ # strace -p 20521
> Process 20521 attached
> futex(0x7fde7a4aa340, FUTEX_WAIT_PRIVATE, 2, NULL

It looks like python is deadlocking in a futex call, and a quick google search seems to indicate that this sort of deadlock is common when mixing threads and forks. However, ebuild-ipc uses only forks (no threads used).

@kernel: Removing you from CC, since it now seems that a python deadlock may be the culprit.
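[Editorial note] The general thread-plus-fork hazard alluded to here can be shown deterministically. This is an illustrative sketch of the failure class only, not ebuild-ipc's code (which, as noted, does not itself create threads):

```python
import os
import threading
import time

# fork() copies the whole address space, including a lock currently held
# by another thread -- but only the forking thread survives in the child.
# The child's copy of the lock therefore stays locked forever.
lock = threading.Lock()
held = threading.Event()

def holder():
    with lock:
        held.set()
        time.sleep(1.0)  # keep the lock held across the fork below

t = threading.Thread(target=holder)
t.start()
held.wait()

pid = os.fork()
if pid == 0:
    # Child: a plain lock.acquire() here would hang forever (a futex wait
    # much like the one in the strace above); a timeout shows it never
    # succeeds, because no thread exists in the child to release it.
    acquired = lock.acquire(timeout=0.2)
    os._exit(0 if acquired else 42)
else:
    _, status = os.waitpid(pid, 0)
    print("child acquired lock:", os.WEXITSTATUS(status) == 0)
    t.join()
```

Running this prints `child acquired lock: False`: the forked child can never take the lock its parent's thread held at fork time.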
I'm thinking about writing a patch to make it use a thread instead of a fork if python threading is enabled. If it's a python bug involving forks, then using a thread may be a good workaround.
Is this problem also reproducible after switching to Python 3.3 or 3.4?
Created attachment 386218 [details, diff]
bin/ebuild-ipc.py: use threading if available

If threading is available, then use a thread instead of a fork for writing to the fifo. This has the advantage of avoiding a possible python futex deadlock following fork.

@Steve: Please test if this patch fixes your problem.
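[Editorial note] The idea behind this patch can be sketched as follows. The function name and structure here are illustrative, not portage's actual implementation: a helper thread performs the potentially blocking fifo open/write, so the main thread can enforce a timeout without forking a second interpreter.

```python
import os
import tempfile
import threading

def write_in_thread(fifo_path, data, timeout):
    """Write data to a fifo from a helper thread; give up after timeout."""
    result = {}

    def writer():
        # This open() blocks until some process opens the read end.
        fd = os.open(fifo_path, os.O_WRONLY)
        try:
            result["n"] = os.write(fd, data)
        finally:
            os.close(fd)

    t = threading.Thread(target=writer)
    t.daemon = True  # a stuck writer must not keep the process alive
    t.start()
    t.join(timeout)
    return result.get("n")  # None means the write timed out

# Demo: a reader is already waiting, so the write completes promptly.
path = os.path.join(tempfile.mkdtemp(), "demo_fifo")
os.mkfifo(path)
rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
print(write_in_thread(path, b"hi", timeout=5))  # 2
print(os.read(rfd, 2))  # b'hi'
```

The daemon flag is the key design choice: if the open() deadlocks, the thread is abandoned at interpreter exit instead of wedging the whole ebuild-ipc process.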
@Arfrever: Yes, the issue persists with python 3.3.

@Zac: That seems to have unstuck it!

Interestingly, I've noticed that my nightly update script (https://github.com/spresse1/system-utilities/tree/master/gentoo-autoupdate should be up to date) is stalling during its update phase. Obviously, it's been getting unstuck the first time each day that I run killall -9 emerge, so I don't think this has had any effect on the behavior, but it is nonetheless interesting. It looks like it's stalling on app-shells/bash-4.2_p53. I will verify this (by undoing the patch) once I have sssd emerged and report back.
Welp, glibc bailed on me, so there's another issue to go look at... In the meantime, I can confirm that app-shells/bash-4.2_p53 also triggers this issue. I'd be happy to provide a build log if desired, but my suspicion is that nothing useful will show up. However, it is notable that both packages get stuck in the same place:

>>> Completed installing bash-4.2_p53 into /var/tmp/portage/app-shells/bash-4.2_p53/image/

strip: x86_64-pc-linux-gnu-strip --strip-unneeded -R .comment -R .GCC.command.line -R .note.gnu.gold-version bin/bash
ecompressdir: bzip2 -9 /usr/share/info
ecompressdir: bzip2 -9 /usr/share/man
ecompressdir: bzip2 -9 /usr/share/doc

As I mentioned, the patch works and I have updated bash. I'm providing this in case it is somehow useful.
(In reply to Steven Presser from comment #24) > @Zac: That seems to have unstuck it! Great, thanks for testing.
Created attachment 386272 [details, diff]
bin/ebuild-ipc.py: use threading if available

This updated patch is fixed to prevent ebuild-ipc processes from hanging around in some cases, like bug 345455. I'll be maintaining this patch in the following branch: https://github.com/zmedico/portage/tree/bug_524328
Created attachment 386316 [details, diff]
bin/ebuild-ipc.py: nonblocking fifo write

This updated patch uses a nonblocking write so that no thread or fork is necessary. I've tested it on Linux and FreeBSD, and it seems to work fine.

Though I don't think we'll want to use it, the threading patch is still available in this branch: https://github.com/zmedico/portage/tree/bug_524328_threading
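[Editorial note] The nonblocking approach can be sketched like this; the function name and retry policy are illustrative, not the actual patch code. Opening the write end with O_NONBLOCK means open() fails immediately with ENXIO when no reader exists, instead of blocking, so the caller can retry or report an error on its own schedule:

```python
import errno
import os
import tempfile

def nonblocking_fifo_write(fifo_path, data):
    """Try to write to a fifo without ever blocking in open()."""
    try:
        # With O_NONBLOCK, open-for-write raises ENXIO if no reader exists,
        # rather than hanging the way a blocking open can.
        fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return False  # no reader yet; the caller can retry later
        raise
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    return True

# Demo with a throwaway fifo (not portage's ipc files):
path = os.path.join(tempfile.mkdtemp(), "demo_fifo")
os.mkfifo(path)
print(nonblocking_fifo_write(path, b"ok"))  # False: no reader yet
rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
print(nonblocking_fifo_write(path, b"ok"))  # True: reader end is open
print(os.read(rfd, 2))  # b'ok'
```

Because the open() itself can no longer hang, no helper thread or forked child is needed at all, which is why this version supersedes the threading patch.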
This is in git: https://github.com/gentoo/portage/commit/0903a300c9d79900389e59eab6fcfe9f12dbab51
This is fixed in 2.2.14.