| Summary: | dev-lang/python-2.7.7: net-dns/c-ares-1.9.1 triggers ebuild-ipc timed out during write, due to python deadlock in futex call after fork (no threads used) | | |
|---|---|---|---|
| Product: | Portage Development | Reporter: | Steven Presser <steve> |
| Component: | Unclassified | Assignee: | Portage team <dev-portage> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | blueness, python |
| Priority: | Normal | Keywords: | InVCS, PATCH, TESTED |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | Linux | | |
| See Also: | https://bugs.gentoo.org/show_bug.cgi?id=334423 | | |
| Whiteboard: | | | |
| Package list: | | Runtime testing required: | --- |
| Bug Depends on: | | | |
| Bug Blocks: | 484436 | | |

Attachments:
- Log of emerge invocation which hangs
- Kernel config of the affected machine
- Kernel config of the UNaffected machine
- All packages installed + use flags
- bin/ebuild-ipc.py: use threading if available
- bin/ebuild-ipc.py: use threading if available (updated)
- bin/ebuild-ipc.py: nonblocking fifo write
Description
Steven Presser
2014-10-03 03:04:34 UTC
Created attachment 385966 [details]
Log of emerge invocation which hangs
Created attachment 385968 [details]
Kernel config of the affected machine
Created attachment 385970 [details]
Kernel config of the UNaffected machine
While ebuild-ipc is timing out, run `lsof | grep /var/tmp/portage` to see what's going on with the ipc pipes.

steve@hawking ~ $ sudo lsof | grep /var/tmp/portage
Password:
misc-func 1407 root cwd DIR 253,1 4096 151035 /var/tmp/portage/net-dns/c-ares-1.9.1/image
python2.7 1681 root 3uW REG 253,1 0 187388 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7 1681 root 6r FIFO 253,1 0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7 1684 root 3u REG 253,1 0 187388 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7 1684 root 6r FIFO 253,1 0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7 2479 root 3u REG 253,1 0 20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7 7955 root 3u REG 253,1 0 200193 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 7955 root 6r FIFO 253,1 0t0 198881 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
python2.7 9244 root 3u REG 253,1 0 20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7 12360 root 3u REG 253,1 0 151351 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 12360 root 6r FIFO 253,1 0t0 131111 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
emerge 13313 root 5uW REG 253,1 0 151655 /var/tmp/portage/app-shells/.bash-4.2_p51.portage_lockfile
emerge 13313 root 9r FIFO 253,1 0t0 155594 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_in
python2.7 14348 root 3u REG 253,1 0 20295 /var/tmp/portage/.app-shells.portage_lockfile
emerge 15981 root 4r FIFO 253,1 0t0 131112 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
emerge 15981 root 8uW REG 253,1 0 131100 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
python2.7 22354 root 3u REG 253,1 0 200189 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 22354 root 6r FIFO 253,1 0t0 198868 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
misc-func 25166 root cwd DIR 253,1 4096 187152 /var/tmp/portage/app-shells/bash-4.2_p51/image
python2.7 25494 root 3uW REG 253,1 0 187305 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_lock
python2.7 25494 root 6r FIFO 253,1 0t0 155595 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_out
python2.7 25497 root 3u REG 253,1 0 187305 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_lock
python2.7 25497 root 6r FIFO 253,1 0t0 155595 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_out
python2.7 26652 root 3u REG 253,1 0 151348 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 26652 root 6r FIFO 253,1 0t0 131109 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
emerge 29980 root 4uW REG 253,1 0 20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7 29986 root 3u REG 253,1 0 151655 /var/tmp/portage/app-shells/.bash-4.2_p51.portage_lockfile

(In reply to Steven Presser from comment #5)
> python2.7 7955 root 3u REG 253,1 0 200193 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 7955 root 6r FIFO 253,1 0t0 198881 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> python2.7 12360 root 3u REG 253,1 0 151351 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 12360 root 6r FIFO 253,1 0t0 131111 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> emerge 15981 root 4r FIFO 253,1 0t0 131112 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
> emerge 15981 root 8uW REG 253,1 0 131100 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
> python2.7 22354 root 3u REG 253,1 0 200189 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 22354 root 6r FIFO 253,1 0t0 198868 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> python2.7 26652 root 3u REG 253,1 0 151348 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 26652 root 6r FIFO 253,1 0t0 131109 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)

This clearly shows that something has deleted the .ipc_out and .ipc_lock files after they were opened, which explains the timeout.

Maybe you have some script in /etc/portage/bashrc or /etc/portage/env that could have deleted those named pipes? Usually these kinds of issues are due to bashrc doing something bad.

Neither of those paths exists on my install. It's a brand-new install; no modifications made to anything other than package.accept_keywords and package.use.

root@hawking ~ # ls -R /etc/portage/
/etc/portage/:
bin  make.conf  make.conf.catalyst  make.profile  package.accept_keywords  package.use  postsync.d  profile  savedconfig

/etc/portage/bin:
post_sync

/etc/portage/postsync.d:
demerge-record  q-reinitialize

/etc/portage/profile:

/etc/portage/savedconfig:
sys-apps

/etc/portage/savedconfig/sys-apps:
busybox-1.21.0

Is net-dns/c-ares the only package that you experience this problem with?

After you've killed emerge, does lsof still report processes with open files in /var/tmp/portage? Maybe those stale processes interfere somehow. You should kill them all.

(In reply to Zac Medico from comment #8)
> Is net-dns/c-ares the only package that you experience this problem with?
>
> After you've killed emerge, does lsof still report processes with open files
> in /var/tmp/portage? Maybe those stale processes interfere somehow. You
> should kill them all.

So far, yes, c-ares is the only package I have seen this issue with. For kicks I re-emerged grub (I made a use flag update) and it worked.

Depending on how quickly I run lsof, I get different results. However, I think they all go to zero within a minute.

(In reply to Steven Presser from comment #9)
> Depending on how quickly I run lsof, I get different results. However, I
> think they all go to zero within a minute.

Well, looking again at comment #5, you can see that /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out is deleted in some cases and not in others. This means that you have some stale processes that have probably been hanging around for much longer than a minute. Please make sure you kill all those processes, and then try to re-merge c-ares.
(In reply to Zac Medico from comment #10)
> Well, looking again at comment #5, you can see that
> /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out is deleted in some cases and
> not in others. This means that you have some stale processes that have
> probably been hanging around for much longer than a minute. Please make sure
> you kill all those processes, and then try to re-merge c-ares.

And if ebuild-ipc still times out, let's look at the lsof output again, to see if there are still any ipc files that have been mysteriously deleted.

Nothing relevant running:

hawking ~ # ps ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:05 init [3]
2 ? S 0:00 [kthreadd]
3 ? S 1:01 [ksoftirqd/0]
5 ? S< 0:00 [kworker/0:0H]
7 ? S 0:26 [rcu_sched]
8 ? S 0:00 [rcu_bh]
9 ? S 0:00 [migration/0]
10 ? S 0:00 [migration/1]
11 ? S 1:12 [ksoftirqd/1]
13 ? S< 0:00 [kworker/1:0H]
14 ? S 0:02 [migration/2]
15 ? S 0:51 [ksoftirqd/2]
17 ? S< 0:00 [kworker/2:0H]
18 ? S 0:02 [migration/3]
19 ? S 1:02 [ksoftirqd/3]
21 ? S< 0:00 [kworker/3:0H]
22 ? S< 0:00 [khelper]
23 ? S 0:00 [kdevtmpfs]
24 ? S< 0:00 [netns]
269 ? S< 0:00 [writeback]
272 ? S< 0:00 [bioset]
273 ? S< 0:00 [crypto]
275 ? S< 0:00 [kblockd]
376 ? S< 0:00 [ata_sff]
387 ? S 0:00 [khubd]
395 ? S< 0:00 [md]
532 ? S 0:11 [kswapd0]
601 ? S 0:00 [fsnotify_mark]
719 ? S 0:00 [scsi_eh_0]
720 ? S< 0:00 [scsi_tmf_0]
723 ? S 0:00 [scsi_eh_1]
724 ? S< 0:00 [scsi_tmf_1]
727 ? S 0:00 [scsi_eh_2]
728 ? S< 0:00 [scsi_tmf_2]
731 ? S 0:00 [scsi_eh_3]
732 ? S< 0:00 [scsi_tmf_3]
745 ? S 0:00 [scsi_eh_4]
746 ? S< 0:00 [scsi_tmf_4]
749 ? S 0:00 [scsi_eh_5]
750 ? S< 0:00 [scsi_tmf_5]
805 ? S< 0:00 [raid5wq]
810 ? S< 0:00 [dm_bufio_cache]
811 ? S< 0:00 [kmpathd]
812 ? S< 0:00 [kmpath_handlerd]
867 ? S< 0:00 [ipv6_addrconf]
887 ? S< 0:00 [deferwq]
2191 ? Ss 0:00 sshd: steve [priv]
2197 ? Sl 0:01 sshd: steve@pts/2
2202 pts/2 Ss+ 0:00 -bash
2438 ? S< 3:29 [kworker/3:1H]
2439 ? S< 4:07 [kworker/1:1H]
2445 ? S< 0:00 [bioset]
2447 ? S 192:33 [md127_raid5]
2454 ? S< 0:00 [bioset]
2455 ? S 1:53 [md126_raid1]
2457 ? S< 2:49 [kworker/2:1H]
2458 ? S< 3:21 [kworker/0:1H]
2467 ? S< 0:00 [kdmflush]
2468 ? S< 0:00 [bioset]
2472 ? S< 0:00 [kdmflush]
2473 ? S< 0:00 [bioset]
2476 ? S< 0:00 [kdmflush]
2477 ? S< 0:00 [bioset]
2481 ? S< 0:00 [kdmflush]
2482 ? S< 0:00 [ksnaphd]
2483 ? S< 0:00 [kcopyd]
2484 ? S< 0:00 [bioset]
2485 ? S< 0:00 [bioset]
2486 ? S< 0:00 [bioset]
2491 ? S< 0:00 [kdmflush]
2492 ? S< 0:00 [bioset]
2496 ? S< 0:00 [kdmflush]
2497 ? S< 0:00 [bioset]
2501 ? S< 0:00 [kdmflush]
2502 ? S< 0:00 [bioset]
2506 ? S< 0:00 [kdmflush]
2507 ? S< 0:00 [bioset]
2511 ? S< 0:00 [kdmflush]
2512 ? S< 0:00 [bioset]
2516 ? S< 0:00 [kdmflush]
2517 ? S< 0:00 [bioset]
2521 ? S< 0:00 [kdmflush]
2522 ? S< 0:00 [bioset]
2529 ? S 0:07 [jbd2/dm-1-8]
2530 ? S< 0:00 [ext4-rsv-conver]
2709 ? Ss 0:00 /lib/systemd/systemd-udevd --daemon
3007 ? S 0:00 [jbd2/md126-8]
3008 ? S< 0:00 [ext4-rsv-conver]
3009 ? S 0:02 [jbd2/dm-6-8]
3010 ? S< 0:00 [ext4-rsv-conver]
3011 ? S 0:02 [jbd2/dm-9-8]
3012 ? S< 0:00 [ext4-rsv-conver]
3013 ? S 0:00 [jbd2/dm-10-8]
3014 ? S< 0:00 [ext4-rsv-conver]
3116 ? S 0:00 [kworker/2:0]
3194 ? S 0:00 [kworker/0:2]
3395 ? Ss 0:00 /sbin/lvmetad -p /run/lvmetad.pid
3399 ? S 0:00 [kworker/1:2]
3421 ? S 0:00 supervising syslog-ng
3422 ? Ss 0:42 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng
3602 ? Ss 0:01 dhcpcd -t 60 -L -m 2 enp3s0
3648 ? Ss 0:01 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
3688 ? Ss 0:03 /usr/sbin/crond
3712 ? Sl 7:40 /usr/bin/python2.7 /usr/bin/fail2ban-server -b -s /ru
3747 ? Ss 0:03 SCREEN -D -m -S irc irssi
3764 ? S 0:00 [kworker/2:1]
3775 ? Ss 0:00 mdadm --monitor --scan --daemonise --pid-file /var/ru
3784 ? S 0:00 [kworker/1:1]
3817 pts/0 Ssl+ 1:49 irssi
3820 ? Ss 0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/n
3821 ? S 0:00 nginx: worker process
3847 ? Ss 0:00 /usr/bin/k5start -b -p /var/run/nss-k5start.pid -f /e
3866 ? Ss 0:00 /usr/sbin/bosserver -nofork
3871 ? S 0:00 [kworker/3:2]
3885 ? S 0:05 /usr/libexec/openafs/buserver
3886 ? S 0:05 /usr/libexec/openafs/ptserver
3887 ? S 0:05 /usr/libexec/openafs/vlserver
3888 ? S<l 0:00 /usr/libexec/openafs/fileserver
3889 ? Sl 0:00 /usr/libexec/openafs/volserver
3890 ? S 0:00 /usr/libexec/openafs/upserver -crypt /etc/openafs/ser
3921 ? S 0:12 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
3961 ? Ss 0:00 sshd: steve [priv]
3966 ? Sl 0:01 sshd: steve@pts/3
3971 pts/3 Ss 0:00 -bash
3991 ? S 0:00 /usr/sbin/smartd -p /run/smartd.pid
4013 ? Ss 0:01 /usr/sbin/sshd
4033 pts/1 S+ 0:00 screen -dr irc
4035 tty1 Ss+ 0:00 /sbin/agetty 38400 tty1 linux
4036 tty2 Ss+ 0:00 /sbin/agetty 38400 tty2 linux
4037 tty3 Ss+ 0:00 /sbin/agetty 38400 tty3 linux
4038 tty4 Ss+ 0:00 /sbin/agetty 38400 tty4 linux
4039 tty5 Ss+ 0:00 /sbin/agetty 38400 tty5 linux
4040 tty6 Ss+ 0:00 /sbin/agetty 38400 tty6 linux
4071 pts/3 S 0:00 sudo su -
4072 pts/3 S 0:00 su -
4073 pts/3 S 0:00 -su
4308 ? Ssl 0:29 /usr/lib64/openldap/slapd -u ldap -g ldap -f /etc/ope
4446 ? Ss 0:08 /usr/sbin/krb5kdc
5060 ? S 0:03 [kworker/3:1]
8657 ? Ss 0:00 sshd: steve [priv]
8661 ? Sl 0:00 sshd: steve@pts/5
8667 pts/5 Ss+ 0:00 -bash
12200 ? Ss 0:00 sshd: steve [priv]
12205 ? Sl 0:00 sshd: steve@pts/1
12210 pts/1 Ss 0:00 -bash
12225 ? S 0:00 [kworker/0:1]
27220 ? S 0:00 [kworker/u8:1]
30113 ? S 0:00 [kworker/u8:0]
30156 pts/3 R+ 0:00 ps ax

Old file tree removed:

hawking ~ # rm -rf /var/tmp/portage/net-dns/

No stale files:

hawking ~ # lsof | grep /var/tmp/portage/
hawking ~ #

And it still hangs with no stale files after it is killed (~10 sec delay between the kill and the lsof completing). If you want, I'll reboot the machine (it's a server; I'd rather not if I can avoid it), but I did that in my initial debugging and it didn't help.

Let's see the `lsof | grep /var/tmp/portage/` output while it's hung again, but this time without the stale processes showing up.
This will allow us to verify whether or not the expected ipc pipes are properly opened by the expected processes.

(In reply to Zac Medico from comment #13)
> Let's see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> but this time without the stale processes showing up. This will allow us to
> verify whether or not the expected ipc pipes are properly opened by the
> expected processes.

Sure:

hawking ~ # lsof | grep /var/tmp/portage/
misc-func 972 root cwd DIR 253,1 4096 151036 /var/tmp/portage/net-dns/c-ares-1.9.1/image
python2.7 1246 root 3uW REG 253,1 0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7 1246 root 6r FIFO 253,1 0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7 1249 root 3u REG 253,1 0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7 1249 root 6r FIFO 253,1 0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
emerge 15821 root 5uW REG 253,1 0 131103 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
emerge 15821 root 9r FIFO 253,1 0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
hawking ~ #

And ps ax:

hawking ~ # ps ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:05 init [3]
2 ? S 0:00 [kthreadd]
3 ? S 1:01 [ksoftirqd/0]
5 ? S< 0:00 [kworker/0:0H]
7 ? S 0:29 [rcu_sched]
8 ? S 0:00 [rcu_bh]
9 ? S 0:00 [migration/0]
10 ? S 0:00 [migration/1]
11 ? S 1:12 [ksoftirqd/1]
13 ? S< 0:00 [kworker/1:0H]
14 ? S 0:02 [migration/2]
15 ? S 0:51 [ksoftirqd/2]
17 ? S< 0:00 [kworker/2:0H]
18 ? S 0:02 [migration/3]
19 ? S 1:02 [ksoftirqd/3]
21 ? S< 0:00 [kworker/3:0H]
22 ? S< 0:00 [khelper]
23 ? S 0:00 [kdevtmpfs]
24 ? S< 0:00 [netns]
269 ? S< 0:00 [writeback]
272 ? S< 0:00 [bioset]
273 ? S< 0:00 [crypto]
275 ? S< 0:00 [kblockd]
376 ? S< 0:00 [ata_sff]
387 ? S 0:00 [khubd]
395 ? S< 0:00 [md]
532 ? S 0:11 [kswapd0]
601 ? S 0:00 [fsnotify_mark]
719 ? S 0:00 [scsi_eh_0]
720 ? S< 0:00 [scsi_tmf_0]
723 ? S 0:00 [scsi_eh_1]
724 ? S< 0:00 [scsi_tmf_1]
727 ? S 0:00 [scsi_eh_2]
728 ? S< 0:00 [scsi_tmf_2]
731 ? S 0:00 [scsi_eh_3]
732 ? S< 0:00 [scsi_tmf_3]
745 ? S 0:00 [scsi_eh_4]
746 ? S< 0:00 [scsi_tmf_4]
749 ? S 0:00 [scsi_eh_5]
750 ? S< 0:00 [scsi_tmf_5]
805 ? S< 0:00 [raid5wq]
810 ? S< 0:00 [dm_bufio_cache]
811 ? S< 0:00 [kmpathd]
812 ? S< 0:00 [kmpath_handlerd]
867 ? S< 0:00 [ipv6_addrconf]
887 ? S< 0:00 [deferwq]
971 pts/3 S+ 0:00 [net-dns/c-ares-1.9.1] sandbox /usr/lib64/portage/bin
972 pts/3 S+ 0:00 /bin/bash /usr/lib64/portage/bin/misc-functions.sh in
1246 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.
1249 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.
1267 pts/5 R+ 0:00 ps ax
2191 ? Ss 0:00 sshd: steve [priv]
2197 ? Sl 0:01 sshd: steve@pts/2
2202 pts/2 Ss+ 0:00 -bash
2438 ? S< 3:29 [kworker/3:1H]
2439 ? S< 4:07 [kworker/1:1H]
2445 ? S< 0:00 [bioset]
2447 ? S 192:34 [md127_raid5]
2454 ? S< 0:00 [bioset]
2455 ? S 1:53 [md126_raid1]
2457 ? S< 2:49 [kworker/2:1H]
2458 ? S< 3:21 [kworker/0:1H]
2467 ? S< 0:00 [kdmflush]
2468 ? S< 0:00 [bioset]
2472 ? S< 0:00 [kdmflush]
2473 ? S< 0:00 [bioset]
2476 ? S< 0:00 [kdmflush]
2477 ? S< 0:00 [bioset]
2481 ? S< 0:00 [kdmflush]
2482 ? S< 0:00 [ksnaphd]
2483 ? S< 0:00 [kcopyd]
2484 ? S< 0:00 [bioset]
2485 ? S< 0:00 [bioset]
2486 ? S< 0:00 [bioset]
2491 ? S< 0:00 [kdmflush]
2492 ? S< 0:00 [bioset]
2496 ? S< 0:00 [kdmflush]
2497 ? S< 0:00 [bioset]
2501 ? S< 0:00 [kdmflush]
2502 ? S< 0:00 [bioset]
2506 ? S< 0:00 [kdmflush]
2507 ? S< 0:00 [bioset]
2511 ? S< 0:00 [kdmflush]
2512 ? S< 0:00 [bioset]
2516 ? S< 0:00 [kdmflush]
2517 ? S< 0:00 [bioset]
2521 ? S< 0:00 [kdmflush]
2522 ? S< 0:00 [bioset]
2529 ? S 0:07 [jbd2/dm-1-8]
2530 ? S< 0:00 [ext4-rsv-conver]
2709 ? Ss 0:00 /lib/systemd/systemd-udevd --daemon
3007 ? S 0:00 [jbd2/md126-8]
3008 ? S< 0:00 [ext4-rsv-conver]
3009 ? S 0:02 [jbd2/dm-6-8]
3010 ? S< 0:00 [ext4-rsv-conver]
3011 ? S 0:02 [jbd2/dm-9-8]
3012 ? S< 0:00 [ext4-rsv-conver]
3013 ? S 0:00 [jbd2/dm-10-8]
3014 ? S< 0:00 [ext4-rsv-conver]
3116 ? S 0:00 [kworker/2:0]
3395 ? Ss 0:00 /sbin/lvmetad -p /run/lvmetad.pid
3399 ? S 0:00 [kworker/1:2]
3421 ? S 0:00 supervising syslog-ng
3422 ? Ss 0:43 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng
3602 ? Ss 0:01 dhcpcd -t 60 -L -m 2 enp3s0
3648 ? Ss 0:01 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
3688 ? Ss 0:03 /usr/sbin/crond
3712 ? Sl 7:43 /usr/bin/python2.7 /usr/bin/fail2ban-server -b -s /ru
3747 ? Ss 0:03 SCREEN -D -m -S irc irssi
3775 ? Ss 0:00 mdadm --monitor --scan --daemonise --pid-file /var/ru
3784 ? S 0:00 [kworker/1:1]
3817 pts/0 Ssl+ 1:49 irssi
3820 ? Ss 0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/n
3821 ? S 0:00 nginx: worker process
3847 ? Ss 0:00 /usr/bin/k5start -b -p /var/run/nss-k5start.pid -f /e
3866 ? Ss 0:00 /usr/sbin/bosserver -nofork
3885 ? S 0:05 /usr/libexec/openafs/buserver
3886 ? S 0:05 /usr/libexec/openafs/ptserver
3887 ? S 0:05 /usr/libexec/openafs/vlserver
3888 ? S<l 0:00 /usr/libexec/openafs/fileserver
3889 ? Sl 0:00 /usr/libexec/openafs/volserver
3890 ? S 0:00 /usr/libexec/openafs/upserver -crypt /etc/openafs/ser
3921 ? S 0:12 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
3961 ? Ss 0:00 sshd: steve [priv]
3966 ? Sl 0:01 sshd: steve@pts/3
3971 pts/3 Ss 0:00 -bash
3991 ? S 0:00 /usr/sbin/smartd -p /run/smartd.pid
4013 ? Ss 0:01 /usr/sbin/sshd
4033 pts/1 S+ 0:00 screen -dr irc
4035 tty1 Ss+ 0:00 /sbin/agetty 38400 tty1 linux
4036 tty2 Ss+ 0:00 /sbin/agetty 38400 tty2 linux
4037 tty3 Ss+ 0:00 /sbin/agetty 38400 tty3 linux
4038 tty4 Ss+ 0:00 /sbin/agetty 38400 tty4 linux
4039 tty5 Ss+ 0:00 /sbin/agetty 38400 tty5 linux
4040 tty6 Ss+ 0:00 /sbin/agetty 38400 tty6 linux
4071 pts/3 S 0:00 sudo su -
4072 pts/3 S 0:00 su -
4073 pts/3 S 0:00 -su
4308 ? Ssl 0:29 /usr/lib64/openldap/slapd -u ldap -g ldap -f /etc/ope
4446 ? Ss 0:08 /usr/sbin/krb5kdc
5060 ? S 0:03 [kworker/3:1]
8657 ? Ss 0:00 sshd: steve [priv]
8661 ? Sl 0:00 sshd: steve@pts/5
8667 pts/5 Ss 0:00 -bash
12200 ? Ss 0:00 sshd: steve [priv]
12205 ? Sl 0:00 sshd: steve@pts/1
12210 pts/1 Ss 0:00 -bash
12225 ? S 0:00 [kworker/0:1]
15760 ? S 0:00 [kworker/0:0]
15813 ? S 0:00 [kworker/2:1]
15814 ? S 0:00 [kworker/u8:0]
15821 pts/3 S+ 0:23 /usr/bin/python2.7 /usr/bin/emerge sssd
16249 pts/5 S 0:00 sudo su -
16444 pts/5 S 0:00 su -
16445 pts/5 S 0:00 -su
27220 ? S 0:00 [kworker/u8:1]
30171 ? S 0:00 [kworker/3:0]
32039 ? S 0:00 [kworker/2:2]

(In reply to Zac Medico from comment #13)
> Let's see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> but this time without the stale processes showing up. This will allow us to
> verify whether or not the expected ipc pipes are properly opened by the
> expected processes.

FYI, I maintain c-ares. I looked through the ebuild and build system, and I can't see anything unusual about it. The build system uses autotools, and configure.ac is pretty complex, with lots of (too many?) custom checks. The Makefile.am is straightforward, as is the ebuild. I've tested it with make -jN and I haven't hit a race.

So it's probably a bug in Portage IPC, but why c-ares triggers it is beyond me.

Created attachment 386198 [details]
All packages installed + use flags
No idea if this is relevant, but it probably can't hurt...
The OpenAFS package is from my personal overlay, but is based on the one in mainline portage (I changed the version number and the URIs). I can disable this overlay if that would be helpful.
(In reply to Anthony Basile from comment #15)
> FYI, I maintain c-ares. I looked through the ebuild and build system, and I
> can't see anything unusual about it.
>
> So it's probably a bug in Portage IPC, but why c-ares triggers it is beyond me.

Yeah, it's really puzzling. The ipc code is pretty simple, and appears to work perfectly on the vast majority of Gentoo systems out there. It will be interesting to find out what's causing this particular hiccup on this one user's server.

(In reply to Steven Presser from comment #14)
> Sure:
> hawking ~ # lsof | grep /var/tmp/portage/
> misc-func 972 root cwd DIR 253,1 4096 151036 /var/tmp/portage/net-dns/c-ares-1.9.1/image
> python2.7 1246 root 3uW REG 253,1 0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
> python2.7 1246 root 6r FIFO 253,1 0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
> python2.7 1249 root 3u REG 253,1 0 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
> python2.7 1249 root 6r FIFO 253,1 0t0 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
> emerge 15821 root 5uW REG 253,1 0 131103 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
> emerge 15821 root 9r FIFO 253,1 0t0 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in

The lsof output looks exactly as it should here, except that .ipc_in is not opened by the ebuild-ipc (python2.7) processes. This indicates that the ebuild-ipc open(write) call on .ipc_in is blocking.

Normally, it should not block, since the emerge process has already opened the read end of .ipc_in (confirmed in the lsof output). This looks like a kernel bug to me, since there's no valid reason for the open(write) call to block indefinitely here.

@Steve: Can you use `strace -p <pid>` to verify that the ebuild-ipc (python2.7) process is blocking on the open call? I'm sure that's where it's blocking, but it would be nice to prove it. Note that there are 2 ebuild-ipc processes, due to a fork, so you want to strace the process with the higher pid (assuming that the kernel's pid counter hasn't wrapped around just before the fork).

(In reply to Zac Medico from comment #18)
> @Steve: Can you use `strace -p <pid>` to verify that the ebuild-ipc
> (python2.7) process is blocking on the open call?

I hope this was what you meant:

hawking ~ # ps ax | grep ebuild-ipc
20518 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
20521 pts/3 S+ 0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
20523 pts/5 S+ 0:00 grep --colour=auto ebuild-ipc
hawking ~ # strace -p 20521
Process 20521 attached
futex(0x7fde7a4aa340, FUTEX_WAIT_PRIVATE, 2, NULL

Also, as I mentioned in my original report, this is a heavily stripped-down and modified kernel configuration. My suspicion is that I stripped something that ebuild-ipc depends on. I understand that this may make support harder, but it seems that the system ought not to let me disable required kernel flags.

(In reply to Steven Presser from comment #19)
> I hope this was what you meant:

Yes, this is perfect.

> hawking ~ # strace -p 20521
> Process 20521 attached
> futex(0x7fde7a4aa340, FUTEX_WAIT_PRIVATE, 2, NULL

It looks like python is deadlocking in a futex call, and a quick google search seems to indicate that this sort of deadlock is common when mixing threads and forks. However, ebuild-ipc uses only forks (no threads are used).

@kernel: Removing you from CC, since it now seems that a python deadlock may be the culprit. I'm thinking about writing a patch to make it use a thread instead of a fork if python threading is enabled. If it's a python bug involving forks, then using a thread may be a good workaround.

Is this problem also reproducible after switching to Python 3.3 or 3.4?

Created attachment 386218 [details, diff]
bin/ebuild-ipc.py: use threading if available
If threading is available then use a thread instead of a fork for writing to the fifo. This has the advantage of avoiding a possible python futex deadlock following fork.
@Steve: Please test if this patch fixes your problem.
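For illustration, the core idea of the patch, performing the potentially blocking FIFO open/write in a thread so the caller can enforce a timeout instead of hanging in a forked child, can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual Portage patch; the function name and timeout handling are invented for the example.

```python
# Minimal sketch (not the actual Portage patch): perform the FIFO write
# in a daemon thread so the caller can time out instead of blocking
# forever in open(), which stalls until a reader opens the FIFO.
import os
import tempfile
import threading

def fifo_write_with_timeout(fifo_path, data, timeout):
    """Return True if data was written to the FIFO before the timeout."""
    done = threading.Event()

    def writer():
        # open() for write blocks until a reader opens the FIFO;
        # doing this in a thread keeps the caller responsive.
        with open(fifo_path, 'wb') as f:
            f.write(data)
        done.set()

    t = threading.Thread(target=writer)
    t.daemon = True  # do not keep the process alive if the write hangs
    t.start()
    return done.wait(timeout)

# Demo: with no reader attached, the write times out instead of hanging.
fifo_dir = tempfile.mkdtemp()
lonely = os.path.join(fifo_dir, 'lonely_fifo')
os.mkfifo(lonely)
timed_out = not fifo_write_with_timeout(lonely, b'ping', 0.2)
```

With no reader on the FIFO, the call returns False once the timeout expires, whereas the fork-based version of the same write could block indefinitely.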
@Arfrever: Yes, the issue persists with Python 3.3.

@Zac: That seems to have unstuck it!

Interestingly, I've noticed that my nightly update script (https://github.com/spresse1/system-utilities/tree/master/gentoo-autoupdate, which should be up to date) is stalling during its update phase. Obviously, it's been getting unstuck the first time each day that I run `killall -9 emerge`, so I don't think this has had any effect on the behavior, but it is nonetheless interesting. It looks like it's stalling on app-shells/bash-4.2_p53. I will verify this (by undoing the patch) once I have sssd emerged, and report back.

Welp, glibc bailed on me, so there's another issue to go look at...
In the meantime, I can confirm that app-shells/bash-4.2_p53 also triggers this issue. I'd be happy to provide a build log if desired, but my suspicion is that nothing useful will show up. However, it is notable that they get stuck in the same place:
>>> Completed installing bash-4.2_p53 into /var/tmp/portage/app-shells/bash-4.2_p53/image/
strip: x86_64-pc-linux-gnu-strip --strip-unneeded -R .comment -R .GCC.command.line -R .note.gnu.gold-version
bin/bash
ecompressdir: bzip2 -9 /usr/share/info
ecompressdir: bzip2 -9 /usr/share/man
ecompressdir: bzip2 -9 /usr/share/doc
As I mentioned, the patch works and I have updated bash. I'm providing this in case it is somehow useful.
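As background for the fix that follows, the FIFO open() semantics that Zac reasons about above can be demonstrated directly in Python: per POSIX, opening a FIFO write-only blocks until some process has it open for reading, and the O_NONBLOCK variant fails immediately with ENXIO instead of hanging. This is a throwaway demo; the helper name is invented for the example.

```python
# Demo of POSIX FIFO semantics: a write-only open blocks until a reader
# exists; with O_NONBLOCK it fails immediately with ENXIO instead.
import errno
import os
import tempfile

def try_nonblocking_write_open(path):
    """Return the errno of a nonblocking write-only open, or None on success."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as e:
        return e.errno
    os.close(fd)
    return None

fifo = os.path.join(tempfile.mkdtemp(), 'demo_fifo')
os.mkfifo(fifo)

# No reader yet: a plain blocking open() would hang right here; the
# nonblocking open reports ENXIO instead.
status_without_reader = try_nonblocking_write_open(fifo)

# Once a reader holds the FIFO open, the same open() succeeds.
reader_fd = os.open(fifo, os.O_RDONLY | os.O_NONBLOCK)
status_with_reader = try_nonblocking_write_open(fifo)
os.close(reader_fd)
```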
(In reply to Steven Presser from comment #24)
> @Zac: That seems to have unstuck it!

Great, thanks for testing.

Created attachment 386272 [details, diff]
bin/ebuild-ipc.py: use threading if available

This updated patch is fixed to prevent ebuild-ipc processes from hanging around in some cases, like bug 345455. I'll be maintaining this patch in the following branch: https://github.com/zmedico/portage/tree/bug_524328

Created attachment 386316 [details, diff]
bin/ebuild-ipc.py: nonblocking fifo write

This updated patch uses a nonblocking write so that no thread or fork is necessary. I've tested it on Linux and FreeBSD, and it seems to work fine. Though I don't think we'll want to use it, the threading patch is still available in this branch: https://github.com/zmedico/portage/tree/bug_524328_threading

This is fixed in 2.2.14.
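The nonblocking-write approach can be sketched roughly like this (an illustrative approximation, not the actual attachment; the function name, retry interval, and deadline handling are invented): open the FIFO with O_NONBLOCK, retrying on ENXIO until a reader appears, then write, retrying on EAGAIN if the pipe buffer is full. No thread or fork is required, which sidesteps the futex deadlock entirely.

```python
# Sketch of a nonblocking FIFO write with an overall deadline
# (illustrative only, not Zac's actual patch).
import errno
import os
import time

def nonblocking_fifo_write(fifo_path, data, timeout, poll_interval=0.05):
    """Write data to a FIFO without a blocking open(); True on success."""
    deadline = time.monotonic() + timeout
    fd = None
    while fd is None:
        try:
            fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as e:
            # ENXIO: no reader has the FIFO open yet; retry until deadline.
            if e.errno != errno.ENXIO or time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval)
    try:
        while data:
            try:
                written = os.write(fd, data)
                data = data[written:]
            except OSError as e:
                # EAGAIN: the pipe buffer is full; wait for the reader.
                if e.errno != errno.EAGAIN or time.monotonic() >= deadline:
                    return False
                time.sleep(poll_interval)
        return True
    finally:
        os.close(fd)
```

The tradeoff versus the threading sketch is busy-polling on ENXIO/EAGAIN, in exchange for a single-threaded, single-process code path.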