Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!

Bug 524328

Summary: dev-lang/python-2.7.7: net-dns/c-ares-1.9.1 triggers ebuild-ipc timed out during write, due to python deadlock in futex call after fork (no threads used)
Product: Portage Development Reporter: Steven Presser <steve>
Component: UnclassifiedAssignee: Portage team <dev-portage>
Status: RESOLVED FIXED    
Severity: normal CC: blueness, python
Priority: Normal Keywords: InVCS, PATCH, TESTED
Version: unspecified   
Hardware: All   
OS: Linux   
See Also: https://bugs.gentoo.org/show_bug.cgi?id=334423
Whiteboard:
Package list:
Runtime testing required: ---
Bug Depends on:    
Bug Blocks: 484436    
Attachments: Log of emerge invocation which hangs
Kernel config of the affected machine
Kernel config of the UNaffected machine
All packages installed + use flags
bin/ebuild-ipc.py: use threading if available
bin/ebuild-ipc.py: use threading if available
bin/ebuild-ipc.py: nonblocking fifo write

Description Steven Presser 2014-10-03 03:04:34 UTC
Emerging net-dns/c-ares (as a dependency of sys-auth/sssd) on one machine, I receive the message "ebuild-ipc timed out during write after 15 seconds, retrying..." ad infinitum (with the number of seconds varying, of course).

I have not seen this issue with any other packages.

Puzzlingly, this happens only on one of two Gentoo servers - the other emerges sssd without issue.  The one which emerges successfully is a completely default kernel config by genkernel for sys-kernel/hardened-sources.  The one which hangs is a very custom config for sys-kernel/gentoo-sources.  I will be attaching both kernels' .config.

This bears a passing resemblance to #337764, but the workaround listed in that bug is not effective.  (rm -rf /var/tmp/portage/*, more or less).

The ipc use flag is forced on for portage in the profile the failing server uses, so I don't think disabling it is an option.

Frankly, I don't know enough about Portage to dig into this more and see why it breaks.  I am however very happy to take instruction to help.

I am attaching a full log of an attempt to build (it is too long for here).

Reproducible: Always

Steps to Reproduce:
1. (Unknown pre-requisites)
2. Attempt to emerge net-dns/c-ares-1.9.1
3. Watch portage hang with the message "ebuild-ipc timed out during write after 15 seconds, retrying..."



emerge --info:
Portage 2.2.8-r2 (default/linux/amd64/13.0, gcc-4.7.3, glibc-2.19-r1, 3.14.14-gentoo x86_64)
=================================================================
System uname: Linux-3.14.14-gentoo-x86_64-Intel-R-_Atom-TM-_CPU_D525_@_1.80GHz-with-gentoo-2.2
KiB Mem:     4041400 total,   3540348 free
KiB Swap:    4194288 total,   4194288 free
Timestamp of tree: Thu, 02 Oct 2014 06:15:01 +0000
ld GNU ld (Gentoo 2.23.2 p1.0) 2.23.2
app-shells/bash:          4.2_p50
dev-lang/python:          2.7.7, 3.3.5-r1
dev-util/pkgconfig:       0.28-r1
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.12.4
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.69
sys-devel/automake:       1.12.6, 1.13.4
sys-devel/binutils:       2.23.2
sys-devel/gcc:            4.7.3-r1
sys-devel/gcc-config:     1.7.3
sys-devel/libtool:        2.4.2-r1
sys-devel/make:           3.82-r4
sys-kernel/linux-headers: 3.13 (virtual/os-headers)
sys-libs/glibc:           2.19-r1
Repositories: gentoo spresser-generic
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://lug.mtu.edu/gentoo/ rsync://mirrors.rit.edu/gentoo/ ftp://mirrors.rit.edu/gentoo/ http://mirrors.rit.edu/gentoo/"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"
USE="acl afs amd64 bash-completion berkdb bindist bzip2 cli cracklib crypt cxx dri fortran gdbm iconv ipv6 kerberos mmx modules multilib ncurses nls nptl openmp pam pcre readline session sse sse2 sse3 sse4_1 ssl ssse3 symlink tcpd unicode zlib" ABI_X86="64" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ublox ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-5" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7 python3_3" RUBY_TARGETS="ruby19 ruby20" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, MAKEOPTS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON
Comment 1 Steven Presser 2014-10-03 03:05:16 UTC
Created attachment 385966 [details]
Log of emerge invocation which hangs
Comment 2 Steven Presser 2014-10-03 03:07:10 UTC
Created attachment 385968 [details]
Kernel config of the affected machine
Comment 3 Steven Presser 2014-10-03 03:07:40 UTC
Created attachment 385970 [details]
Kernel config of the UNaffected machine
Comment 4 Zac Medico gentoo-dev 2014-10-06 06:37:47 UTC
While ebuild-ipc is timing out, run `lsof | grep /var/tmp/portage` to see what's going on with the ipc pipes.
Comment 5 Steven Presser 2014-10-07 12:00:01 UTC
steve@hawking ~ $ sudo lsof | grep /var/tmp/portage
Password: 
misc-func  1407          root  cwd       DIR              253,1      4096     151035 /var/tmp/portage/net-dns/c-ares-1.9.1/image
python2.7  1681          root    3uW     REG              253,1         0     187388 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1681          root    6r     FIFO              253,1       0t0     131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7  1684          root    3u      REG              253,1         0     187388 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1684          root    6r     FIFO              253,1       0t0     131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7  2479          root    3u      REG              253,1         0      20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7  7955          root    3u      REG              253,1         0     200193 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7  7955          root    6r     FIFO              253,1       0t0     198881 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
python2.7  9244          root    3u      REG              253,1         0      20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7 12360          root    3u      REG              253,1         0     151351 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 12360          root    6r     FIFO              253,1       0t0     131111 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
emerge    13313          root    5uW     REG              253,1         0     151655 /var/tmp/portage/app-shells/.bash-4.2_p51.portage_lockfile
emerge    13313          root    9r     FIFO              253,1       0t0     155594 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_in
python2.7 14348          root    3u      REG              253,1         0      20295 /var/tmp/portage/.app-shells.portage_lockfile
emerge    15981          root    4r     FIFO              253,1       0t0     131112 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
emerge    15981          root    8uW     REG              253,1         0     131100 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
python2.7 22354          root    3u      REG              253,1         0     200189 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 22354          root    6r     FIFO              253,1       0t0     198868 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
misc-func 25166          root  cwd       DIR              253,1      4096     187152 /var/tmp/portage/app-shells/bash-4.2_p51/image
python2.7 25494          root    3uW     REG              253,1         0     187305 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_lock
python2.7 25494          root    6r     FIFO              253,1       0t0     155595 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_out
python2.7 25497          root    3u      REG              253,1         0     187305 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_lock
python2.7 25497          root    6r     FIFO              253,1       0t0     155595 /var/tmp/portage/app-shells/bash-4.2_p51/.ipc_out
python2.7 26652          root    3u      REG              253,1         0     151348 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
python2.7 26652          root    6r     FIFO              253,1       0t0     131109 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
emerge    29980          root    4uW     REG              253,1         0      20295 /var/tmp/portage/.app-shells.portage_lockfile
python2.7 29986          root    3u      REG              253,1         0     151655 /var/tmp/portage/app-shells/.bash-4.2_p51.portage_lockfile
Comment 6 Zac Medico gentoo-dev 2014-10-07 19:17:31 UTC
(In reply to Steven Presser from comment #5) 
> 200193 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7  7955          root    6r     FIFO              253,1       0t0   
> 198881 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> python2.7  9244          root    3u      REG              253,1         0    
> 151351 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 12360          root    6r     FIFO              253,1       0t0   
> 131111 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> emerge    15981          root    4r     FIFO              253,1       0t0   
> 131112 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
> emerge    15981          root    8uW     REG              253,1         0   
> 131100 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
> python2.7 22354          root    3u      REG              253,1         0   
> 200189 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 22354          root    6r     FIFO              253,1       0t0   
> 198868 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)
> python2.7 26652          root    3u      REG              253,1         0   
> 151348 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock (deleted)
> python2.7 26652          root    6r     FIFO              253,1       0t0   
> 131109 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out (deleted)

This clearly shows that something has deleted the .ipc_out and .ipc_lock files after they were opened, which explains the timeout.

Maybe you have some script in /etc/portage/bashrc or /etc/portage/env that could have deleted those named pipes? Usually these kind of issues are due to bashrc doing something bad.
Comment 7 Steven Presser 2014-10-07 20:06:03 UTC
Neither of those paths exists on my install.  It's a brand new install; no modifications made to anything other than package.accept_keywords, and package.use.

root@hawking ~ # ls -R /etc/portage/
/etc/portage/:
bin  make.conf	make.conf.catalyst  make.profile  package.accept_keywords  package.use	postsync.d  profile  savedconfig

/etc/portage/bin:
post_sync

/etc/portage/postsync.d:
demerge-record	q-reinitialize

/etc/portage/profile:

/etc/portage/savedconfig:
sys-apps

/etc/portage/savedconfig/sys-apps:
busybox-1.21.0
Comment 8 Zac Medico gentoo-dev 2014-10-07 21:09:09 UTC
Is net-dns/c-ares the only package that you experience this problem with?

After you've killed emerge, does lsof still report processes with open files in /var/tmp/portage? Maybe those stale processes interfere somehow. You should kill them all.
Comment 9 Steven Presser 2014-10-07 22:05:43 UTC
(In reply to Zac Medico from comment #8)
> Is net-dns/c-ares the only package that you experience this problem with?
> 
> After you've killed emerge, does lsof still report processes with open files
> in /var/tmp/portage? Maybe those stale processes interfere somehow. You
> should kill them all.

So far, yes, c-ares is the only package I have seen this issue with.  For kicks I re-emerged grub (I made a use flag update) and it worked.

Depending on how quickly I run lsof, I get different results.  However, I think they all go to zero within a minute.
Comment 10 Zac Medico gentoo-dev 2014-10-07 22:28:48 UTC
(In reply to Steven Presser from comment #9)
> Depending on how quickly I run lsof, I get different results.  However, I
> think they all go to zero within a minute.

Well, looking again at comment #5, you can see that /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out is deleted in some cases and not in others. This means that you have some stale processes that have probably been hanging around for much longer than a minute. Please make sure you kill all those processes, and then try to re-merge c-ares.
Comment 11 Zac Medico gentoo-dev 2014-10-07 22:40:56 UTC
(In reply to Zac Medico from comment #10)
> Well, looking again at comment #5, you can see that
> /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out is deleted in some cases and
> not in others. This means that you have some stale processes that have
> probably been hanging around for much longer than a minute. Please make sure
> you kill all those processes, and then try to re-merge c-ares.

And if ebuild-ipc still times out, lets look at the lsof output again, to see if there are still any ipc files that have been mysteriously deleted.
Comment 12 Steven Presser 2014-10-07 22:42:59 UTC
Nothing relevant running:

hawking ~ # ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:05 init [3]   
    2 ?        S      0:00 [kthreadd]
    3 ?        S      1:01 [ksoftirqd/0]
    5 ?        S<     0:00 [kworker/0:0H]
    7 ?        S      0:26 [rcu_sched]
    8 ?        S      0:00 [rcu_bh]
    9 ?        S      0:00 [migration/0]
   10 ?        S      0:00 [migration/1]
   11 ?        S      1:12 [ksoftirqd/1]
   13 ?        S<     0:00 [kworker/1:0H]
   14 ?        S      0:02 [migration/2]
   15 ?        S      0:51 [ksoftirqd/2]
   17 ?        S<     0:00 [kworker/2:0H]
   18 ?        S      0:02 [migration/3]
   19 ?        S      1:02 [ksoftirqd/3]
   21 ?        S<     0:00 [kworker/3:0H]
   22 ?        S<     0:00 [khelper]
   23 ?        S      0:00 [kdevtmpfs]
   24 ?        S<     0:00 [netns]
  269 ?        S<     0:00 [writeback]
  272 ?        S<     0:00 [bioset]
  273 ?        S<     0:00 [crypto]
  275 ?        S<     0:00 [kblockd]
  376 ?        S<     0:00 [ata_sff]
  387 ?        S      0:00 [khubd]
  395 ?        S<     0:00 [md]
  532 ?        S      0:11 [kswapd0]
  601 ?        S      0:00 [fsnotify_mark]
  719 ?        S      0:00 [scsi_eh_0]
  720 ?        S<     0:00 [scsi_tmf_0]
  723 ?        S      0:00 [scsi_eh_1]
  724 ?        S<     0:00 [scsi_tmf_1]
  727 ?        S      0:00 [scsi_eh_2]
  728 ?        S<     0:00 [scsi_tmf_2]
  731 ?        S      0:00 [scsi_eh_3]
  732 ?        S<     0:00 [scsi_tmf_3]
  745 ?        S      0:00 [scsi_eh_4]
  746 ?        S<     0:00 [scsi_tmf_4]
  749 ?        S      0:00 [scsi_eh_5]
  750 ?        S<     0:00 [scsi_tmf_5]
  805 ?        S<     0:00 [raid5wq]
  810 ?        S<     0:00 [dm_bufio_cache]
  811 ?        S<     0:00 [kmpathd]
  812 ?        S<     0:00 [kmpath_handlerd]
  867 ?        S<     0:00 [ipv6_addrconf]
  887 ?        S<     0:00 [deferwq]
 2191 ?        Ss     0:00 sshd: steve [priv]
 2197 ?        Sl     0:01 sshd: steve@pts/2
 2202 pts/2    Ss+    0:00 -bash
 2438 ?        S<     3:29 [kworker/3:1H]
 2439 ?        S<     4:07 [kworker/1:1H]
 2445 ?        S<     0:00 [bioset]
 2447 ?        S    192:33 [md127_raid5]
 2454 ?        S<     0:00 [bioset]
 2455 ?        S      1:53 [md126_raid1]
 2457 ?        S<     2:49 [kworker/2:1H]
 2458 ?        S<     3:21 [kworker/0:1H]
 2467 ?        S<     0:00 [kdmflush]
 2468 ?        S<     0:00 [bioset]
 2472 ?        S<     0:00 [kdmflush]
 2473 ?        S<     0:00 [bioset]
 2476 ?        S<     0:00 [kdmflush]
 2477 ?        S<     0:00 [bioset]
 2481 ?        S<     0:00 [kdmflush]
 2482 ?        S<     0:00 [ksnaphd]
 2483 ?        S<     0:00 [kcopyd]
 2484 ?        S<     0:00 [bioset]
 2485 ?        S<     0:00 [bioset]
 2486 ?        S<     0:00 [bioset]
 2491 ?        S<     0:00 [kdmflush]
 2492 ?        S<     0:00 [bioset]
 2496 ?        S<     0:00 [kdmflush]
 2497 ?        S<     0:00 [bioset]
 2501 ?        S<     0:00 [kdmflush]
 2502 ?        S<     0:00 [bioset]
 2506 ?        S<     0:00 [kdmflush]
 2507 ?        S<     0:00 [bioset]
 2511 ?        S<     0:00 [kdmflush]
 2512 ?        S<     0:00 [bioset]
 2516 ?        S<     0:00 [kdmflush]
 2517 ?        S<     0:00 [bioset]
 2521 ?        S<     0:00 [kdmflush]
 2522 ?        S<     0:00 [bioset]
 2529 ?        S      0:07 [jbd2/dm-1-8]
 2530 ?        S<     0:00 [ext4-rsv-conver]
 2709 ?        Ss     0:00 /lib/systemd/systemd-udevd --daemon
 3007 ?        S      0:00 [jbd2/md126-8]
 3008 ?        S<     0:00 [ext4-rsv-conver]
 3009 ?        S      0:02 [jbd2/dm-6-8]
 3010 ?        S<     0:00 [ext4-rsv-conver]
 3011 ?        S      0:02 [jbd2/dm-9-8]
 3012 ?        S<     0:00 [ext4-rsv-conver]
 3013 ?        S      0:00 [jbd2/dm-10-8]
 3014 ?        S<     0:00 [ext4-rsv-conver]
 3116 ?        S      0:00 [kworker/2:0]
 3194 ?        S      0:00 [kworker/0:2]
 3395 ?        Ss     0:00 /sbin/lvmetad -p /run/lvmetad.pid
 3399 ?        S      0:00 [kworker/1:2]
 3421 ?        S      0:00 supervising syslog-ng                                
 3422 ?        Ss     0:42 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng
 3602 ?        Ss     0:01 dhcpcd -t 60 -L -m 2 enp3s0
 3648 ?        Ss     0:01 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3688 ?        Ss     0:03 /usr/sbin/crond
 3712 ?        Sl     7:40 /usr/bin/python2.7 /usr/bin/fail2ban-server -b -s /ru
 3747 ?        Ss     0:03 SCREEN -D -m -S irc irssi
 3764 ?        S      0:00 [kworker/2:1]
 3775 ?        Ss     0:00 mdadm --monitor --scan --daemonise --pid-file /var/ru
 3784 ?        S      0:00 [kworker/1:1]
 3817 pts/0    Ssl+   1:49 irssi
 3820 ?        Ss     0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/n
 3821 ?        S      0:00 nginx: worker process                   
 3847 ?        Ss     0:00 /usr/bin/k5start -b -p /var/run/nss-k5start.pid -f /e
 3866 ?        Ss     0:00 /usr/sbin/bosserver -nofork
 3871 ?        S      0:00 [kworker/3:2]
 3885 ?        S      0:05 /usr/libexec/openafs/buserver
 3886 ?        S      0:05 /usr/libexec/openafs/ptserver
 3887 ?        S      0:05 /usr/libexec/openafs/vlserver
 3888 ?        S<l    0:00 /usr/libexec/openafs/fileserver
 3889 ?        Sl     0:00 /usr/libexec/openafs/volserver
 3890 ?        S      0:00 /usr/libexec/openafs/upserver -crypt /etc/openafs/ser
 3921 ?        S      0:12 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3961 ?        Ss     0:00 sshd: steve [priv]
 3966 ?        Sl     0:01 sshd: steve@pts/3
 3971 pts/3    Ss     0:00 -bash
 3991 ?        S      0:00 /usr/sbin/smartd -p /run/smartd.pid
 4013 ?        Ss     0:01 /usr/sbin/sshd
 4033 pts/1    S+     0:00 screen -dr irc
 4035 tty1     Ss+    0:00 /sbin/agetty 38400 tty1 linux
 4036 tty2     Ss+    0:00 /sbin/agetty 38400 tty2 linux
 4037 tty3     Ss+    0:00 /sbin/agetty 38400 tty3 linux
 4038 tty4     Ss+    0:00 /sbin/agetty 38400 tty4 linux
 4039 tty5     Ss+    0:00 /sbin/agetty 38400 tty5 linux
 4040 tty6     Ss+    0:00 /sbin/agetty 38400 tty6 linux
 4071 pts/3    S      0:00 sudo su -
 4072 pts/3    S      0:00 su -
 4073 pts/3    S      0:00 -su
 4308 ?        Ssl    0:29 /usr/lib64/openldap/slapd -u ldap -g ldap -f /etc/ope
 4446 ?        Ss     0:08 /usr/sbin/krb5kdc
 5060 ?        S      0:03 [kworker/3:1]
 8657 ?        Ss     0:00 sshd: steve [priv]
 8661 ?        Sl     0:00 sshd: steve@pts/5
 8667 pts/5    Ss+    0:00 -bash
12200 ?        Ss     0:00 sshd: steve [priv]
12205 ?        Sl     0:00 sshd: steve@pts/1
12210 pts/1    Ss     0:00 -bash
12225 ?        S      0:00 [kworker/0:1]
27220 ?        S      0:00 [kworker/u8:1]
30113 ?        S      0:00 [kworker/u8:0]
30156 pts/3    R+     0:00 ps ax

Old file tree removed:
hawking ~ # rm -rf /var/tmp/portage/net-dns/

No stale files:
hawking ~ # lsof | grep /var/tmp/portage/
hawking ~ #

And it still hangs with no stale files after it is killed (~10sec delay between the kill and the lsof completing).  If you want, I'll reboot the machine (it's a server; I'd rather not if I can avoid it) but I did that in my initial debugging and it didn't help.
Comment 13 Zac Medico gentoo-dev 2014-10-07 23:02:29 UTC
Lets see the `lsof | grep /var/tmp/portage/` output while it's hung again, but this time without the stale processes showing up. This will allow us to verify whether or not the expected ipc pipes are properly opened by the expected processes.
Comment 14 Steven Presser 2014-10-07 23:10:51 UTC
(In reply to Zac Medico from comment #13)
> Lets see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> but this time without the stale processes showing up. This will allow us to
> verify whether or not the expected ipc pipes are properly opened by the
> expected processes.

Sure:
hawking ~ # lsof | grep /var/tmp/portage/
misc-func   972          root  cwd       DIR              253,1      4096     151036 /var/tmp/portage/net-dns/c-ares-1.9.1/image
python2.7  1246          root    3uW     REG              253,1         0     185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1246          root    6r     FIFO              253,1       0t0     131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
python2.7  1249          root    3u      REG              253,1         0     185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
python2.7  1249          root    6r     FIFO              253,1       0t0     131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
emerge    15821          root    5uW     REG              253,1         0     131103 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
emerge    15821          root    9r     FIFO              253,1       0t0     131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in
hawking ~ # 

And ps ax:
hawking ~ # ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:05 init [3]   
    2 ?        S      0:00 [kthreadd]
    3 ?        S      1:01 [ksoftirqd/0]
    5 ?        S<     0:00 [kworker/0:0H]
    7 ?        S      0:29 [rcu_sched]
    8 ?        S      0:00 [rcu_bh]
    9 ?        S      0:00 [migration/0]
   10 ?        S      0:00 [migration/1]
   11 ?        S      1:12 [ksoftirqd/1]
   13 ?        S<     0:00 [kworker/1:0H]
   14 ?        S      0:02 [migration/2]
   15 ?        S      0:51 [ksoftirqd/2]
   17 ?        S<     0:00 [kworker/2:0H]
   18 ?        S      0:02 [migration/3]
   19 ?        S      1:02 [ksoftirqd/3]
   21 ?        S<     0:00 [kworker/3:0H]
   22 ?        S<     0:00 [khelper]
   23 ?        S      0:00 [kdevtmpfs]
   24 ?        S<     0:00 [netns]
  269 ?        S<     0:00 [writeback]
  272 ?        S<     0:00 [bioset]
  273 ?        S<     0:00 [crypto]
  275 ?        S<     0:00 [kblockd]
  376 ?        S<     0:00 [ata_sff]
  387 ?        S      0:00 [khubd]
  395 ?        S<     0:00 [md]
  532 ?        S      0:11 [kswapd0]
  601 ?        S      0:00 [fsnotify_mark]
  719 ?        S      0:00 [scsi_eh_0]
  720 ?        S<     0:00 [scsi_tmf_0]
  723 ?        S      0:00 [scsi_eh_1]
  724 ?        S<     0:00 [scsi_tmf_1]
  727 ?        S      0:00 [scsi_eh_2]
  728 ?        S<     0:00 [scsi_tmf_2]
  731 ?        S      0:00 [scsi_eh_3]
  732 ?        S<     0:00 [scsi_tmf_3]
  745 ?        S      0:00 [scsi_eh_4]
  746 ?        S<     0:00 [scsi_tmf_4]
  749 ?        S      0:00 [scsi_eh_5]
  750 ?        S<     0:00 [scsi_tmf_5]
  805 ?        S<     0:00 [raid5wq]
  810 ?        S<     0:00 [dm_bufio_cache]
  811 ?        S<     0:00 [kmpathd]
  812 ?        S<     0:00 [kmpath_handlerd]
  867 ?        S<     0:00 [ipv6_addrconf]
  887 ?        S<     0:00 [deferwq]
  971 pts/3    S+     0:00 [net-dns/c-ares-1.9.1] sandbox /usr/lib64/portage/bin
  972 pts/3    S+     0:00 /bin/bash /usr/lib64/portage/bin/misc-functions.sh in
 1246 pts/3    S+     0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.
 1249 pts/3    S+     0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.
 1267 pts/5    R+     0:00 ps ax
 2191 ?        Ss     0:00 sshd: steve [priv]
 2197 ?        Sl     0:01 sshd: steve@pts/2
 2202 pts/2    Ss+    0:00 -bash
 2438 ?        S<     3:29 [kworker/3:1H]
 2439 ?        S<     4:07 [kworker/1:1H]
 2445 ?        S<     0:00 [bioset]
 2447 ?        S    192:34 [md127_raid5]
 2454 ?        S<     0:00 [bioset]
 2455 ?        S      1:53 [md126_raid1]
 2457 ?        S<     2:49 [kworker/2:1H]
 2458 ?        S<     3:21 [kworker/0:1H]
 2467 ?        S<     0:00 [kdmflush]
 2468 ?        S<     0:00 [bioset]
 2472 ?        S<     0:00 [kdmflush]
 2473 ?        S<     0:00 [bioset]
 2476 ?        S<     0:00 [kdmflush]
 2477 ?        S<     0:00 [bioset]
 2481 ?        S<     0:00 [kdmflush]
 2482 ?        S<     0:00 [ksnaphd]
 2483 ?        S<     0:00 [kcopyd]
 2484 ?        S<     0:00 [bioset]
 2485 ?        S<     0:00 [bioset]
 2486 ?        S<     0:00 [bioset]
 2491 ?        S<     0:00 [kdmflush]
 2492 ?        S<     0:00 [bioset]
 2496 ?        S<     0:00 [kdmflush]
 2497 ?        S<     0:00 [bioset]
 2501 ?        S<     0:00 [kdmflush]
 2502 ?        S<     0:00 [bioset]
 2506 ?        S<     0:00 [kdmflush]
 2507 ?        S<     0:00 [bioset]
 2511 ?        S<     0:00 [kdmflush]
 2512 ?        S<     0:00 [bioset]
 2516 ?        S<     0:00 [kdmflush]
 2517 ?        S<     0:00 [bioset]
 2521 ?        S<     0:00 [kdmflush]
 2522 ?        S<     0:00 [bioset]
 2529 ?        S      0:07 [jbd2/dm-1-8]
 2530 ?        S<     0:00 [ext4-rsv-conver]
 2709 ?        Ss     0:00 /lib/systemd/systemd-udevd --daemon
 3007 ?        S      0:00 [jbd2/md126-8]
 3008 ?        S<     0:00 [ext4-rsv-conver]
 3009 ?        S      0:02 [jbd2/dm-6-8]
 3010 ?        S<     0:00 [ext4-rsv-conver]
 3011 ?        S      0:02 [jbd2/dm-9-8]
 3012 ?        S<     0:00 [ext4-rsv-conver]
 3013 ?        S      0:00 [jbd2/dm-10-8]
 3014 ?        S<     0:00 [ext4-rsv-conver]
 3116 ?        S      0:00 [kworker/2:0]
 3395 ?        Ss     0:00 /sbin/lvmetad -p /run/lvmetad.pid
 3399 ?        S      0:00 [kworker/1:2]
 3421 ?        S      0:00 supervising syslog-ng                                
 3422 ?        Ss     0:43 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng
 3602 ?        Ss     0:01 dhcpcd -t 60 -L -m 2 enp3s0
 3648 ?        Ss     0:01 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3688 ?        Ss     0:03 /usr/sbin/crond
 3712 ?        Sl     7:43 /usr/bin/python2.7 /usr/bin/fail2ban-server -b -s /ru
 3747 ?        Ss     0:03 SCREEN -D -m -S irc irssi
 3775 ?        Ss     0:00 mdadm --monitor --scan --daemonise --pid-file /var/ru
 3784 ?        S      0:00 [kworker/1:1]
 3817 pts/0    Ssl+   1:49 irssi
 3820 ?        Ss     0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/n
 3821 ?        S      0:00 nginx: worker process                   
 3847 ?        Ss     0:00 /usr/bin/k5start -b -p /var/run/nss-k5start.pid -f /e
 3866 ?        Ss     0:00 /usr/sbin/bosserver -nofork
 3885 ?        S      0:05 /usr/libexec/openafs/buserver
 3886 ?        S      0:05 /usr/libexec/openafs/ptserver
 3887 ?        S      0:05 /usr/libexec/openafs/vlserver
 3888 ?        S<l    0:00 /usr/libexec/openafs/fileserver
 3889 ?        Sl     0:00 /usr/libexec/openafs/volserver
 3890 ?        S      0:00 /usr/libexec/openafs/upserver -crypt /etc/openafs/ser
 3921 ?        S      0:12 /usr/sbin/bitlbee -P /var/run/bitlbee/bitlbee.pid -F
 3961 ?        Ss     0:00 sshd: steve [priv]
 3966 ?        Sl     0:01 sshd: steve@pts/3
 3971 pts/3    Ss     0:00 -bash
 3991 ?        S      0:00 /usr/sbin/smartd -p /run/smartd.pid
 4013 ?        Ss     0:01 /usr/sbin/sshd
 4033 pts/1    S+     0:00 screen -dr irc
 4035 tty1     Ss+    0:00 /sbin/agetty 38400 tty1 linux
 4036 tty2     Ss+    0:00 /sbin/agetty 38400 tty2 linux
 4037 tty3     Ss+    0:00 /sbin/agetty 38400 tty3 linux
 4038 tty4     Ss+    0:00 /sbin/agetty 38400 tty4 linux
 4039 tty5     Ss+    0:00 /sbin/agetty 38400 tty5 linux
 4040 tty6     Ss+    0:00 /sbin/agetty 38400 tty6 linux
 4071 pts/3    S      0:00 sudo su -
 4072 pts/3    S      0:00 su -
 4073 pts/3    S      0:00 -su
 4308 ?        Ssl    0:29 /usr/lib64/openldap/slapd -u ldap -g ldap -f /etc/ope
 4446 ?        Ss     0:08 /usr/sbin/krb5kdc
 5060 ?        S      0:03 [kworker/3:1]
 8657 ?        Ss     0:00 sshd: steve [priv]
 8661 ?        Sl     0:00 sshd: steve@pts/5
 8667 pts/5    Ss     0:00 -bash
12200 ?        Ss     0:00 sshd: steve [priv]
12205 ?        Sl     0:00 sshd: steve@pts/1
12210 pts/1    Ss     0:00 -bash
12225 ?        S      0:00 [kworker/0:1]
15760 ?        S      0:00 [kworker/0:0]
15813 ?        S      0:00 [kworker/2:1]
15814 ?        S      0:00 [kworker/u8:0]
15821 pts/3    S+     0:23 /usr/bin/python2.7 /usr/bin/emerge sssd
16249 pts/5    S      0:00 sudo su -
16444 pts/5    S      0:00 su -
16445 pts/5    S      0:00 -su
27220 ?        S      0:00 [kworker/u8:1]
30171 ?        S      0:00 [kworker/3:0]
32039 ?        S      0:00 [kworker/2:2]
Comment 15 Anthony Basile gentoo-dev 2014-10-07 23:14:29 UTC
(In reply to Zac Medico from comment #13)
> Lets see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> but this time without the stale processes showing up. This will allow us to
> verify whether or not the expected ipc pipes are properly opened by the
> expected processes.

FYI, I maintain c-ares.  I looked through the ebuild and build system.  I can't see anything unusual about it.  The build system uses autotools and configure.ac is pretty complex with lots of (too many?) custom checks.  The Makefile.am is pretty straight forward as is the ebuild.  I've tested it with make -jN and I haven't hit a race.

So its probably a bug in portage ipc, but why c-ares triggers it is beyond me.
Comment 16 Steven Presser 2014-10-07 23:26:53 UTC
Created attachment 386198 [details]
All packages installed + use flags

No idea if this is relevant, but it probably can't hurt...
The OpenAFS package is from my personal overlay, but is based on the one in mainline portage (I changed the version number and the URIs).  I can disable this overlay if that would be helpful.
Comment 17 Zac Medico gentoo-dev 2014-10-08 02:01:18 UTC
(In reply to Anthony Basile from comment #15)
> FYI, I maintain c-ares.  I looked through the ebuild and build system.  I
> can't see anything unusual about it.  The build system uses autotools and
> configure.ac is pretty complex with lots of (too many?) custom checks.  The
> Makefile.am is pretty straight forward as is the ebuild.  I've tested it
> with make -jN and I haven't hit a race.
> 
> So its probably a bug in portage ipc, but why c-ares triggers it is beyond
> me.

Yeah, it's really puzzling. The ipc code is pretty simple, and appears to work perfectly on the vast majority of Gentoo systems out there. It will be interesting to find out what's causing this particular hiccup on this one user's server.

(In reply to Steven Presser from comment #14)
> (In reply to Zac Medico from comment #13)
> > Lets see the `lsof | grep /var/tmp/portage/` output while it's hung again,
> > but this time without the stale processes showing up. This will allow us to
> > verify whether or not the expected ipc pipes are properly opened by the
> > expected processes.
> 
> Sure:
> hawking ~ # lsof | grep /var/tmp/portage/
> misc-func   972          root  cwd       DIR              253,1      4096   
> 151036 /var/tmp/portage/net-dns/c-ares-1.9.1/image
> python2.7  1246          root    3uW     REG              253,1         0   
> 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
> python2.7  1246          root    6r     FIFO              253,1       0t0   
> 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
> python2.7  1249          root    3u      REG              253,1         0   
> 185311 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_lock
> python2.7  1249          root    6r     FIFO              253,1       0t0   
> 131114 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_out
> emerge    15821          root    5uW     REG              253,1         0   
> 131103 /var/tmp/portage/net-dns/.c-ares-1.9.1.portage_lockfile
> emerge    15821          root    9r     FIFO              253,1       0t0   
> 131113 /var/tmp/portage/net-dns/c-ares-1.9.1/.ipc_in

The lsof output looks exactly as it should here, except that .ipc_in is not opened by the ebuild-ipc (python2.7) processes. This indicates that the ebuild-ipc open(write) call on .ipc_in is blocking. Normally, it should not block, since the emerge process has already opened the read end of .ipc_in (confirmed in the lsof output).

This looks like a kernel bug to me, since there's no valid reason for the open(write) call to block indefinitely here.
Comment 18 Zac Medico gentoo-dev 2014-10-08 02:39:41 UTC
@Steve: Can you use `strace -p <pid>` to verify that the ebuild-ipc (python2.7) process is blocking on the open call? I'm sure that's where it's blocking, but it would be nice to prove it. Note that there are 2 ebuild-ipc processes, due to a fork, so you want to strace the process with the higher pid (assuming that the kernel's pid counter hasn't wrapped around just before the fork).
Comment 19 Steven Presser 2014-10-08 02:47:59 UTC
(In reply to Zac Medico from comment #18)
> @Steve: Can you use `strace -p <pid>` to verify that the ebuild-ipc
> (python2.7) process is blocking on the open call? I'm sure that's where it's
> blocking, but it would be nice to prove it. Note that there are 2 ebuild-ipc
> processes, due to a fork, so you want to strace the process with the higher
> pid (assuming that the kernel's pid counter hasn't wrapped around just
> before the fork).

I hope this was what you meant:

hawking ~ # ps ax | grep ebuild-ipc
20518 pts/3    S+     0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
20521 pts/3    S+     0:00 /usr/bin/python2.7 /usr/lib64/portage/bin/ebuild-ipc.py exit 0
20523 pts/5    S+     0:00 grep --colour=auto ebuild-ipc
hawking ~ # strace -p 20521
Process 20521 attached
futex(0x7fde7a4aa340, FUTEX_WAIT_PRIVATE, 2, NULL

Also, as I mentioned in my original report, this is a heavily stripped down and modified kernel configuration.  My suspicion is that I stripped something that emerge-ipc depends on.  I understand that this maybe makes support harder, but it seems that the system ought to not let me disable required kernel flags.
Comment 20 Zac Medico gentoo-dev 2014-10-08 03:29:21 UTC
(In reply to Steven Presser from comment #19)
> I hope this was what you meant:

Yes, this is perfect.

> hawking ~ # ps ax | grep ebuild-ipc
> 20518 pts/3    S+     0:00 /usr/bin/python2.7
> /usr/lib64/portage/bin/ebuild-ipc.py exit 0
> 20521 pts/3    S+     0:00 /usr/bin/python2.7
> /usr/lib64/portage/bin/ebuild-ipc.py exit 0
> 20523 pts/5    S+     0:00 grep --colour=auto ebuild-ipc
> hawking ~ # strace -p 20521
> Process 20521 attached
> futex(0x7fde7a4aa340, FUTEX_WAIT_PRIVATE, 2, NULL

It looks like python is deadlocking in a futex call, and a quick google search seems to indicate that this sort of deadlock is common when mixing threads and forks. However, ebuild-ipc uses only forks (no threads used).

@kernel: Removing you from CC since now it seems that a python deadlock may be the culprit.
Comment 21 Zac Medico gentoo-dev 2014-10-08 03:37:45 UTC
I'm thinking about writing a patch to make it use a thread instead of a fork if python threading is enabled. If it's a python bug involving forks, then using a thread may be a good workaround.
Comment 22 Arfrever Frehtes Taifersar Arahesis 2014-10-08 04:07:55 UTC
This problem is reproducible also after switching to Python 3.3 or 3.4?
Comment 23 Zac Medico gentoo-dev 2014-10-08 05:23:28 UTC
Created attachment 386218 [details, diff]
bin/ebuild-ipc.py: use threading if available

If threading is available then use a thread instead of a fork for writing to the fifo. This has the advantage of avoiding a possible python futex deadlock following fork.

@Steve: Please test if this patch fixes your problem.
Comment 24 Steven Presser 2014-10-08 21:43:10 UTC
@Arfrever: Yes, the issue persists with python 3.3

@Zac: That seems to have unstuck it!

Interestingly, I've noticed that my nightly update script (https://github.com/spresse1/system-utilities/tree/master/gentoo-autoupdate should be up to date) is stalling during its update phase.  Obviously, it's been getting unstuck the first time each day that I run killall-9 emerge, so I don't think this has had any effect on the behavior, but is non the less interesting.  It looks like its stalling on app-shells/bash-4.2_p53.  I will verify this (by undoing the patch) once I have sssd emerged and report back.
Comment 25 Steven Presser 2014-10-08 22:48:40 UTC
Welp, glibc bailed on me, so there's another issue to go look at...

In the meantime, I can conform that app-shells/bash-4.2_p53 also triggers this issue.  I'd be happy to provide a build log if desired, but my suspicion is nothing useful will show up.  However, it is notable that they get stuck in the same place:

>>> Completed installing bash-4.2_p53 into /var/tmp/portage/app-shells/bash-4.2_p53/image/

strip: x86_64-pc-linux-gnu-strip --strip-unneeded -R .comment -R .GCC.command.line -R .note.gnu.gold-version
   bin/bash
ecompressdir: bzip2 -9 /usr/share/info
ecompressdir: bzip2 -9 /usr/share/man
ecompressdir: bzip2 -9 /usr/share/doc


As I mentioned the patch works and I have updated bash.  I'm providing this in case it is somehow useful.
Comment 26 Zac Medico gentoo-dev 2014-10-08 23:14:12 UTC
(In reply to Steven Presser from comment #24)
> @Zac: That seems to have unstuck it!

Great, thanks for testing.
Comment 27 Zac Medico gentoo-dev 2014-10-09 04:12:12 UTC
Created attachment 386272 [details, diff]
bin/ebuild-ipc.py: use threading if available

This updated patch is fixed to prevent ebuild-ipc processes from hanging around in some cases like bug 345455.

I'll be maintaining this patch in the following branch:

	https://github.com/zmedico/portage/tree/bug_524328
Comment 28 Zac Medico gentoo-dev 2014-10-09 18:44:48 UTC
Created attachment 386316 [details, diff]
bin/ebuild-ipc.py: nonblocking fifo write

This updated patch uses nonblocking write so that no thread or fork is necessary. I've tested it on Linux and FreeBSD, and it seems to work fine.

Thought I don't think we'll want to use it, the threading patch is still available in this branch:

	https://github.com/zmedico/portage/tree/bug_524328_threading
Comment 30 Zac Medico gentoo-dev 2014-10-20 19:33:08 UTC
This is fixed in 2.2.14.