This issue is difficult to reproduce, but I've seen it observed in a couple of different contexts now. In one context, I was able to suppress the issue by triggering early evaluation of the lazy portage.data.userpriv_groups attribute. >>> Installing (2 of 343) dev-lang/go-1.6.1::gentoo >>> Jobs: 7 of 343 complete, 1 failed Load avg: 21.8, 17.5, 16.2 * Package: dev-libs/libaio-0.3.110 * Repository: gentoo * Maintainer: gokturk@binghamton.edu proxy-maint@gentoo.org * USE: abi_x86_64 amd64 elibc_glibc kernel_linux userland_GNU * FEATURES: preserve-libs sandbox splitdebug userpriv usersandbox >>> Unpacking source... >>> Unpacking libaio-0.3.110.tar.gz to /var/tmp/portage/dev-libs/libaio-0.3.110/work >>> Source unpacked in /var/tmp/portage/dev-libs/libaio-0.3.110/work >>> Preparing source in /var/tmp/portage/dev-libs/libaio-0.3.110/work/libaio-0.3.110 ... * Applying libaio-0.3.109-install.patch ... [ ok ] * Applying libaio-0.3.109-x32.patch ... [ ok ] * Applying libaio-0.3.109-testcase-8.patch ... [ ok ] * Applying libaio-0.3.110-cppflags.patch ... [ ok ] * Applying libaio-0.3.110-optional-werror.patch ... [ ok ] * Applying libaio-0.3.110-link-stdlib.patch ... [ ok ] * Will copy sources from /var/tmp/portage/dev-libs/libaio-0.3.110/work/libaio-0.3.110 * abi_x86_64.amd64: copying to /var/tmp/portage/dev-libs/libaio-0.3.110/work/libaio-0.3.110-abi_x86_64.amd64 >>> Source prepared. >>> Configuring source in /var/tmp/portage/dev-libs/libaio-0.3.110/work/libaio-0.3.110 ... * abi_x86_64.amd64: running multilib-minimal_abi_src_configure >>> Source configured. [Errno 9] Bad file descriptor: /usr/bin/sandbox /usr/lib/portage/python3.4/ebuild.sh compile Traceback (most recent call last): File "/usr/lib64/python3.4/subprocess.py", line 1418, in _execute_child part = _eintr_retry_call(os.read, errpipe_read, 50000) File "/usr/lib64/python3.4/subprocess.py", line 491, in _eintr_retry_call return func(*args) OSError: [Errno 9] Bad file descriptor During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib64/python3.4/site-packages/portage/process.py", line 317, in spawn unshare_net, unshare_ipc, cgroup) File "/usr/lib64/python3.4/site-packages/portage/process.py", line 504, in _exec if groups: File "/usr/lib64/python3.4/site-packages/portage/proxy/objectproxy.py", line 89, in __bool__ return bool(object.__getattribute__(self, '_get_target')()) File "/usr/lib64/python3.4/site-packages/portage/data.py", line 281, in _get_target return _get_global(object.__getattribute__(self, '_name')) File "/usr/lib64/python3.4/site-packages/portage/data.py", line 212, in _get_global stderr=subprocess.STDOUT) File "/usr/lib64/python3.4/subprocess.py", line 859, in __init__ restore_signals, start_new_session) File "/usr/lib64/python3.4/subprocess.py", line 1424, in _execute_child os.close(errpipe_read) OSError: [Errno 9] Bad file descriptor * The ebuild phase 'compile' has exited unexpectedly. This type of * behavior is known to be triggered by things such as failed variable * assignments (bug #190128) or bad substitution errors (bug #200313). * Normally, before exiting, bash should have displayed an error message * above. If bash did not produce an error message above, it's possible * that the ebuild has called `exit` when it should have called `die` * instead. This behavior may also be triggered by a corrupt bash binary or * a hardware problem such as memory or cpu malfunction. If the problem is * not reproducible or it appears to occur randomly, then it is likely to * be triggered by a hardware problem. If you suspect a hardware problem * then you should try some basic hardware diagnostics such as memtest. * Please do not report this as a bug unless it is consistently * reproducible and you are sure that your bash binary and hardware are * functioning properly.
I saw this again today with python-3.4.5. Tried to reproduce it by running this in a shell loop, but could not trigger it: python3.4 -c "import portage; portage.process.spawn(['/bin/true'], uid=portage.data.portage_uid, gid=portage.data.portage_gid, groups=portage.data.userpriv_groups)"
It looks like this is a python bug, which occurs because the subprocess _execute_child function calls os.close(errpipe_write) in the subprocess that is created with its _posixsubprocess.fork_exec() call. The _execute_child function really should check the pid to make sure it's in the correct process before it calls os.close(errpipe_write), much like portage's ForkProcess class checks the pid here: https://gitweb.gentoo.org/proj/portage.git/tree/pym/portage/util/_async/ForkProcess.py?h=portage-2.3.3#n56
The symptom appears close to this issue: http://bugs.python.org/issue16140 However, python-3.4.5 already has the associated patch for that issue: https://hg.python.org/cpython/rev/d51df72dadb7 I not sure that the fork issue described in comment #2 is possible, given that _exit(255) is called if exec fails here: https://hg.python.org/cpython/file/v3.4.5/Modules/_posixsubprocess.c#l689 Maybe after the for, it calls back into the python interpreter and triggers the finally block with the os.close(errpipe_read) somehow?
The fact that EBADF first occurs in _eintr_retry_call(os.read, errpipe_read, 50000), which is not in a finally block, rules out the fork idea from comment #2. So something must be closing the errpipe_read file descriptor prematurely.
It's probably very close to http://bugs.python.org/issue16140. If something closes the file descriptor that belongs to a file object, then that file descriptor can get reused for errpipe_read. Then the garbage collector can come along and close the file descriptor after it has been reused for errpipe_read: https://hg.python.org/cpython/file/v3.4.5/Modules/_io/iobase.c#l231 Whatever is responsible for closing a file descriptor belonging to a file object could be either in portage or a library that portage uses (including the standard library).
Any subprocess.Popen calls that use stdout=subprocess.PIPE and don't explicitly close the associated stdout file can trigger this when that stdout file eventually becomes garbage collected, since the subprocess _execute_child function closes c2pread here: https://hg.python.org/cpython/file/v3.4.5/Lib/subprocess.py#l1407 Meanwhile, c2pread is associated with self.stdout having closefd=True: https://hg.python.org/cpython/file/v3.4.5/Lib/subprocess.py#l840 So, it's crucial for portage to explicitly close all such stdout=subprocess.PIPE files before the garbage collector gets them.
The code flaw is present in all of the latest python versions, so I've created this upstream issue: http://bugs.python.org/issue29373
Actually, os.close(p2cread) closes a different file descriptor than c2pread, so my analysis is not correct. I suppose it could be some other code closing a file descriptor that belongs to a file object. I've closed the upstream python issue.
Regardless of the root cause of the bad file descriptor, it's clear that we don't want the lazy portage.data.userpriv_groups evaluation to happen after the fork (in the portage.process._exec function). Therefore, we should fix that first, and it will mitigate the problem by reducing the number of times that the lazy evaluation occurs.
Patch posted for review: https://archives.gentoo.org/gentoo-portage-dev/message/9fac1da4f92981a022b331a3008bc03a https://github.com/gentoo/portage/pull/98
This is in the master branch: https://gitweb.gentoo.org/proj/portage.git/commit/?id=ccf975296daec92d376c4989e5ffb2a6cdbe8a2d
Fixed in portage-2.3.5.