First of all, this is not directly a bug in emerge/portage, since it only happens when an NFS client is misconfigured, but it would be really nice if portage handled this problem a bit more gracefully than just crashing :)

Actually the problem is quite easy to reproduce. On my server I export a directory via nfs4, which other machines should use as the portage tempdir. In my /etc/idmapd.conf I set the Domain to "foo.bar.com". As long as all other machines have the same Domain in their /etc/idmapd.conf, emerge works well. But since it is also possible to mount nfs4 shares without setting the right domain, emerge then fails like this:

>>> Emerging (1 of 1) sys-apps/less-436
Traceback (most recent call last):
  File "/usr/bin/emerge", line 40, in <module>
    retval = _emerge.emerge_main()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 15814, in emerge_main
    myopts, myaction, myfiles, spinner)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 14872, in action_build
    retval = mergetask.merge()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 10875, in merge
    rval = self._merge()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 11176, in _merge
    self._main_loop()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 11304, in _main_loop
    while self._schedule():
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 9583, in _schedule
    return self._schedule_tasks()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 11335, in _schedule_tasks
    if q.schedule():
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 9452, in schedule
    task.start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 1841, in start
    self._start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 4002, in _start
    self._start_task(build, self._default_final_exit)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2159, in _start_task
    task.start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 1841, in start
    self._start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2763, in _start
    self._prefetch_exit(prefetcher)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2787, in _prefetch_exit
    self._start_task(fetcher, self._fetch_exit)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2159, in _start_task
    task.start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 1841, in start
    self._start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2554, in _start
    self._build_dir.lock()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2668, in lock
    mode=070, mask=0)
  File "//usr/lib/portage/pym/portage/util.py", line 1043, in ensure_dirs
    perms_modified = apply_permissions(dir_path, *args, **kwargs)
  File "//usr/lib/portage/pym/portage/util.py", line 743, in apply_permissions
    os.chown(filename, uid, gid)
OSError: [Errno 22] Invalid argument: '/var/tmp/portage'

If I set the Domain in idmapd.conf to the right one (the same as on the server), everything works fine :) It took me quite some time to find the problem. Maybe it is possible to produce nicer output in the future?

Reproducible: Always

Steps to Reproduce:
1. start /etc/init.d/rpc.idmapd with a wrong domain in /etc/idmapd.conf
2. mount the nfs4 share to /var/tmp/
3. emerge something

Actual Results:
emerge crashes with a nice traceback

Expected Results:
some better information about the problem.
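To make the request concrete, here is a minimal sketch of what a friendlier failure around the `os.chown` call could look like. It is not Portage code: `chown_with_hint` is a hypothetical helper, it is written for Python 3 (the traceback above comes from Python 2), and the hint text is only a guess, since EINVAL from chown does not prove that idmapd is the cause.

```python
import errno
import os


def chown_with_hint(path, uid, gid, chown=os.chown):
    """Call chown(path, uid, gid); on EINVAL, re-raise with a readable hint.

    Hypothetical helper for illustration only. The `chown` parameter exists
    so the function can be exercised without root privileges.
    """
    try:
        chown(path, uid, gid)
    except OSError as e:
        if e.errno != errno.EINVAL:
            # Any other failure (EPERM, ENOENT, ...) is passed through as-is.
            raise
        # Assumption: on NFSv4, EINVAL here may indicate an ID mapping problem.
        hint = (
            "%s: chown(uid=%s, gid=%s) failed with EINVAL. If this path is on "
            "NFSv4, check that Domain in /etc/idmapd.conf matches the server."
            % (path, uid, gid)
        )
        raise OSError(e.errno, hint) from e
```

The original exception stays attached as `__cause__`, so the full traceback is still available for debugging.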
Given the nature of the problem, I'm not sure how much better of an error message could be automatically generated. It's not like emerge would be able to know that nfs4 configuration was the root problem.
I don't know why I didn't find this bug earlier, but I think the problem is deeper or more subtle. I get the same error, but very inconsistently - maybe 10-20% of emerges. In my case, the trace always shows the error on a chown, but the "invalid argument" always refers to a file that does not exist - it appears to have already been renamed, for example, from CHOST.32255 to CHOST, by the time I look at it. In addition, I DO have the same domain specified for idmapd on both the client and the server.

In my case, some emerges will work fine the next time, but some will fail many times in a row before succeeding. Usually, however, doing "ebuild path/to/ebuild merge" will successfully complete the install.

I also get other errors apparently related to nfs4. sudo (any version I've tried) will always install without setuid if PORTAGE_TMPDIR is on nfs4, but with setuid if PORTAGE_TMPDIR is local. Bug 400679 is about problems with a file collision on /usr/share/info/dir when nfs4 is involved. Finally, ALL my emerges end with "rm: cannot remove `path/to/portage/tmpdir/portage/group/package/temp': Directory not empty" even though that directory is always empty by the time I look. (for any group and package)

What other information can I provide, or what troubleshooting can I do on my own?
(In reply to comment #2)
> Finally, ALL my emerges end with "rm: cannot remove
> `path/to/portage/tmpdir/portage/group/package/temp': Directory not empty" even
> though that directory is always empty by the time I look. (for any group and
> package)

Sounds like bug 364143.

> What other information can I provide, or what troubleshooting can I do on my
> own?

All of the issues that you're having seem to be rooted in various kinds of NFS misbehavior. Any time that NFS deviates from local file system behavior, it can cause all kinds of applications to fail. Some minor deviations, like the ESTALE behavior in bug 266211, have simple workarounds at the application level. More severe deviations will require fixes in NFS to make it behave more like a local file system, and you'll have to work with upstream NFS developers to make that happen.
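For readers wondering what an application-level workaround for ESTALE might look like, the following is a minimal sketch, assuming Python 3. It is not the fix from the referenced bug: `retry_on_estale` and its `retries` parameter are hypothetical names chosen for illustration, and it simply repeats an operation a few times when it fails with a stale NFS file handle.

```python
import errno


def retry_on_estale(func, *args, retries=3, **kwargs):
    """Call func(*args, **kwargs), retrying if it raises OSError with ESTALE.

    Hypothetical helper: up to `retries` extra attempts are made. Any other
    error, or ESTALE on the final attempt, is re-raised unchanged.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except OSError as e:
            if e.errno != errno.ESTALE or attempt == retries:
                raise
```

A caller could wrap a single file system operation, for example `retry_on_estale(os.stat, path)`. This only helps where repeating the call is safe, which is why it suits minor deviations and not the more severe ones.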
I do also see bug 364143 and bug 400679. bug 288211 does look related, but I've never seen an explicit "stale NFS file handle" error. I'm currently on portage 2.1.10.44. Would there be any point in my trying one of the 2.2.0 versions in the tree?

I'll be glad to work with upstream NFS, but from my perspective, I don't yet see what behavior differences to ask them to fix. When I get a chown error (as in this bug) the file name listed in "bad parameter" (such as CHOST.25436) has already been renamed to CHOST. At that point, immediately doing "ebuild path/to/ebuild merge" always works (unless some different error happens, but never the chown issue) and I'm pretty sure it only does the qmerge step.

Separate question - is there anything to test here to be able to add a "|| die", since the current behavior just does a complete abort? (similar to your fix in bug 400679)
(In reply to comment #4)
> I do also see bug 364143 and bug 400679. bug 288211 does look related, but
> I've never seen an explicit "stale NFS file handle" error. I'm currently on
> portage 2.1.10.44. Would there be any point in my trying one of the 2.2.0
> versions in the tree?

No, 2.1.10.44 has the same code as 2.2.0_alpha84. The only difference is that some unrelated features are conditionally enabled in 2.2.0_alpha84.

> I'll be glad to work with upstream NFS, but from my perspective, I don't yet
> see what behavior differences to ask them to fix. When I get a chown error (as
> in this bug) the file name listed in "bad parameter" (such as CHOST.25436) has
> already been renamed to CHOST.

The chown call should complete before the rename, so it doesn't seem like they should be related. However, it's possible that some kind of NFS bug causes them to interfere somehow.

> Separate question - is there anything to test here to be able to add a "|| die"
> since the current behavior just does a complete abort? (similar to your fix in
> bug 400679)

It already aborts, so there's really nothing to change. I don't think we need to handle the error, since it seems indicative of an NFS bug that needs to be fixed rather than handled at the application level.
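The ordering described above (chown first, rename second) can be sketched as follows, assuming Python 3. This is an illustration only, not Portage's actual implementation: `write_then_rename` is a hypothetical name, and the `.PID` temporary suffix is modeled on the CHOST.25436 style names mentioned in this bug.

```python
import os


def write_then_rename(path, data, uid=-1, gid=-1):
    """Write data to a temporary sibling of path, chown it, then rename it.

    Hypothetical sketch. Passing -1 for uid or gid leaves that ID unchanged,
    which is standard os.chown behavior on Unix.
    """
    tmp_path = "%s.%d" % (path, os.getpid())
    with open(tmp_path, "w") as f:
        f.write(data)
    # Ownership is applied to the temporary name, so an OSError raised here
    # reports tmp_path, not the final path.
    os.chown(tmp_path, uid, gid)
    # The rename is only reached if chown returned without raising.
    os.rename(tmp_path, path)
```

On a local file system, a failing chown in this sketch would leave the temporary file behind under its temporary name, because the rename is never reached.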