Bug 338547 - emerge crashes if the PORTAGE_TMPDIR directory is on a misconfigured NFSv4 share

Status:	CONFIRMED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Core
Hardware:	All Linux

Importance:	High minor
Assignee:	Portage team

Blocks:	326079

Reported:	2010-09-24 12:56 UTC by Michael Mair-Keimberger (mm1ke)
Modified:	2012-02-02 15:23 UTC

Description Michael Mair-Keimberger (mm1ke) 2010-09-24 12:56:35 UTC

First of all, this is not directly a bug in emerge/Portage, since it only happens if an NFS client is misconfigured, but it would be really nice if Portage handled this problem a bit more gracefully than just crashing :)

Actually the problem is quite easy to reproduce.
On my server I export a directory via NFSv4, which other machines should use as their Portage temp directory. In my /etc/idmapd.conf I set the Domain to "foo.bar.com".
As long as all other machines have the same Domain in their /etc/idmapd.conf, emerge works well. But since it is also possible to mount NFSv4 shares without setting the right domain, emerge then fails like this:

>>> Emerging (1 of 1) sys-apps/less-436
Traceback (most recent call last):
  File "/usr/bin/emerge", line 40, in <module>
    retval = _emerge.emerge_main()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 15814, in emerge_main
    myopts, myaction, myfiles, spinner)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 14872, in action_build
    retval = mergetask.merge()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 10875, in merge
    rval = self._merge()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 11176, in _merge
    self._main_loop()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 11304, in _main_loop
    while self._schedule():
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 9583, in _schedule
    return self._schedule_tasks()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 11335, in _schedule_tasks
    if q.schedule():
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 9452, in schedule
    task.start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 1841, in start
    self._start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 4002, in _start
    self._start_task(build, self._default_final_exit)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2159, in _start_task
    task.start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 1841, in start
    self._start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2763, in _start
    self._prefetch_exit(prefetcher)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2787, in _prefetch_exit
    self._start_task(fetcher, self._fetch_exit)
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2159, in _start_task
    task.start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 1841, in start
    self._start()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2554, in _start
    self._build_dir.lock()
  File "//usr/lib/portage/pym/_emerge/__init__.py", line 2668, in lock
    mode=070, mask=0)
  File "//usr/lib/portage/pym/portage/util.py", line 1043, in ensure_dirs
    perms_modified = apply_permissions(dir_path, *args, **kwargs)
  File "//usr/lib/portage/pym/portage/util.py", line 743, in apply_permissions
    os.chown(filename, uid, gid)
OSError: [Errno 22] Invalid argument: '/var/tmp/portage'

If I set the Domain in idmapd.conf to the right one (the same as on the server), everything works fine :)
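
For illustration, the Domain comparison described above could be checked with a short script. This is only a sketch: the helper names idmapd_domain and domains_match are hypothetical, it assumes idmapd.conf is INI-like with a [General] section (as in the report), and it does not try to replicate whatever default idmapd applies when Domain is unset.

```python
import configparser


def idmapd_domain(conf_text):
    """Return the Domain value from idmapd.conf-style text, or None if unset."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    # configparser lowercases option names, so "Domain" matches "domain".
    return parser.get("General", "Domain", fallback=None)


def domains_match(client_conf_text, server_conf_text):
    """True only if both files set a Domain and the two values are identical."""
    client = idmapd_domain(client_conf_text)
    server = idmapd_domain(server_conf_text)
    return client is not None and client == server
```

The contents of /etc/idmapd.conf from the client and from the server would be passed in as strings, which keeps the check independent of how the server's copy is obtained.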

It took me quite some time to find the problem. Maybe it is possible to make a nicer output for the future?

Reproducible: Always

Steps to Reproduce:
1. start /etc/init.d/rpc.idmapd with a wrong domain in /etc/idmapd.conf
2. mount the nfs4 share to /var/tmp/ 
3. emerge something

Actual Results:  
emerge crashes with a nice traceback

Expected Results:  
some better information about the problem.

Comment 1 Zac Medico gentoo-dev 2010-09-24 13:10:33 UTC

Given the nature of the problem, I'm not sure how much better of an error message could be automatically generated. It's not like emerge would be able to know that nfs4 configuration was the root problem.
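
To make the limitation concrete, here is a sketch of about the most a wrapper around the failing call could do: catch EINVAL from chown and re-raise it with a guess attached. This is not Portage's code; chown_with_hint is a hypothetical name, and the hint is exactly that, since EINVAL alone does not identify an idmapd mismatch as the cause.

```python
import errno
import os


def chown_with_hint(path, uid, gid, chown=os.chown):
    """Call chown; on EINVAL, re-raise with a possible explanation appended.

    The hint is a guess, so the original exception is kept as the cause.
    Any other error is passed through unchanged.
    """
    try:
        chown(path, uid, gid)
    except OSError as e:
        if e.errno != errno.EINVAL:
            raise
        hint = (
            "chown to uid=%d gid=%d failed with EINVAL; if this path is on "
            "NFSv4, check that the Domain in /etc/idmapd.conf matches the "
            "server" % (uid, gid)
        )
        raise OSError(e.errno, hint, path) from e
```

The chown parameter exists only so the behavior can be exercised without a misconfigured NFS mount.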

Comment 2 Jack 2012-02-01 23:09:05 UTC

I don't know why I didn't find this bug earlier, but I think the problem is deeper or more subtle.  I get the same error, but very inconsistently - maybe 10-20% of emerges.  In my case, the trace always shows the error as on a chown, but the "invalid argument" always refers to a file that does not exist - it appears to have already been renamed, for example, from CHOST.32255 to CHOST, by the time I look at it.

In addition, I DO have the same domain specified for idmapd on both the client and server.  In my case, some emerges will work fine the next time, but some will fail many times in a row before succeeding.  Usually, however, doing "ebuild path/to/ebuild merge" will successfully complete the install.

I also get other errors apparently related to nfs4.  sudo (any version I've tried) will always install without setuid if PORTAGE_TMPDIR is nfs4, but with setuid if PORTAGE_TMPDIR is local.  Bug 400679 is about problems with a file collision on /usr/share/info/dir when nfs4 is involved.

Finally, ALL my emerges end with "rm: cannot remove `path/to/portage/tmpdir/portage/group/package/temp': Directory not empty" even though that directory is always empty by the time I look. (for any group and package)

What other information can I provide, or what troubleshooting can I do on my own?

Comment 3 Zac Medico gentoo-dev 2012-02-02 09:24:49 UTC

(In reply to comment #2)
> Finally, ALL my emerges end with "rm: cannot remove
> `path/to/portage/tmpdir/portage/group/package/temp': Directory not empty" even
> though that directory is always empty by the time I look. (for any group and
> package)

Sounds like bug 364143.

> What other information can I provide, or what troubleshooting can I do on my
> own?

All of the issues that you're having seem to be rooted in various kinds of NFS misbehavior. Any time that NFS deviates from local file system behavior, it can cause all kinds of applications to fail. Some minor deviations, like ESTALE behavior in bug 266211, have simple workarounds at the application level. More severe deviations will require fixes in NFS to make it behave more like a local file system, and you'll have to work with upstream NFS developers to make that happen.
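
As an example of what an application-level workaround for ESTALE can look like, here is a generic retry sketch. It is not the actual change made for the bug mentioned above; retry_on_estale and its attempts parameter are hypothetical.

```python
import errno


def retry_on_estale(func, *args, attempts=3):
    """Call func(*args), retrying only when it fails with ESTALE.

    Any other OSError, or ESTALE on the last attempt, is re-raised.
    """
    for remaining in range(attempts - 1, -1, -1):
        try:
            return func(*args)
        except OSError as e:
            if e.errno != errno.ESTALE or remaining == 0:
                raise
```

A call such as retry_on_estale(os.stat, path) would then tolerate a transient stale handle, while an error like EINVAL from chown would still propagate immediately.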

Comment 4 Jack 2012-02-02 13:57:21 UTC

I do also see bug 364143 and bug 400679.  bug 288211 does look related, but I've never seen an explicit "stale NFS file handle" error.   I'm currently on portage 2.1.10.44.  Would there be any point in my trying one of the 2.2.0 versions in the tree?

I'll be glad to work with upstream NFS, but from my perspective, I don't yet see what behavior differences to ask them to fix.  When I get a chown error (as in this bug) the file name listed in "bad parameter" (such as CHOST.25436) has already been renamed to CHOST.  At that point, immediately doing "ebuild path/to/ebuild merge" always works (unless some different error happens, but never the chown issue) and I'm pretty sure it only does the qmerge step.

Separate question - is there anything to test here to be able to add a "|| die" since the current behavior just does a complete abort?  (similar to your fix in bug 400679)

Comment 5 Zac Medico gentoo-dev 2012-02-02 14:06:26 UTC

(In reply to comment #4)
> I do also see bug 364143 and bug 400679.  bug 288211 does look related, but
> I've never seen an explicit "stale NFS file handle" error.   I'm currently on
> portage 2.1.10.44.  Would there be any point in my trying one of the 2.2.0
> versions in the tree?

No, 2.1.10.44 has the same code as 2.2.0_alpha84. The only difference is that some unrelated features are conditionally enabled in 2.2.0_alpha84.

> I'll be glad to work with upstream NFS, but from my perspective, I don't yet
> see what behavior differences to ask them to fix.  When I get a chown error (as
> in this bug) the file name listed in "bad parameter" (such as CHOST.25436) has
> already been renamed to CHOST.

The chown call should complete before the rename, so it doesn't seem like they should be related. However, it's possible that some kind of NFS bug causes them to interfere somehow.
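
The ordering in question can be sketched as a write-to-temporary-name, chown, then rename sequence. This is a simplified illustration rather than Portage's implementation: write_atomic_sketch is a hypothetical name, and the numeric suffix in names like CHOST.25436 is assumed here to be a process ID.

```python
import os


def write_atomic_sketch(path, data, uid=-1, gid=-1):
    """Write data to a pid-suffixed temp file, chown it, then rename it.

    On a local file system the chown has finished before rename is called,
    so a chown error should always name the temp file while it still exists.
    """
    tmp_path = "%s.%d" % (path, os.getpid())
    with open(tmp_path, "w") as f:
        f.write(data)
    try:
        os.chown(tmp_path, uid, gid)  # -1 leaves that ID unchanged
        os.rename(tmp_path, path)     # replaces path in one step
    except OSError:
        os.unlink(tmp_path)
        raise
```

With this ordering, seeing the final name already in place while chown reports an error on the temporary name is not something the sequence itself should produce.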

> Separate question - is there anything to test here to be able to ad a "|| die"
> since the current behavior just does a complete abort?  (similar to your fix in
> bug 400679)

It already aborts, so there's really nothing to change. I don't think we need to handle the error, since it seems indicative of an NFS bug that needs to be fixed rather than handled at the application level.