212882 – IOError 13: Permission Denied when portage is shared via CIFS, and distlocks is used

Bug 212882 - IOError 13: Permission Denied when portage is shared via CIFS, and distlocks is used

Summary: IOError 13: Permission Denied when portage is shared via CIFS, and distlocks ...

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Core (show other bugs)
Hardware:	All Linux

Importance:	High normal
Assignee:	Portage team

URL:
Whiteboard:
Keywords:	InVCS

Depends on:
Blocks:	216231
	Show dependency tree

Reported:	2008-03-10 00:06 UTC by MCassaniti
Modified:	2010-09-23 02:05 UTC (History)
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Attachments
handle EACCES like EAGAIN (eacces.patch,460 bytes, patch) 2008-03-10 16:51 UTC, Zac Medico	Details \| Diff
also handle ENOENT from fstat calls (lock-cifs.patch,1.66 KB, patch) 2008-03-17 02:33 UTC, Zac Medico	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description MCassaniti 2008-03-10 00:06:31 UTC

The current lockfile system works fine with a CIFS shared portage directory while only one instance of emerge is using the same lock. However, once a second emerge process attempts to use the same lockfile, portage dies with an IOError 13: Permission Denied error. This is due to the line "fcntl.lockf(myfd,fcntl.LOCK_EX|fcntl.LOCK_NB)" in /usr/lib/portage/pym/portage_locks.py

There is code in place to catch exceptions related to failing locks, but their is not one for a permission denied exception. One thread in the forum: (http://forums.gentoo.org/viewtopic-t-412722.html) suggests that adding dir_mode=02775 to the CIFS mount options may resolve the issue. If this truly is the case, then I believe that portage should throw an exception, but paste a warning about CIFS and dir_mode=02775. If using dir_mode=02775 is only a workaround, then a solution is currently not known to resolve this issue.

I have checked that the Samba server is not causing problems with file permissions. It isn't causing any problems at present, and I have checked as thoroughly as I can.

Thread http://forums.gentoo.org/viewtopic-t-514872.html may assist with some details.

Reproducible: Always

Steps to Reproduce:
1. Make sure portage directory is mounted on a host via CIFS
2. Run emerge -f package, where package download time is large enough so that you can start another emerge process
3. Start a second emerge -f package. It should _normally_ wait for the lock to be released, but instead it crashes.

Actual Results:
Threw exception due to code line fcntl.lockf(myfd,fcntl.LOCK_EX|fcntl.LOCK_NB)

Complained about IOError 13: Permission Denied

Expected Results:
At minimum suggested a work around that would cause this exception not to occur. At best, found that the underlying file-system was CIFS and attempted a different locking strategy (HARDLINK?)

Comment 1 SpanKY gentoo-dev

2008-03-10 15:30:56 UTC

if CIFS returns EPERM when attempting to FLOCK an already FLOCK-ed file, then CFIS is broken.  it should be setting errno to EACCES or EAGAIN.

first we need a distilled test case ...

Comment 2 Zac Medico gentoo-dev

2008-03-10 16:51:42 UTC

Created attachment 145743 [details, diff]
handle EACCES like EAGAIN

(In reply to comment #1)
> if CIFS returns EPERM when attempting to FLOCK an already FLOCK-ed file, then
> CFIS is broken.  it should be setting errno to EACCES or EAGAIN.

EACCES == "Permission denied
EPERM == "Operation not permitted"

Comment 3 MCassaniti 2008-03-11 00:27:10 UTC

Please specify what you mean by a distilled test case. I can give you all the smb.conf details you need, and all the relavent permissions, but I still don't see  the issue as being the Samba server. The server returns EACCESS problems, which is probably the correct thing to do under CIFS, but may not be the expected response under a POSIX style filesystem.

Comment 4 Zac Medico gentoo-dev

2008-03-11 08:53:52 UTC

The attached patch should handle it. If you could test it then that would be great.

Comment 5 MCassaniti 2008-03-12 03:49:56 UTC

(In reply to comment #4)
> The attached patch should handle it. If you could test it then that would be
> great.
> 
Tested a slightly modified version of the patch. It seems that (errno.EAGAIN, errno.EACCES) did not successfully work, but the line:
if e.errno == errno.EAGAIN or e.errno == errno.EACCES: will work correctly.

The client that should be waiting on the lock file will now successully wait until the lock is released.

There is a new problem which occurs when a client waiting on the lock file notices that the lock is released. This does not always occur, which suggests that it could be a race condition of some sort.

  File "/usr/lib/portage/pym/portage_locks.py", line 113, in lockfile
    if type(lockfilename) == types.StringType and \
OSError: [Errno 2] No such file or directory

This section of code deals with lock files that suddenly disappear by accident. The full section of code is:
        if type(lockfilename) == types.StringType and \
                myfd != HARDLINK_FD and os.fstat(myfd).st_nlink == 0:
                # The file was deleted on us... Keep trying to make one...
                os.close(myfd)

I can post a new bug report if you would like.

Comment 6 MCassaniti 2008-03-12 06:43:43 UTC

After some review, it seems that the OSError 2 exception occurs as a race condition between the process owning the lock, and the process attempting to gain the lock. In the unlockfile function of portage_locks, there is the following commented block:

# This sleep call was added to allow other processes that are
# waiting for a lock to be able to grab it before it is deleted.
# lockfile() already accounts for this situation, however, and
# the sleep here adds more time than is saved overall, so am
# commenting until it is proved necessary.
# time.sleep(0.0001)

I would suggest that although lockfile() may account for this situation, what happens when two clients on a network are after the same lock. When one goes to release the lock, the other may attempt to grab a lock to a (very) recently deleted file. Of course, if the file is not being unlinked, then this is not a problem, but generally it will be unlinked.

The timeout needs to be chosen appropriately to allow for network delays and the like. If the server happened to have initiated the lock, and a CIFS client is waiting on the lock, then the timeout will need to be quite high from what I can gather. Unless I misconfigured part of my kernel, the CIFS client takes about 10 seconds to notice the file is unlocked. In this special case, a recursive call to lockfile() should take place. The if statement requires modification to fix this.

In summary I suggest:
Uncomment timeout in unlockfile() and set the value to 0.1 for now
Change the last if statement of lockfile() from this:
if type(lockfilename) == types.StringType and \
    myfd != HARDLINK_FD and os.fstat(myfd).st_nlink == 0:

to the following:
if type(lockfilename) == types.StringType and \
    myfd != HARDLINK_FD:
        if not os.path.exists(lockfilename) or os.fstat(myfd).st_nlink == 0:

This will handle the case when the lock file does not exist, or when it does exist, but there are no links to it. I don't truly know the difference.

Comment 7 Zac Medico gentoo-dev

2008-03-16 00:05:19 UTC

(In reply to comment #5)
> There is a new problem which occurs when a client waiting on the lock file
> notices that the lock is released. This does not always occur, which suggests
> that it could be a race condition of some sort.
> 
>   File "/usr/lib/portage/pym/portage_locks.py", line 113, in lockfile
>     if type(lockfilename) == types.StringType and \
> OSError: [Errno 2] No such file or directory

So, are you saying that the fstat call raised "OSError: [Errno 2] No such file or directory"? If so, that seems to go against the spec:

http://www.opengroup.org/onlinepubs/009695399/functions/fstat.html

According to that, the valid errors are EBADF, EIO, or EOVERFLOW. If the kernel's CIFS implementation isn't conforming to the spec then it seems like they should fix it to do so (or else the spec needs to be updated).

Comment 8 MCassaniti 2008-03-16 22:28:57 UTC

You need to remember that CIFS is designed to work with Microsoft Windows. Any error messages returned by a Samba server, or other server running CIFS will conform to what is expected under Microsoft's implementation of CIFS.

This is not to discredit what you are saying a fstat() should return. The problem is that CIFS itself (not Linux or Samba's implementation) does not seem to conform to all the standards.

If I am wrong about this, than I am sorry and will need to file a bug report with the Samba team. If I have it right, could someone please test my changes and have these modifications included in a subsequent release of portage.

To my understanding, the file descriptor is valid on the client, but the file does not exist on the server. Attempting to reference the file on the server returns a no file or directory error [IOError 2], while the file descriptor is still valid on the client, which means an EBADF error is not returned. Like I said, this is to my own understanding.

What happens on NFS? I assume that NFS is more POSIX compliant, so does the fstat() run correctly over NFS?

Comment 9 Zac Medico gentoo-dev

2008-03-16 23:31:31 UTC

(In reply to comment #8)
> To my understanding, the file descriptor is valid on the client, but the file
> does not exist on the server. Attempting to reference the file on the server
> returns a no file or directory error [IOError 2], while the file descriptor is
> still valid on the client, which means an EBADF error is not returned. Like I
> said, this is to my own understanding.

We can cleanly handle this case by catching the ENOENT exception and handling it appropriately. This case should be essentially equivalent to the case where the fstat call succeeds and the number of links is found to be zero.

Comment 10 Zac Medico gentoo-dev

2008-03-17 02:33:41 UTC

Created attachment 146366 [details, diff]
also handle ENOENT from fstat calls

(In reply to comment #5)
> Tested a slightly modified version of the patch. It seems that (errno.EAGAIN,
> errno.EACCES) did not successfully work, but the line:
> if e.errno == errno.EAGAIN or e.errno == errno.EACCES: will work correctly.

I think you should test my version again because 'e.errno in (errno.EACCES, errno.EAGAIN)' should work just fine.

Comment 11 MCassaniti 2008-03-25 03:19:31 UTC

I have tested this patch in its entirety, and I have not had any more trouble with CIFS and portage distlocks.
I personally think that the additional function to determine the number of links to a file might be overkill, but it does do the job.

I have marked this patch as "worksforme". I don't know if it is appropriate yet to mark the bug as fixed. You may wish to have more test cases.

Comment 12 Zac Medico gentoo-dev

2008-03-25 16:29:56 UTC

(In reply to comment #11)
> I personally think that the additional function to determine the number of
> links to a file might be overkill, but it does do the job.

Well, the os.path.exists() approach alone would be flawed because it does not ensure that the existing file corresponds to the same file as the open file descriptor. Use of fstat() avoids this issue since it operates on the file descriptor instead of the file path.

Comment 13 MCassaniti 2008-03-26 00:02:42 UTC

I should have known that. My bad...
What is the next step in the process to getting this patch pushed into the portage package? I did notice that the patch is in SVN.

Does portage have any regression tests? If so, then a unit test for this bug will be required, which could be quite difficult to test without appropriate infrastructure.

What does it take to mark this bug as closed?

Comment 14 Zac Medico gentoo-dev

2008-03-26 04:51:10 UTC

I'm planning to do a 2.1.4.5 release soon and I'll close the bug when that's been released for testing.

We do have a number of tests that run when FEATURES=test is enabled, but there are currently no tests for the locking api. However, the patch only makes simple behavioral changes in a couple of well defined cases. Your and my testing seem to have covered all relevant use cases, so I'm confident that there are no regressions.