Bug List: (This bug is not in your last search results)   Show last search results      Search page      Enter new bug
Bug#: 212882
Alias:
Product:
Component:
Status: RESOLVED
Resolution: WORKSFORME
Assigned To: Portage team <dev-portage@gentoo.org>
Hardware:
OS:
Version:
Priority:
Severity:
Reporter: casso <m.cassaniti@gmail.com>
Add CC:
CC:
URL:
Summary:
Status Whiteboard:
Keywords:

Filename Description Type Creator Created Size Actions
eacces.patch handle EACCES like EAGAIN patch Zac Medico 2008-03-10 16:51 0000 460 bytes Details | Diff
lock-cifs.patch also handle ENOENT from fstat calls patch Zac Medico 2008-03-17 02:33 0000 1.66 KB Details | Diff
Create a New Attachment (proposed patch, testcase, etc.) View All

Bug 212882 depends on: Show dependency tree
Bug 212882 blocks: 210077 216231
Votes: 0    Show votes for this bug    Vote for this bug

Additional Comments: (this is where you put emerge --info)


Not eligible to see or edit group visibility for this bug.






View Bug Activity   |   Format For Printing   |   XML   |   Clone This Bug


Description:   Opened: 2008-03-10 00:06 0000
The current lockfile system works fine with a CIFS shared portage directory
while only one instance of emerge is using the same lock. However, once a
second emerge process attempts to use the same lockfile, portage dies with an
IOError 13: Permission Denied error. This is due to the line
"fcntl.lockf(myfd,fcntl.LOCK_EX|fcntl.LOCK_NB)" in
/usr/lib/portage/pym/portage_locks.py

There is code in place to catch exceptions related to failing locks, but their
is not one for a permission denied exception. One thread in the forum:
(http://forums.gentoo.org/viewtopic-t-412722.html) suggests that adding
dir_mode=02775 to the CIFS mount options may resolve the issue. If this truly
is the case, then I believe that portage should throw an exception, but paste a
warning about CIFS and dir_mode=02775. If using dir_mode=02775 is only a
workaround, then a solution is currently not known to resolve this issue.

I have checked that the Samba server is not causing problems with file
permissions. It isn't causing any problems at present, and I have checked as
thoroughly as I can.

Thread http://forums.gentoo.org/viewtopic-t-514872.html may assist with some
details.

Reproducible: Always

Steps to Reproduce:
1. Make sure portage directory is mounted on a host via CIFS
2. Run emerge -f package, where package download time is large enough so that
you can start another emerge process
3. Start a second emerge -f package. It should _normally_ wait for the lock to
be released, but instead it crashes.

Actual Results:  
Threw exception due to code line fcntl.lockf(myfd,fcntl.LOCK_EX|fcntl.LOCK_NB)

Complained about IOError 13: Permission Denied

Expected Results:  
At minimum suggested a work around that would cause this exception not to
occur. At best, found that the underlying file-system was CIFS and attempted a
different locking strategy (HARDLINK?)

------- Comment #1 From SpanKY 2008-03-10 15:30:56 0000 -------
if CIFS returns EPERM when attempting to FLOCK an already FLOCK-ed file, then
CFIS is broken.  it should be setting errno to EACCES or EAGAIN.

first we need a distilled test case ...

------- Comment #2 From Zac Medico 2008-03-10 16:51:42 0000 -------
Created an attachment (id=145743) [details]
handle EACCES like EAGAIN

(In reply to comment #1)
> if CIFS returns EPERM when attempting to FLOCK an already FLOCK-ed file, then
> CFIS is broken.  it should be setting errno to EACCES or EAGAIN.

EACCES == "Permission denied
EPERM == "Operation not permitted"

------- Comment #3 From casso 2008-03-11 00:27:10 0000 -------
Please specify what you mean by a distilled test case. I can give you all the
smb.conf details you need, and all the relavent permissions, but I still don't
see  the issue as being the Samba server. The server returns EACCESS problems,
which is probably the correct thing to do under CIFS, but may not be the
expected response under a POSIX style filesystem.

------- Comment #4 From Zac Medico 2008-03-11 08:53:52 0000 -------
The attached patch should handle it. If you could test it then that would be
great.

------- Comment #5 From casso 2008-03-12 03:49:56 0000 -------
(In reply to comment #4)
> The attached patch should handle it. If you could test it then that would be
> great.
> 
Tested a slightly modified version of the patch. It seems that (errno.EAGAIN,
errno.EACCES) did not successfully work, but the line:
if e.errno == errno.EAGAIN or e.errno == errno.EACCES: will work correctly.

The client that should be waiting on the lock file will now successully wait
until the lock is released.

There is a new problem which occurs when a client waiting on the lock file
notices that the lock is released. This does not always occur, which suggests
that it could be a race condition of some sort.

  File "/usr/lib/portage/pym/portage_locks.py", line 113, in lockfile
    if type(lockfilename) == types.StringType and \
OSError: [Errno 2] No such file or directory

This section of code deals with lock files that suddenly disappear by accident.
The full section of code is:
        if type(lockfilename) == types.StringType and \
                myfd != HARDLINK_FD and os.fstat(myfd).st_nlink == 0:
                # The file was deleted on us... Keep trying to make one...
                os.close(myfd)

I can post a new bug report if you would like.

------- Comment #6 From casso 2008-03-12 06:43:43 0000 -------
After some review, it seems that the OSError 2 exception occurs as a race
condition between the process owning the lock, and the process attempting to
gain the lock. In the unlockfile function of portage_locks, there is the
following commented block:

# This sleep call was added to allow other processes that are
# waiting for a lock to be able to grab it before it is deleted.
# lockfile() already accounts for this situation, however, and
# the sleep here adds more time than is saved overall, so am
# commenting until it is proved necessary.
# time.sleep(0.0001)

I would suggest that although lockfile() may account for this situation, what
happens when two clients on a network are after the same lock. When one goes to
release the lock, the other may attempt to grab a lock to a (very) recently
deleted file. Of course, if the file is not being unlinked, then this is not a
problem, but generally it will be unlinked.

The timeout needs to be chosen appropriately to allow for network delays and
the like. If the server happened to have initiated the lock, and a CIFS client
is waiting on the lock, then the timeout will need to be quite high from what I
can gather. Unless I misconfigured part of my kernel, the CIFS client takes
about 10 seconds to notice the file is unlocked. In this special case, a
recursive call to lockfile() should take place. The if statement requires
modification to fix this.

In summary I suggest:
Uncomment timeout in unlockfile() and set the value to 0.1 for now
Change the last if statement of lockfile() from this:
if type(lockfilename) == types.StringType and \
    myfd != HARDLINK_FD and os.fstat(myfd).st_nlink == 0:

to the following:
if type(lockfilename) == types.StringType and \
    myfd != HARDLINK_FD:
        if not os.path.exists(lockfilename) or os.fstat(myfd).st_nlink == 0:

This will handle the case when the lock file does not exist, or when it does
exist, but there are no links to it. I don't truly know the difference.

------- Comment #7 From Zac Medico 2008-03-16 00:05:19 0000 -------
(In reply to comment #5)
> There is a new problem which occurs when a client waiting on the lock file
> notices that the lock is released. This does not always occur, which suggests
> that it could be a race condition of some sort.
> 
>   File "/usr/lib/portage/pym/portage_locks.py", line 113, in lockfile
>     if type(lockfilename) == types.StringType and \
> OSError: [Errno 2] No such file or directory

So, are you saying that the fstat call raised "OSError: [Errno 2] No such file
or directory"? If so, that seems to go against the spec:

http://www.opengroup.org/onlinepubs/009695399/functions/fstat.html

According to that, the valid errors are EBADF, EIO, or EOVERFLOW. If the
kernel's CIFS implementation isn't conforming to the spec then it seems like
they should fix it to do so (or else the spec needs to be updated).

------- Comment #8 From casso 2008-03-16 22:28:57 0000 -------
You need to remember that CIFS is designed to work with Microsoft Windows. Any
error messages returned by a Samba server, or other server running CIFS will
conform to what is expected under Microsoft's implementation of CIFS.

This is not to discredit what you are saying a fstat() should return. The
problem is that CIFS itself (not Linux or Samba's implementation) does not seem
to conform to all the standards.

If I am wrong about this, than I am sorry and will need to file a bug report
with the Samba team. If I have it right, could someone please test my changes
and have these modifications included in a subsequent release of portage.

To my understanding, the file descriptor is valid on the client, but the file
does not exist on the server. Attempting to reference the file on the server
returns a no file or directory error [IOError 2], while the file descriptor is
still valid on the client, which means an EBADF error is not returned. Like I
said, this is to my own understanding.

What happens on NFS? I assume that NFS is more POSIX compliant, so does the
fstat() run correctly over NFS?

------- Comment #9 From Zac Medico 2008-03-16 23:31:31 0000 -------
(In reply to comment #8)
> To my understanding, the file descriptor is valid on the client, but the file
> does not exist on the server. Attempting to reference the file on the server
> returns a no file or directory error [IOError 2], while the file descriptor is
> still valid on the client, which means an EBADF error is not returned. Like I
> said, this is to my own understanding.

We can cleanly handle this case by catching the ENOENT exception and handling
it appropriately. This case should be essentially equivalent to the case where
the fstat call succeeds and the number of links is found to be zero.

------- Comment #10 From Zac Medico 2008-03-17 02:33:41 0000 -------
Created an attachment (id=146366) [details]
also handle ENOENT from fstat calls

(In reply to comment #5)
> Tested a slightly modified version of the patch. It seems that (errno.EAGAIN,
> errno.EACCES) did not successfully work, but the line:
> if e.errno == errno.EAGAIN or e.errno == errno.EACCES: will work correctly.

I think you should test my version again because 'e.errno in (errno.EACCES,
errno.EAGAIN)' should work just fine.

------- Comment #11 From casso 2008-03-25 03:19:31 0000 -------
I have tested this patch in its entirety, and I have not had any more trouble
with CIFS and portage distlocks.
I personally think that the additional function to determine the number of
links to a file might be overkill, but it does do the job.

I have marked this patch as "worksforme". I don't know if it is appropriate yet
to mark the bug as fixed. You may wish to have more test cases.

------- Comment #12 From Zac Medico 2008-03-25 16:29:56 0000 -------
(In reply to comment #11)
> I personally think that the additional function to determine the number of
> links to a file might be overkill, but it does do the job.

Well, the os.path.exists() approach alone would be flawed because it does not
ensure that the existing file corresponds to the same file as the open file
descriptor. Use of fstat() avoids this issue since it operates on the file
descriptor instead of the file path.

------- Comment #13 From casso 2008-03-26 00:02:42 0000 -------
I should have known that. My bad...
What is the next step in the process to getting this patch pushed into the
portage package? I did notice that the patch is in SVN.

Does portage have any regression tests? If so, then a unit test for this bug
will be required, which could be quite difficult to test without appropriate
infrastructure.

What does it take to mark this bug as closed?

------- Comment #14 From Zac Medico 2008-03-26 04:51:10 0000 -------
I'm planning to do a 2.1.4.5 release soon and I'll close the bug when that's
been released for testing.

We do have a number of tests that run when FEATURES=test is enabled, but there
are currently no tests for the locking api. However, the patch only makes
simple behavioral changes in a couple of well defined cases. Your and my
testing seem to have covered all relevant use cases, so I'm confident that
there are no regressions.

Bug List: (This bug is not in your last search results)   Show last search results      Search page      Enter new bug