Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 636798 - Lockfile removal on portage nfs share causing simultaneous emerge to fail
Summary: Lockfile removal on portage nfs share causing simultaneous emerge to fail
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core (show other bugs)
Hardware: All Linux
: Normal normal (vote)
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS
Depends on:
Blocks: 671498
  Show dependency tree
 
Reported: 2017-11-07 15:47 UTC by William L. Thomson Jr.
Modified: 2019-04-10 04:36 UTC (History)
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
test case (test-lock.py,1.38 KB, text/x-python)
2019-02-05 01:29 UTC, Zac Medico
Details

Note You need to log in before you can comment on or make changes to this bug.
Description William L. Thomson Jr. 2017-11-07 15:47:25 UTC
I have portage main repository on a NFS share. I went to update a few systems basically at the same time. I ran into the following error.

 * waiting for lock on /usr/portage/packages/amd64/.Packages.portage_lockfile ...        [ ok ]
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/emerge", line 50, in <module>
    retval = emerge_main()
  File "/usr/lib64/python3.5/site-packages/_emerge/main.py", line 1250, in emerge_main
    return run_action(emerge_config)
  File "/usr/lib64/python3.5/site-packages/_emerge/actions.py", line 2930, in run_action
    getbinpkgs="--getbinpkg" in emerge_config.opts)
  File "/usr/lib64/python3.5/site-packages/portage/dbapi/bintree.py", line 530, in populate
    wantnewlockfile=1)
  File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 282, in lockfile
    waiting_msg=waiting_msg, flags=flags)
  File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 152, in lockfile
    myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660)
  File "/usr/lib64/python3.5/site-packages/portage/__init__.py", line 250, in __call__
    rval = self._func(*wrapped_args, **wrapped_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: b'/usr/portage/packages/amd64/.Packages.portage_lockfile'


I guess it needs the lock file to have the same name to prevent like multiple downloads of same file, etc. Though on removal caused others to fail. Seems after waiting for lock, then being able to obtain lock file. Another system removed the lock file, causing the above error on a few other systems.

If the lock file is system specific vs portage/repository specific. Ideally it would have a name more unique to the system. Otherwise maybe the lock file needs to have a shared lock. So if other references are still open to the file, it is not removed when another tries to remove it.

Not sure exactly on resolution here. Ideally one could run simultaneous merges using a shared portage without such error. If I wait till it starts actually emerging packages before I start the next. I do not run into such. It seems to only occur when more than one emerge process is doing dependency calculation or something. Its way before any downloads, etc.
Comment 1 Mike Gilbert gentoo-dev 2017-11-07 16:26:58 UTC
> myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660)

This is telling the system to open or create a file; it seems strange that this call could fail with "no such file or directory".

Some possibilities:

1. The parent directory got removed somehow. Not sure how that would happen.

2. Maybe NFS has some racy behavior when you call open(..., O_CREAT).
Comment 2 William L. Thomson Jr. 2017-11-08 01:26:43 UTC
(In reply to Mike Gilbert from comment #1)
> > myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660)
> 
> This is telling the system to open or create a file; it seems strange that
> this call could fail with "no such file or directory".
> 
> Some possibilities:
> 
> 1. The parent directory got removed somehow. Not sure how that would happen.

Pretty sure it was just the file. I can see about replicating and confirming file is gone. If parent directory was missing I would have major problems, missing binaries.

> 2. Maybe NFS has some racy behavior when you call open(..., O_CREAT).

I have seen some other oddities which I think are more nfs + python than portage specific. Just triggered by portage.
https://bugs.gentoo.org/499528

Though it could be as simple as missing os.O_SHLOCK
https://docs.python.org/2/library/os.html#os.O_SHLOCK
Comment 3 William L. Thomson Jr. 2017-11-13 02:36:58 UTC
Just ran into this again on the same system. I was updating some packages, went to install another and in between packages I was updating, it took out the single emerge process. In the middle of an emerge...


Saving to: ‘/usr/portage/distfiles/access-modifier-1.12.tar.gz’

/usr/portage/distfi     [ <=>                ]  14.78K  --.-KB/s    in 0.04s

2017-11-12 21:33:10 (364 KB/s) - ‘/usr/portage/distfiles/access-modifier-1.12.tar.gz’ saved [15139]

>>> Creating Manifest for /usr/portage/local/os-xtoo/dev-java/access-modifier-annotation

 * IMPORTANT: 13 news items need reading for repository 'gentoo'.
 * Use eselect news read to view new items

>>> Verifying ebuild manifests
>>> Emerging (1 of 1) dev-java/access-modifier-annotation-1.12::os-xtoo
>>> Jobs: 0 of 1 complete, 1 running                Load avg: 2.29, 1.11, 0.70
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/emerge", line 50, in <module>
    retval = emerge_main()
  File "/usr/lib64/python3.5/site-packages/_emerge/main.py", line 1250, in emerge_main
    return run_action(emerge_config)
  File "/usr/lib64/python3.5/site-packages/_emerge/actions.py", line 3297, in run_action
    retval = action_build(emerge_config, spinner=spinner)
  File "/usr/lib64/python3.5/site-packages/_emerge/actions.py", line 540, in action_build
    retval = mergetask.merge()
  File "/usr/lib64/python3.5/site-packages/_emerge/Scheduler.py", line 1039, in merge
    rval = self._merge()
  File "/usr/lib64/python3.5/site-packages/_emerge/Scheduler.py", line 1444, in _merge
    self._main_loop()
  File "/usr/lib64/python3.5/site-packages/_emerge/Scheduler.py", line 1422, in _main_loop
    self._event_loop.iteration()
  File "/usr/lib64/python3.5/site-packages/portage/util/_eventloop/EventLoop.py", line 333, in iteration
    if not x.callback(f, event, *x.args):
  File "/usr/lib64/python3.5/site-packages/portage/util/_async/PipeLogger.py", line 92, in _output_handler
    self.wait()
  File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 57, in wait
    self._wait_hook()
  File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 175, in _wait_hook
    self._exit_listener_stack.pop()(self)
  File "/usr/lib64/python3.5/site-packages/_emerge/SpawnProcess.py", line 173, in _pipe_logger_exit
    self._async_waitpid()
  File "/usr/lib64/python3.5/site-packages/_emerge/SubProcess.py", line 113, in _async_waitpid
    self.pid, self._async_waitpid_cb)
  File "/usr/lib64/python3.5/site-packages/portage/util/_eventloop/EventLoop.py", line 411, in child_watch_add
    self._poll_child_processes()
  File "/usr/lib64/python3.5/site-packages/portage/util/_eventloop/EventLoop.py", line 455, in _poll_child_processes
    x.callback(x.pid, wait_retval[1], x.data)
  File "/usr/lib64/python3.5/site-packages/_emerge/SubProcess.py", line 119, in _async_waitpid_cb
    self.wait()
  File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 57, in wait
    self._wait_hook()
  File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 175, in _wait_hook
    self._exit_listener_stack.pop()(self)
  File "/usr/lib64/python3.5/site-packages/_emerge/EbuildPhase.py", line 270, in _ebuild_exit
    self.wait()
  File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 57, in wait
    self._wait_hook()
  File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 175, in _wait_hook
    self._exit_listener_stack.pop()(self)
  File "/usr/lib64/python3.5/site-packages/_emerge/EbuildBinpkg.py", line 46, in _package_phase_exit
    filename=self._binpkg_tmpfile)
  File "/usr/lib64/python3.5/site-packages/portage/dbapi/bintree.py", line 1090, in inject
    wantnewlockfile=1)
  File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 282, in lockfile
    waiting_msg=waiting_msg, flags=flags)
  File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 152, in lockfile
    myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660)
  File "/usr/lib64/python3.5/site-packages/portage/__init__.py", line 250, in __call__
    rval = self._func(*wrapped_args, **wrapped_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: b'/usr/portage/packages/amd64/.Packages.portage_lockfile'
Comment 4 William L. Thomson Jr. 2017-11-15 21:16:56 UTC
I think this is a result of a change to a recent version of portage. I am not experience this in production. I really do not want to experience this on stable systems. I have a few in my dev environment. But even there its time consuming to not be able to update systems at the same time. They have to be updated one by one as the first will take out all others.

This will be very frustrating if it makes it to stable as I have many servers that update from binaries made. That will really be time consuming to have to update each one by one.

Like I cannot use anisble to run updates pulling in binaries on more than one system at a time. That is not really ideal... If I can help resolve in any way let me know. I am not very familiar with python. Happy to test anything.
Comment 5 Zac Medico gentoo-dev 2017-11-15 21:36:38 UTC
There haven't been any recent changes to either the locking code or the related binary package code that calls it.

It looks like the filesystem is misbehaving somehow. What client kernel version are you experiencing this with? Maybe the server kernel version also matters.

We could mitigate the issue by skipping the lock in cases where there's nothing to update in the $PKGDIR/Packages file.
Comment 6 Zac Medico gentoo-dev 2017-11-15 21:50:54 UTC
For the mitigation, we could split out a separate method for everything prior to the self._pkgindex_write(pkgindex) call inside the binarytree._populate method. The new method would be called once without a lock, and if it detects anything to update in $PKGDIR/Packages then it would be called again with the lock acquired.
Comment 7 William L. Thomson Jr. 2017-11-18 01:56:54 UTC
Server and all clients are 4.13.11-gentoo. Went through and updated all them the other day. I guess that could explain change in behavior. My work flow has been pretty much the same for the last year or so. I work with binaries and multiple simultaneous merges etc.

Maybe a python thing? I did switch to 3.5 from 3.4. I was trying to get portage on pypy but did nothing special other than adding that to PYTHON_TARGETS.
Comment 8 Zac Medico gentoo-dev 2017-11-21 01:41:18 UTC
Avoid lock when possible (commit message references the wrong bug):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=8267445cf2f8697f12f1424ecdb4f495dc19f27f
Comment 9 William L. Thomson Jr. 2017-11-21 03:48:11 UTC
Thank you! I will see about testing that out ASAP.
Comment 10 Zac Medico gentoo-dev 2017-11-21 05:19:55 UTC
The fix is released in portage-2.3.15. I've noticed that it continuously re-writes the Packages file if the MD5 digest is missing for any of the packages, but that can be corrected by running `emaint --fix binhost`.
Comment 11 William L. Thomson Jr. 2018-01-23 17:54:17 UTC
Presently on 2.3.19 need to update to 2.3.20. Still running into this. Though I finally have a way to replicate and test for fix. Merge two packages on the same or different systems, both making binaries, with portage on NFS mount. That seems to reliably replicate the issue. When both go to generate the tarball of the just compiled package, they seem to collide on lock file and one removes it and causes the other to fail. Same output as first report.

If you can put portage on NFS mount, and then do 2 binary builds on different systems, may work on the same system not sure. It maybe tricky to get both to generate the tarball at the same time.
Comment 12 William L. Thomson Jr. 2018-01-26 01:32:46 UTC
Looks like it happens when one system is creating a bin package, install phase I think. Then on another any merge involving binaries. That seems to be another way to replicate. Easier to time, just start a binary merge during install of a from source binary build.

Not sure NFS even matters, but havne't tried it on the same system.
Comment 13 Zac Medico gentoo-dev 2019-02-05 01:17:32 UTC
(In reply to William L. Thomson Jr. from comment #0)
> I have portage main repository on a NFS share. I went to update a few
>     myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660)
>   File "/usr/lib64/python3.5/site-packages/portage/__init__.py", line 250,
> in __call__
>     rval = self._func(*wrapped_args, **wrapped_kwargs)
> FileNotFoundError: [Errno 2] No such file or directory:
> b'/usr/portage/packages/amd64/.Packages.portage_lockfile'

The FileNotFoundError is strange because we're using the O_CREAT flag which is supposed to create the file automatically. This may be an nfs quirk or bug.

I'm able to reproduce this using a test case with 2 processes competing for the same lock.
Comment 14 Zac Medico gentoo-dev 2019-02-05 01:29:46 UTC
Created attachment 563766 [details]
test case

There are 2 failure modes, one is the FileNotFoundError and the other is the AssertionError which is raised when locked contend is overwritten.
Comment 15 Zac Medico gentoo-dev 2019-02-05 20:24:50 UTC
(In reply to Zac Medico from comment #14)
> Created attachment 563766 [details]
> test case
> 
> There are 2 failure modes, one is the FileNotFoundError and the other is the
> AssertionError which is raised when locked contend is overwritten.

If I add retry to the O_CREAT open call, it fixes the FileNotFoundError, but the AssertionError case is still triggered.

On the same nfs share, hardlink_lockfile/unhardlink_lockfile works reliably with the same test case.
Comment 16 Zac Medico gentoo-dev 2019-02-05 21:30:31 UTC
The test case works reliably for me on nfs if I replace portage's lockfile/unlockfile implementation with simplified implementations like these which never remove the lock file:

> @contextmanager
> def flock(lock_path):
> 	with open(lock_path + '.lockfile', 'wb+') as f:
> 		try:
> 			fcntl.flock(f, fcntl.LOCK_EX)
> 			yield
> 		finally:
> 			pass
> 
> 
> @contextmanager
> def lockf(lock_path):
> 	with open(lock_path + '.lockfile', 'wb+') as f:
> 		try:
> 			fcntl.lockf(f, fcntl.LOCK_EX)
> 			yield
> 		finally:
> 			pass

So, there must be some flaw in the handling of removed lock files.
Comment 17 Zac Medico gentoo-dev 2019-02-06 00:50:49 UTC
I've found that there's a race involving removal of lock files. After a lock is acquired, there is a window of time were it's not possible to know whether the file has actually been removed. Due to the ac mount option, comparison of stat/fstat results is not reliable during this time window.

Meanwhile, the "hardlink" locking approach appears to be completely reliable.
Comment 19 Larry the Git Cow gentoo-dev 2019-02-11 19:42:18 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=d57fade44620fdd58e4e75d77136887a617ce674

commit d57fade44620fdd58e4e75d77136887a617ce674
Author:     Zac Medico <zachary.medico@sony.com>
AuthorDate: 2019-02-11 19:35:37 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2019-02-11 19:42:07 +0000

    sys-apps/portage: version bump to 2.3.60
    
     #636798 handle lock file removal on NFS
    
    Bug: https://bugs.gentoo.org/636798
    Bug: https://bugs.gentoo.org/671498
    Package-Manager: Portage-2.3.60, Repoman-2.3.12
    Copyright: Sony Interactive Entertainment Inc.
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 sys-apps/portage/Manifest              |   1 +
 sys-apps/portage/portage-2.3.60.ebuild | 271 +++++++++++++++++++++++++++++++++
 2 files changed, 272 insertions(+)
Comment 20 Larry the Git Cow gentoo-dev 2019-02-11 19:46:43 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=16d3d31dbc971ce95d80d6ec0645d6ee92b6baa2

commit 16d3d31dbc971ce95d80d6ec0645d6ee92b6baa2
Author:     Zac Medico <zachary.medico@sony.com>
AuthorDate: 2019-02-06 22:21:47 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2019-02-07 19:00:08 +0000

    locks: handle lock file removal on NFS (bug 636798)
    
    Handle cases where a lock file on NFS has been removed by a concurrent
    process that held the lock earlier. Since stat is not reliable for
    removed files on NFS with the default file attribute cache behavior
    ('ac' mount option), create a temporary hardlink in order to prove
    that the file path exists on the NFS server.
    
    Bug: https://bugs.gentoo.org/636798
    Copyright: Sony Interactive Entertainment Inc.
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/locks.py | 78 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 65 insertions(+), 13 deletions(-)
Comment 21 Zac Medico gentoo-dev 2019-04-10 04:36:02 UTC
Fixed in portage-2.3.62.