I have portage main repository on a NFS share. I went to update a few systems basically at the same time. I ran into the following error. * waiting for lock on /usr/portage/packages/amd64/.Packages.portage_lockfile ... [ ok ] Traceback (most recent call last): File "/usr/lib/python-exec/python3.5/emerge", line 50, in <module> retval = emerge_main() File "/usr/lib64/python3.5/site-packages/_emerge/main.py", line 1250, in emerge_main return run_action(emerge_config) File "/usr/lib64/python3.5/site-packages/_emerge/actions.py", line 2930, in run_action getbinpkgs="--getbinpkg" in emerge_config.opts) File "/usr/lib64/python3.5/site-packages/portage/dbapi/bintree.py", line 530, in populate wantnewlockfile=1) File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 282, in lockfile waiting_msg=waiting_msg, flags=flags) File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 152, in lockfile myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660) File "/usr/lib64/python3.5/site-packages/portage/__init__.py", line 250, in __call__ rval = self._func(*wrapped_args, **wrapped_kwargs) FileNotFoundError: [Errno 2] No such file or directory: b'/usr/portage/packages/amd64/.Packages.portage_lockfile' I guess it needs the lock file to have the same name to prevent like multiple downloads of same file, etc. Though on removal caused others to fail. Seems after waiting for lock, then being able to obtain lock file. Another system removed the lock file, causing the above error on a few other systems. If the lock file is system specific vs portage/repository specific. Ideally it would have a name more unique to the system. Otherwise maybe the lock file needs to have a shared lock. So if other references are still open to the file, it is not removed when another tries to remove it. Not sure exactly on resolution here. Ideally one could run simultaneous merges using a shared portage without such error. If I wait till it starts actually emerging packages before I start the next. I do not run into such. It seems to only occur when more than one emerge process is doing dependency calculation or something. Its way before any downloads, etc.
> myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660) This is telling the system to open or create a file; it seems strange that this call could fail with "no such file or directory". Some possibilities: 1. The parent directory got removed somehow. Not sure how that would happen. 2. Maybe NFS has some racy behavior when you call open(..., O_CREAT).
(In reply to Mike Gilbert from comment #1) > > myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660) > > This is telling the system to open or create a file; it seems strange that > this call could fail with "no such file or directory". > > Some possibilities: > > 1. The parent directory got removed somehow. Not sure how that would happen. Pretty sure it was just the file. I can see about replicating and confirming file is gone. If parent directory was missing I would have major problems, missing binaries. > 2. Maybe NFS has some racy behavior when you call open(..., O_CREAT). I have seen some other oddities which I think are more nfs + python than portage specific. Just triggered by portage. https://bugs.gentoo.org/499528 Though it could be as simple as missing os.O_SHLOCK https://docs.python.org/2/library/os.html#os.O_SHLOCK
Just ran into this again on the same system. I was updating some packages, went to install another and in between packages I was updating, it took out the single emerge process. In the middle of an emerge... Saving to: ‘/usr/portage/distfiles/access-modifier-1.12.tar.gz’ /usr/portage/distfi [ <=> ] 14.78K --.-KB/s in 0.04s 2017-11-12 21:33:10 (364 KB/s) - ‘/usr/portage/distfiles/access-modifier-1.12.tar.gz’ saved [15139] >>> Creating Manifest for /usr/portage/local/os-xtoo/dev-java/access-modifier-annotation * IMPORTANT: 13 news items need reading for repository 'gentoo'. * Use eselect news read to view new items >>> Verifying ebuild manifests >>> Emerging (1 of 1) dev-java/access-modifier-annotation-1.12::os-xtoo >>> Jobs: 0 of 1 complete, 1 running Load avg: 2.29, 1.11, 0.70 Traceback (most recent call last): File "/usr/lib/python-exec/python3.5/emerge", line 50, in <module> retval = emerge_main() File "/usr/lib64/python3.5/site-packages/_emerge/main.py", line 1250, in emerge_main return run_action(emerge_config) File "/usr/lib64/python3.5/site-packages/_emerge/actions.py", line 3297, in run_action retval = action_build(emerge_config, spinner=spinner) File "/usr/lib64/python3.5/site-packages/_emerge/actions.py", line 540, in action_build retval = mergetask.merge() File "/usr/lib64/python3.5/site-packages/_emerge/Scheduler.py", line 1039, in merge rval = self._merge() File "/usr/lib64/python3.5/site-packages/_emerge/Scheduler.py", line 1444, in _merge self._main_loop() File "/usr/lib64/python3.5/site-packages/_emerge/Scheduler.py", line 1422, in _main_loop self._event_loop.iteration() File "/usr/lib64/python3.5/site-packages/portage/util/_eventloop/EventLoop.py", line 333, in iteration if not x.callback(f, event, *x.args): File "/usr/lib64/python3.5/site-packages/portage/util/_async/PipeLogger.py", line 92, in _output_handler self.wait() File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 57, in wait self._wait_hook() File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 175, in _wait_hook self._exit_listener_stack.pop()(self) File "/usr/lib64/python3.5/site-packages/_emerge/SpawnProcess.py", line 173, in _pipe_logger_exit self._async_waitpid() File "/usr/lib64/python3.5/site-packages/_emerge/SubProcess.py", line 113, in _async_waitpid self.pid, self._async_waitpid_cb) File "/usr/lib64/python3.5/site-packages/portage/util/_eventloop/EventLoop.py", line 411, in child_watch_add self._poll_child_processes() File "/usr/lib64/python3.5/site-packages/portage/util/_eventloop/EventLoop.py", line 455, in _poll_child_processes x.callback(x.pid, wait_retval[1], x.data) File "/usr/lib64/python3.5/site-packages/_emerge/SubProcess.py", line 119, in _async_waitpid_cb self.wait() File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 57, in wait self._wait_hook() File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 175, in _wait_hook self._exit_listener_stack.pop()(self) File "/usr/lib64/python3.5/site-packages/_emerge/EbuildPhase.py", line 270, in _ebuild_exit self.wait() File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 57, in wait self._wait_hook() File "/usr/lib64/python3.5/site-packages/_emerge/AsynchronousTask.py", line 175, in _wait_hook self._exit_listener_stack.pop()(self) File "/usr/lib64/python3.5/site-packages/_emerge/EbuildBinpkg.py", line 46, in _package_phase_exit filename=self._binpkg_tmpfile) File "/usr/lib64/python3.5/site-packages/portage/dbapi/bintree.py", line 1090, in inject wantnewlockfile=1) File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 282, in lockfile waiting_msg=waiting_msg, flags=flags) File "/usr/lib64/python3.5/site-packages/portage/locks.py", line 152, in lockfile myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660) File "/usr/lib64/python3.5/site-packages/portage/__init__.py", line 250, in __call__ rval = self._func(*wrapped_args, **wrapped_kwargs) FileNotFoundError: [Errno 2] No such file or directory: b'/usr/portage/packages/amd64/.Packages.portage_lockfile'
I think this is a result of a change to a recent version of portage. I am not experience this in production. I really do not want to experience this on stable systems. I have a few in my dev environment. But even there its time consuming to not be able to update systems at the same time. They have to be updated one by one as the first will take out all others. This will be very frustrating if it makes it to stable as I have many servers that update from binaries made. That will really be time consuming to have to update each one by one. Like I cannot use anisble to run updates pulling in binaries on more than one system at a time. That is not really ideal... If I can help resolve in any way let me know. I am not very familiar with python. Happy to test anything.
There haven't been any recent changes to either the locking code or the related binary package code that calls it. It looks like the filesystem is misbehaving somehow. What client kernel version are you experiencing this with? Maybe the server kernel version also matters. We could mitigate the issue by skipping the lock in cases where there's nothing to update in the $PKGDIR/Packages file.
For the mitigation, we could split out a separate method for everything prior to the self._pkgindex_write(pkgindex) call inside the binarytree._populate method. The new method would be called once without a lock, and if it detects anything to update in $PKGDIR/Packages then it would be called again with the lock acquired.
Server and all clients are 4.13.11-gentoo. Went through and updated all them the other day. I guess that could explain change in behavior. My work flow has been pretty much the same for the last year or so. I work with binaries and multiple simultaneous merges etc. Maybe a python thing? I did switch to 3.5 from 3.4. I was trying to get portage on pypy but did nothing special other than adding that to PYTHON_TARGETS.
Avoid lock when possible (commit message references the wrong bug): https://gitweb.gentoo.org/proj/portage.git/commit/?id=8267445cf2f8697f12f1424ecdb4f495dc19f27f
Thank you! I will see about testing that out ASAP.
The fix is released in portage-2.3.15. I've noticed that it continuously re-writes the Packages file if the MD5 digest is missing for any of the packages, but that can be corrected by running `emaint --fix binhost`.
Presently on 2.3.19 need to update to 2.3.20. Still running into this. Though I finally have a way to replicate and test for fix. Merge two packages on the same or different systems, both making binaries, with portage on NFS mount. That seems to reliably replicate the issue. When both go to generate the tarball of the just compiled package, they seem to collide on lock file and one removes it and causes the other to fail. Same output as first report. If you can put portage on NFS mount, and then do 2 binary builds on different systems, may work on the same system not sure. It maybe tricky to get both to generate the tarball at the same time.
Looks like it happens when one system is creating a bin package, install phase I think. Then on another any merge involving binaries. That seems to be another way to replicate. Easier to time, just start a binary merge during install of a from source binary build. Not sure NFS even matters, but havne't tried it on the same system.
(In reply to William L. Thomson Jr. from comment #0) > I have portage main repository on a NFS share. I went to update a few > myfd = os.open(lockfilename, os.O_CREAT|os.O_RDWR, 0o660) > File "/usr/lib64/python3.5/site-packages/portage/__init__.py", line 250, > in __call__ > rval = self._func(*wrapped_args, **wrapped_kwargs) > FileNotFoundError: [Errno 2] No such file or directory: > b'/usr/portage/packages/amd64/.Packages.portage_lockfile' The FileNotFoundError is strange because we're using the O_CREAT flag which is supposed to create the file automatically. This may be an nfs quirk or bug. I'm able to reproduce this using a test case with 2 processes competing for the same lock.
Created attachment 563766 [details] test case There are 2 failure modes, one is the FileNotFoundError and the other is the AssertionError which is raised when locked contend is overwritten.
(In reply to Zac Medico from comment #14) > Created attachment 563766 [details] > test case > > There are 2 failure modes, one is the FileNotFoundError and the other is the > AssertionError which is raised when locked contend is overwritten. If I add retry to the O_CREAT open call, it fixes the FileNotFoundError, but the AssertionError case is still triggered. On the same nfs share, hardlink_lockfile/unhardlink_lockfile works reliably with the same test case.
The test case works reliably for me on nfs if I replace portage's lockfile/unlockfile implementation with simplified implementations like these which never remove the lock file: > @contextmanager > def flock(lock_path): > with open(lock_path + '.lockfile', 'wb+') as f: > try: > fcntl.flock(f, fcntl.LOCK_EX) > yield > finally: > pass > > > @contextmanager > def lockf(lock_path): > with open(lock_path + '.lockfile', 'wb+') as f: > try: > fcntl.lockf(f, fcntl.LOCK_EX) > yield > finally: > pass So, there must be some flaw in the handling of removed lock files.
I've found that there's a race involving removal of lock files. After a lock is acquired, there is a window of time were it's not possible to know whether the file has actually been removed. Due to the ac mount option, comparison of stat/fstat results is not reliable during this time window. Meanwhile, the "hardlink" locking approach appears to be completely reliable.
Patch posted for review: https://archives.gentoo.org/gentoo-portage-dev/message/31ee552214d34c3832c0ef62526c658d https://github.com/gentoo/portage/pull/408
The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=d57fade44620fdd58e4e75d77136887a617ce674 commit d57fade44620fdd58e4e75d77136887a617ce674 Author: Zac Medico <zachary.medico@sony.com> AuthorDate: 2019-02-11 19:35:37 +0000 Commit: Zac Medico <zmedico@gentoo.org> CommitDate: 2019-02-11 19:42:07 +0000 sys-apps/portage: version bump to 2.3.60 #636798 handle lock file removal on NFS Bug: https://bugs.gentoo.org/636798 Bug: https://bugs.gentoo.org/671498 Package-Manager: Portage-2.3.60, Repoman-2.3.12 Copyright: Sony Interactive Entertainment Inc. Signed-off-by: Zac Medico <zmedico@gentoo.org> sys-apps/portage/Manifest | 1 + sys-apps/portage/portage-2.3.60.ebuild | 271 +++++++++++++++++++++++++++++++++ 2 files changed, 272 insertions(+)
The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=16d3d31dbc971ce95d80d6ec0645d6ee92b6baa2 commit 16d3d31dbc971ce95d80d6ec0645d6ee92b6baa2 Author: Zac Medico <zachary.medico@sony.com> AuthorDate: 2019-02-06 22:21:47 +0000 Commit: Zac Medico <zmedico@gentoo.org> CommitDate: 2019-02-07 19:00:08 +0000 locks: handle lock file removal on NFS (bug 636798) Handle cases where a lock file on NFS has been removed by a concurrent process that held the lock earlier. Since stat is not reliable for removed files on NFS with the default file attribute cache behavior ('ac' mount option), create a temporary hardlink in order to prove that the file path exists on the NFS server. Bug: https://bugs.gentoo.org/636798 Copyright: Sony Interactive Entertainment Inc. Signed-off-by: Zac Medico <zmedico@gentoo.org> lib/portage/locks.py | 78 +++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 65 insertions(+), 13 deletions(-)
Fixed in portage-2.3.62.