Whenever a catalyst run is started using `catalyst -f stage-foo.conf`, the snapcache is "refreshed" by unpacking it from the portage snapshot tarball and then bind mounting the snapcache inside the chroot before beginning emerge. During this period, a lock file is created:

    /var/tmp/catalyst/snapshot_cache/.catalyst_lock

That code can be found in modules/generic_stage_target.py, where we have

    if unpack:
        if "SNAPCACHE" in self.settings:
            self.snapshot_lock_object.write_lock()
        ....
        if "SNAPCACHE" in self.settings:
            self.snapshot_lock_object.unlock()

If a second instance of catalyst is started before the "refreshing" is done, an exception is raised on the locked file. This seems like a bad design, since we'd like to 1) do only the bind mount, rather than unpack the portage snapshot tarball again, when multiple instances of catalyst run together, and 2) not have to refresh the snapcache on each run if it's not needed.

A better design might be to 1) add the creation of the snapcache to `catalyst -s current` and 2) supply a flag which turns off the refreshing of the snapcache with each run of catalyst.

I can supply a patch, but what are your thoughts before I do this?
This all seems correct to me.

You can run two catalyst instances at once no problem; you just can't have the second one start while the first one is unpacking the snapshot. Once the snapshot is unpacked this works fine. I don't think this is a bug: the lock prevents you from trying to unpack the snapcache while another catalyst is already doing it.

I see no bug here.
(In reply to Rick Farina (Zero_Chaos) from comment #1)
> This all seems correct to me.
>
> You can run two catalyst instances at once no problem, you just can't have the second on start while the first one is unpacking the snapshot. Once the snapshot is unpacked this works fine. I don't think this is a bug, the lock prevents you from trying to unpack the snapcache while another catalyst is already doing it.
>
> I see no bug here.

That's not what I'm saying. Of course you can't have the second start while the first is unpacking. My point is that the snapshot unpacking should be separated from the stage build. With snapshot_cache on, we currently have:

    catalyst -s current       # creates $storedir/snapshots
    catalyst -f stage1.conf   # unpacks snapshot to $storedir/snapshot_cache, unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks snapshot to $storedir/snapshot_cache, unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks snapshot to $storedir/snapshot_cache, unpacks stage, bind mounts snapshot_cache

If the snapshot hasn't changed, then "unpacks snapshot to $storedir/snapshot_cache" is redundant. I suggest:

    catalyst -s current       # creates $storedir/snapshots
    catalyst -S               # unpacks snapshot to $storedir/snapshot_cache
    catalyst -f stage1.conf   # unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks stage, bind mounts snapshot_cache

This avoids redundant work and isolates the critical region during which the snapshot_cache is created.
(In reply to Anthony Basile from comment #0)
> 2) not have to refresh the snapcache each time if its not needed.

Actually, an even better design is to place a timestamp file in there and check the timestamp. If it hasn't changed, then don't unpack again. This reduces the critical region to just the time it takes to check the timestamp.
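Roughly, I have something like this in mind (just a sketch; the stamp file and helper names are made up for illustration and are not anything in catalyst today):

    import os

    # Hypothetical helpers, not current catalyst code: record the snapshot
    # tarball's mtime inside the snapcache when it is unpacked, and skip the
    # unpack next time if that recorded value still matches the tarball.

    def snapcache_is_current(snapshot_tarball, cache_dir):
        stamp = os.path.join(cache_dir, ".unpacked_mtime")
        if not os.path.isfile(stamp):
            return False
        with open(stamp) as f:
            recorded = f.read().strip()
        return recorded == str(os.path.getmtime(snapshot_tarball))

    def record_unpack(snapshot_tarball, cache_dir):
        with open(os.path.join(cache_dir, ".unpacked_mtime"), "w") as f:
            f.write(str(os.path.getmtime(snapshot_tarball)))

The unpack step would then only run, while holding the lock, when snapcache_is_current() returns False, followed by record_unpack().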
(In reply to Anthony Basile from comment #3)
> (In reply to Anthony Basile from comment #0)
> > 2) not have to refresh the snapcache each time if its not needed.
>
> Actually, an even better design is to place a timestamp file in there and check the timestamp. If it hasn't changed, then don't unpack again. This reduces the critical region to just the time it takes to check the timestamp.

Yes, and as far as I'm aware it already does something like this. It checks the snapcache to see if it is still valid (I don't remember how), and if it is, it doesn't unpack. Because you started catalyst twice at nearly the same time, the first was still unpacking the snapcache when the second determined it was invalid, etc.

Again, I see no bug here.
(In reply to Rick Farina (Zero_Chaos) from comment #4)
> (In reply to Anthony Basile from comment #3)
> > (In reply to Anthony Basile from comment #0)
> > > 2) not have to refresh the snapcache each time if its not needed.
> >
> > Actually, an even better design is to place a timestamp file in there and check the timestamp. If it hasn't changed, then don't unpack again. This reduces the critical region to just the time it takes to check the timestamp.
>
> Yes, and as far as I'm aware it already does something like this. It checks the snapcache to see if it is still valid, I don't remember how, and then it doesn't unpack. Because you started catalyst twice at nearly the same time the first was still unpacking the snapcache and the second determined it was invalid, etc.
>
> Again, I see no bug here.

Okay, I'll look for the checking code. Still, you shouldn't crash on a busy resource; you should wait on the lock.
(In reply to Anthony Basile from comment #5)
> Okay I'll look for the checking code. Still, you shouldn't crash on a busy resource, you should wait on the lock.

Yes and no. What if the first catalyst crashes while the second is waiting? It will wait forever...
(In reply to Rick Farina (Zero_Chaos) from comment #6)
> (In reply to Anthony Basile from comment #5)
> > Okay I'll look for the checking code. Still, you shouldn't crash on a busy resource, you should wait on the lock.
>
> Yes and no. What if the first catalyst crashes while the second is waiting? it will wait forever...

It's similar to running two emerges together where one locks the other. These sorts of situations can be dealt with.
(In reply to Rick Farina (Zero_Chaos) from comment #6)
> Yes and no. What if the first catalyst crashes while the second is waiting? it will wait forever...

That's just a matter of common sense: the crashing process needs to be supervised by a stable process that sets and unsets the locks, so it can be cleaned up after. Why would catalyst itself crash in that case?
Created attachment 382972 [details]
POC: wait and recognize if inherited state is clean

Here's a little POC to demonstrate how one instance of `catalyst -f stage_B.conf` can use a lock file to wait while another instance of `catalyst -f stage_A.conf` is unpacking the snapcache. Also, if instance A crashes, instance B knows that it is inheriting an unclean state.

Start up ./test_lock_crash.py in one shell, then quickly start it up in another, crash the first, and see what happens. Try all combinations of starting one while the other is still going, and crashing and not crashing. You can even just start up one instance, crash it, and start it up again. It knows that it's inheriting an unclean state from the previous time it ran.
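For anyone skimming instead of downloading the attachment, the core of the idea looks roughly like this (a stripped-down sketch, not the attached script itself; the paths and messages are only illustrative):

    import fcntl, os, sys

    LOCKFILE = "/var/tmp/catalyst/snapshot_cache/.catalyst_lock"   # illustrative
    DIRTY    = "/var/tmp/catalyst/snapshot_cache/.catalyst_dirty"  # "work in progress" marker

    lock_fd = open(LOCKFILE, "w")
    # Block here instead of raising: a second instance simply waits its turn.
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    try:
        if os.path.exists(DIRTY):
            # The previous holder died before finishing; the snapcache is suspect.
            sys.stderr.write("inheriting an unclean state, forcing a refresh\n")
        open(DIRTY, "w").close()       # mark the critical region as entered
        # ... unpack the snapshot into the snapcache here ...
        os.unlink(DIRTY)               # clean exit: clear the marker
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        lock_fd.close()

If instance A is killed inside the critical region, the kernel drops the flock but the marker file survives, which is how instance B knows that the state it inherited is unclean.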
(In reply to Anthony Basile from comment #9)
> Created attachment 382972 [details]
> POC: wait and recognize if inherited state is clean
>
> Here's a little poc to demonstrate how one instance of `catalyst -f stage_B.conf` can use a lock file to wait while another instance of `catalyst -f stage_A.conf` is unpacking the snapcache. Also, if instance A crashes, instance B knows that it is inheriting an unclean state.
>
> Start up ./test_lock_crash.py in one shell, then quickly start it up in another, crash the first and see what happens. Try all combinations of starting one while the other is still going and crashing and not crashing. You can even just start up one instance, crash it, and start it up again. It knows that its inheriting an unclean state from the previous time it ran.

I'm certainly no expert in locking, and as you clearly demonstrate, you can do this properly. Great: make a patch for catalyst to do it, and have at least one team member (or one of our useful users) who can read this level of Python ack it. I'm all for the idea if it's done right.
Anthony, I think the unpack-phase-only lock would be good to implement. I also think it would be prudent to add a timeout setting in the spec file that users can set to adjust for their machine's speed. That way it won't be stuck forever and won't involve other monitoring processes. If a run sees the lock in place, it waits until either the timeout is reached (and it exits) or the lock is cleared (and operation resumes).

The whole lock code needs some thorough scrutiny and cleanup. If you wish to take that on, we would all be grateful.
(In reply to Brian Dolbec from comment #11)
> Anthony, I think the unpack phase only lock would be good to implement. I also think it would be prudent to add a timeout setting in the spec file that can be set by users to adjust for their machine's speed. That way it won't be stuck forever and won't involve other monitoring processes. If a run sees the lock in place, it waits to the maximum of the timeout (& exits) or the lock is cleared (& operation resumes).
>
> The whole lock code needs some thorough scrutiny and cleanup. If you wish to take that on, we would all be grateful.

This sounds like at least two commits:

1) The locking code based on the above.
2) A configurable timeout. hard_lock() in class LockDir has that, but nowhere is it actually used. You'd need something similar in fcntl_lock().
Created attachment 383972 [details]
POC: timeout if you wait on a lock too long

Okay, the values are hard-coded (this is just a POC), but now the second instance will only wait 5 seconds before giving up. The next step is to see how to insert all this into the catalyst code.
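The waiting-with-a-deadline part boils down to a non-blocking flock in a retry loop, something like the sketch below (names are illustrative, not the final catalyst patch; the 5-second default mirrors the hard-coded POC value and could later come from the spec-file setting Brian suggested):

    import errno, fcntl, time

    def lock_with_timeout(path, timeout=5.0, poll=0.5):
        # Try to take an exclusive lock on 'path'; give up after 'timeout' seconds.
        fd = open(path, "w")
        deadline = time.time() + timeout
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd  # caller releases with fcntl.flock(fd, fcntl.LOCK_UN)
            except (IOError, OSError) as e:
                if e.errno not in (errno.EACCES, errno.EAGAIN):
                    raise
                if time.time() >= deadline:
                    fd.close()
                    raise RuntimeError("gave up waiting for %s after %s seconds"
                                       % (path, timeout))
                time.sleep(poll)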
I'll take this on, as my ARM stages use a shared snapcache to save on disk space. Glad to find the bug, as we'd had discussions in -releng on IRC about moving this forward (specifically myself and Zero_Chaos). Ack to iamben for bringing it to my attention! :D
See also https://wiki.gentoo.org/wiki/User:Veremit/Catalyst_upgrades#Fix_snapshot.2Fsnapcache_locking
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=6ef7061d6e1f7ba02cdacb98636a9abbfeed26f0

commit 6ef7061d6e1f7ba02cdacb98636a9abbfeed26f0
Author:     Matt Turner <mattst88@gentoo.org>
AuthorDate: 2021-06-11 04:15:15 +0000
Commit:     Matt Turner <mattst88@gentoo.org>
CommitDate: 2021-06-11 04:21:40 +0000

    dev-util/catalyst: Version bump to 3.0.19

    Bug: https://bugs.gentoo.org/519656
    Closes: https://bugs.gentoo.org/791583
    Signed-off-by: Matt Turner <mattst88@gentoo.org>

 dev-util/catalyst/Manifest               |  1 +
 dev-util/catalyst/catalyst-3.0.19.ebuild | 71 ++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)
https://gitweb.gentoo.org/proj/catalyst.git/commit/?h=catalyst-3.0-stable&id=9221e327aecfd8585cbb9796add9224123d1c53a

> For the catalyst-3.0-stable branch, just remove all the snapcache locking. It's broken (bug #519656) and no one is going to fix it, since snapcache has been removed from the master branch.