Whenever a catalyst run is started using `catalyst -f stage-foo.conf`, the snapcache is "refreshed" by unpacking it from the portage snapshot tarball and then bind mounting the snapcache inside the chroot before beginning emerge. During this period, a lock file is created:

    /var/tmp/catalyst/snapshot_cache/.catalyst_lock

That code can be found in modules/generic_stage_target.py, where we have

    if unpack:
        if "SNAPCACHE" in self.settings:
            self.snapshot_lock_object.write_lock()
        ....
        if "SNAPCACHE" in self.settings:
            self.snapshot_lock_object.unlock()

If a second instance of catalyst is started before the "refreshing" is done, an exception is raised on the locked file. This seems like a bad design, since we'd like to 1) do only the bind mount, rather than unpack the portage snapshot tarball again, when multiple instances of catalyst run together, and 2) not have to refresh the snapcache on each run if it's not needed.

A better design might be to 1) add the creation of the snapcache to `catalyst -s current` and 2) supply a flag which turns off the refreshing of the snapcache with each run of catalyst.

I can supply a patch, but what are your thoughts before I do this?
This all seems correct to me.

You can run two catalyst instances at once no problem; you just can't have the second one start while the first one is unpacking the snapshot. Once the snapshot is unpacked this works fine. I don't think this is a bug: the lock prevents you from trying to unpack the snapcache while another catalyst is already doing it.

I see no bug here.
(In reply to Rick Farina (Zero_Chaos) from comment #1)
> This all seems correct to me.
>
> You can run two catalyst instances at once no problem, you just can't have the second on start while the first one is unpacking the snapshot. Once the snapshot is unpacked this works fine. I don't think this is a bug, the lock prevents you from trying to unpack the snapcache while another catalyst is already doing it.
>
> I see no bug here.

That's not what I'm saying. Of course you can't have the second start while the first is unpacking. My point is that the snapshot unpacking should be separated from the stage build. With snapshot_cache on, we currently have:

    catalyst -s current       # creates $storedir/snapshots
    catalyst -f stage1.conf   # unpacks snapshot to $storedir/snapshot_cache, unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks snapshot to $storedir/snapshot_cache, unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks snapshot to $storedir/snapshot_cache, unpacks stage, bind mounts snapshot_cache

If the snapshot hasn't changed, then "unpacks snapshot to $storedir/snapshot_cache" is redundant. I suggest:

    catalyst -s current       # creates $storedir/snapshots
    catalyst -S               # unpacks snapshot to $storedir/snapshot_cache
    catalyst -f stage1.conf   # unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks stage, bind mounts snapshot_cache
    catalyst -f stage2.conf   # unpacks stage, bind mounts snapshot_cache

This avoids redundant work and isolates the critical region during which the snapshot_cache is created.
(In reply to Anthony Basile from comment #0)
> 2) not have to refresh the snapcache each time if its not needed.

Actually, an even better design is to place a timestamp file in there and check the timestamp. If it hasn't changed, then don't unpack again. This reduces the critical region to just the time it takes to check the timestamp.
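Roughly, I have something like this in mind (just a sketch; the stamp file and helper names are made up for illustration and are not anything in catalyst today):

    import os

    # Hypothetical helpers, not current catalyst code: record the snapshot
    # tarball's mtime inside the snapcache when it is unpacked, and skip the
    # unpack next time if that recorded value still matches the tarball.

    def snapcache_is_current(snapshot_tarball, cache_dir):
        stamp = os.path.join(cache_dir, ".unpacked_mtime")
        if not os.path.isfile(stamp):
            return False
        with open(stamp) as f:
            recorded = f.read().strip()
        return recorded == str(os.path.getmtime(snapshot_tarball))

    def record_unpack(snapshot_tarball, cache_dir):
        with open(os.path.join(cache_dir, ".unpacked_mtime"), "w") as f:
            f.write(str(os.path.getmtime(snapshot_tarball)))

The unpack step would then only run, while holding the lock, when snapcache_is_current() returns False, followed by record_unpack().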
(In reply to Anthony Basile from comment #3)
> (In reply to Anthony Basile from comment #0)
> > 2) not have to refresh the snapcache each time if its not needed.
>
> Actually, an even better design is to place a timestamp file in there and check the timestamp. If it hasn't changed, then don't unpack again. This reduces the critical region to just the time it takes to check the timestamp.

Yes, and as far as I'm aware it already does something like this. It checks the snapcache to see if it is still valid (I don't remember how), and if it is, it doesn't unpack. Because you started catalyst twice at nearly the same time, the first was still unpacking the snapcache when the second determined it was invalid, etc.

Again, I see no bug here.
(In reply to Rick Farina (Zero_Chaos) from comment #4)
> (In reply to Anthony Basile from comment #3)
> > (In reply to Anthony Basile from comment #0)
> > > 2) not have to refresh the snapcache each time if its not needed.
> >
> > Actually, an even better design is to place a timestamp file in there and check the timestamp. If it hasn't changed, then don't unpack again. This reduces the critical region to just the time it takes to check the timestamp.
>
> Yes, and as far as I'm aware it already does something like this. It checks the snapcache to see if it is still valid, I don't remember how, and then it doesn't unpack. Because you started catalyst twice at nearly the same time the first was still unpacking the snapcache and the second determined it was invalid, etc.
>
> Again, I see no bug here.

Okay, I'll look for the checking code. Still, you shouldn't crash on a busy resource; you should wait on the lock.
(In reply to Anthony Basile from comment #5)
> Okay I'll look for the checking code. Still, you shouldn't crash on a busy resource, you should wait on the lock.

Yes and no. What if the first catalyst crashes while the second is waiting? It will wait forever...
(In reply to Rick Farina (Zero_Chaos) from comment #6)
> (In reply to Anthony Basile from comment #5)
> > Okay I'll look for the checking code. Still, you shouldn't crash on a busy resource, you should wait on the lock.
>
> Yes and no. What if the first catalyst crashes while the second is waiting? it will wait forever...

It's similar to running two emerges together where one locks the other. These sorts of situations can be dealt with.
(In reply to Rick Farina (Zero_Chaos) from comment #6)
> Yes and no. What if the first catalyst crashes while the second is waiting? it will wait forever...

That's just a matter of common sense: the crashing process needs to be supervised by a stable process that sets and unsets the locks, so it can be cleaned up after. Why would catalyst itself crash in that case?
Created attachment 382972 [details]
POC: wait and recognize if inherited state is clean

Here's a little POC to demonstrate how one instance of `catalyst -f stage_B.conf` can use a lock file to wait while another instance of `catalyst -f stage_A.conf` is unpacking the snapcache. Also, if instance A crashes, instance B knows that it is inheriting an unclean state.

Start up ./test_lock_crash.py in one shell, then quickly start it up in another, crash the first, and see what happens. Try all combinations of starting one while the other is still going, and crashing and not crashing. You can even just start up one instance, crash it, and start it up again. It knows that it's inheriting an unclean state from the previous time it ran.
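For anyone skimming instead of downloading the attachment, the core of the idea looks roughly like this (a stripped-down sketch, not the attached script itself; the paths and messages are only illustrative):

    import fcntl, os, sys

    LOCKFILE = "/var/tmp/catalyst/snapshot_cache/.catalyst_lock"   # illustrative
    DIRTY    = "/var/tmp/catalyst/snapshot_cache/.catalyst_dirty"  # "work in progress" marker

    lock_fd = open(LOCKFILE, "w")
    # Block here instead of raising: a second instance simply waits its turn.
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    try:
        if os.path.exists(DIRTY):
            # The previous holder died before finishing; the snapcache is suspect.
            sys.stderr.write("inheriting an unclean state, forcing a refresh\n")
        open(DIRTY, "w").close()       # mark the critical region as entered
        # ... unpack the snapshot into the snapcache here ...
        os.unlink(DIRTY)               # clean exit: clear the marker
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        lock_fd.close()

If instance A is killed inside the critical region, the kernel drops the flock but the marker file survives, which is how instance B knows that the state it inherited is unclean.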
(In reply to Anthony Basile from comment #9)
> Created attachment 382972 [details]
> POC: wait and recognize if inherited state is clean
>
> Here's a little poc to demonstrate how one instance of `catalyst -f stage_B.conf` can use a lock file to wait while another instance of `catalyst -f stage_A.conf` is unpacking the snapcache. Also, if instance A crashes, instance B knows that it is inheriting an unclean state.
>
> Start up ./test_lock_crash.py in one shell, then quickly start it up in another, crash the first and see what happens. Try all combinations of starting one while the other is still going and crashing and not crashing. You can even just start up one instance, crash it, and start it up again. It knows that its inheriting an unclean state from the previous time it ran.

I'm certainly no expert in locking, and as you clearly demonstrate, you can do this properly. Great: make a patch for catalyst to do it, and have at least one team member (or one of our useful users) who can read this level of Python ack it. I'm all for the idea if it's done right.
Anthony, I think the unpack-phase-only lock would be good to implement. I also think it would be prudent to add a timeout setting in the spec file that users can set to adjust for their machine's speed. That way it won't be stuck forever and won't involve other monitoring processes. If a run sees the lock in place, it waits until either the timeout is reached (and it exits) or the lock is cleared (and operation resumes).

The whole lock code needs some thorough scrutiny and cleanup. If you wish to take that on, we would all be grateful.
(In reply to Brian Dolbec from comment #11)
> Anthony, I think the unpack phase only lock would be good to implement. I also think it would be prudent to add a timeout setting in the spec file that can be set by users to adjust for their machine's speed. That way it won't be stuck forever and won't involve other monitoring processes. If a run sees the lock in place, it waits to the maximum of the timeout (& exits) or the lock is cleared (& operation resumes).
>
> The whole lock code needs some thorough scrutiny and cleanup. If you wish to take that on, we would all be grateful.

This sounds like at least two commits:

1) The locking code based on the above.
2) A configurable timeout. hard_lock() in class LockDir has that, but nowhere is it actually used. You'd need something similar in fcntl_lock().
Created attachment 383972 [details]
POC: timeout if you wait on a lock too long

Okay, the values are hard-coded (this is just a POC), but now the second instance will only wait 5 seconds before giving up. The next step is to see how to insert all this into the catalyst code.
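The waiting-with-a-deadline part boils down to a non-blocking flock in a retry loop, something like the sketch below (names are illustrative, not the final catalyst patch; the 5-second default mirrors the hard-coded POC value and could later come from the spec-file setting Brian suggested):

    import errno, fcntl, time

    def lock_with_timeout(path, timeout=5.0, poll=0.5):
        # Try to take an exclusive lock on 'path'; give up after 'timeout' seconds.
        fd = open(path, "w")
        deadline = time.time() + timeout
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd  # caller releases with fcntl.flock(fd, fcntl.LOCK_UN)
            except (IOError, OSError) as e:
                if e.errno not in (errno.EACCES, errno.EAGAIN):
                    raise
                if time.time() >= deadline:
                    fd.close()
                    raise RuntimeError("gave up waiting for %s after %s seconds"
                                       % (path, timeout))
                time.sleep(poll)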
I'll take this on, as my ARM stages use a shared snapcache to save on disk space. Glad to find the bug, as we'd had discussions in -releng on IRC about moving this forward (specifically myself and Zero_Chaos). Ack to iamben for bringing it to my attention! :D
See also https://wiki.gentoo.org/wiki/User:Veremit/Catalyst_upgrades#Fix_snapshot.2Fsnapcache_locking
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=6ef7061d6e1f7ba02cdacb98636a9abbfeed26f0

commit 6ef7061d6e1f7ba02cdacb98636a9abbfeed26f0
Author:     Matt Turner <mattst88@gentoo.org>
AuthorDate: 2021-06-11 04:15:15 +0000
Commit:     Matt Turner <mattst88@gentoo.org>
CommitDate: 2021-06-11 04:21:40 +0000

    dev-util/catalyst: Version bump to 3.0.19

    Bug: https://bugs.gentoo.org/519656
    Closes: https://bugs.gentoo.org/791583
    Signed-off-by: Matt Turner <mattst88@gentoo.org>

 dev-util/catalyst/Manifest               |  1 +
 dev-util/catalyst/catalyst-3.0.19.ebuild | 71 ++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)
https://gitweb.gentoo.org/proj/catalyst.git/commit/?h=catalyst-3.0-stable&id=9221e327aecfd8585cbb9796add9224123d1c53a

> For the catalyst-3.0-stable branch, just remove all the snapcache locking. It's broken (bug #519656) and no one is going to fix it, since snapcache has been removed from the master branch.