Bug 118418

Summary:	[baselayout] Locking functionality - implementation attached
Product:	Gentoo Linux	Reporter:	Oldrich Jedlicka <oldium.pro>
Component:	[OLD] baselayout	Assignee:	Gentoo's Team for Core System packages <base-system>
Status:	VERIFIED NEEDINFO
Severity:	enhancement	CC:	dsd, uberlord
Priority:	High
Version:	unspecified
Hardware:	All
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	Locking implementation Removes race conditions from runscript.sh Locking implementation, updated Removes race conditions using locking

Description Oldrich Jedlicka 2006-01-09 08:19:37 UTC

I created a locking functionality, if somebody is interrested. It works completely in BASH and it is completely safe to use it - if the shell terminates with Ctrl-C or a subshell is leaved, locks are released automatically. The only thing is to clear the locks directory at the boot.

Example:

lock "name"
... do something exclusively
(
    lock "name"
    ... do something and forget to unlock
)
... here is the lock from subshell already released
... main lock is still active
unlock "name"
... here is no lock active

Other useful examples and functionality is described in attached file.

Comment 1 Oldrich Jedlicka 2006-01-09 08:20:41 UTC

Created attachment 76646 [details]
Locking implementation

Comment 2 Oldrich Jedlicka 2006-01-09 08:40:48 UTC

This is helpful, because since now there was no real locking in rc scripts. My implementation introduces three functions:

* lock
* trylock
* unlock

If you need to do some cleanup, you can call add_cleanup_function - in case that the lock() function is interrupted by Ctrl-C. This is not documented in examples, but it is easy to understand.

I was inspired to do this during my investigation of a problem described in bug #118419.

Comment 3 Roy Marples (RETIRED) gentoo-dev

2006-01-09 09:04:24 UTC

We're currently using fifo's to lock files when called from rc - or rc-services.sh start_service function (baselayout-1.12 only).

We have a similar system used by runscript.sh which every service uses, but uses symlinks in starting/started/stopping/inactive directories instead. Your script is similar to this. Checkout /lib/rcscripts/sh/rc-services.sh (all baselayout versions).

Comment 4 Oldrich Jedlicka 2006-01-09 09:15:14 UTC

Actually this is not a good solution with a race condition:

if service_starting "${myservice}" ; then
    ...
elif service_stopping "${myservice}" ; then
    ...
elif service_inactive "${myservice}" ; then
    ...
elif service_started "${myservice}" ; then
    ...
fi

What if the system switches to other task in the middle of service_* checking? There should be one lock at the beginning and unlock at the end.

Comment 5 Oldrich Jedlicka 2006-01-09 09:45:13 UTC

There are several points in svc_start that can get wrong without locking. The only thing using locking is start_service/stop_service (using begin_service), which is clearly not enough.

What I'm saying is that the current implementation has several holes and can be improved. Also the mentioned bug shows the need for some better locking handling - I remember warnings during parallel boot-up like "Service already started". And imagine that the coldplug is starting some networking service in the background while they are in the default runlevel...

Comment 6 Roy Marples (RETIRED) gentoo-dev

2006-01-09 15:12:27 UTC

Created attachment 76673 [details, diff]
Removes race conditions from runscript.sh

Comment 7 Roy Marples (RETIRED) gentoo-dev

2006-01-09 15:15:39 UTC

OK, this is a preliminary patch to remove the race conditions from runscript.sh

The only caveat is that if something is starting, we must still allow stopping. But we don't allow starting if it's stopping.

Comment 8 Oldrich Jedlicka 2006-01-10 00:50:37 UTC

Check for the whole status has to be atomic to remove (one) race condition. Another thing is that the work suggesting some status up to the setting of different status should be under a lock, just to be sure that the whole job is not influenced by other threads - example is changing status from nothing/stopped into starting: checking for the status and marking the service "starting" should be done under lock, the real starting can be done without any lock (as the status is already marked correctly). 

This way you should be able to stop already starting service, because it will not be under lock. Then the "starting" thread should check atomically the status for "stopping"/"stopped" before changing the status to "started"/"failed".

And one example from the top of your patch (one TAB removed):

# Do not try to stop if it had already failed to do so on runlevel change
if is_runlevel_stop && service_failed "${myservice}" ; then
	return 1
fi
+
+if service_stopped "${myservice}" ; then
+	ewarn "WARNING:  \"${myservice}\" has not yet been started."
+	return 0
+fi

Service can be marked as failed after you checked for the failing status. There is no way how to do this correctly without some kind of locking. Do not waste your valuable time with reordering things, it cannot help in this case. Either leave it as it is or use "real" locking - my implementation or any other simpler/more complex implementation.

My suggestions are:

1) Accept and test some kind of locking functionality to fit all needs
2) Improve the status checking and changing.

Without 1) there is always some situation like in the example from the patch.

Comment 9 Roy Marples (RETIRED) gentoo-dev

2006-01-10 03:30:15 UTC

(In reply to comment #8)
> Check for the whole status has to be atomic to remove (one) race condition.
> Another thing is that the work suggesting some status up to the setting of
> different status should be under a lock, just to be sure that the whole job is
> not influenced by other threads - example is changing status from
> nothing/stopped into starting: checking for the status and marking the service
> "starting" should be done under lock, the real starting can be done without any
> lock (as the status is already marked correctly).

Actually, that's done just to provide users with nice messages.
The actual lock happens with this

Starting lock

if ! mark_service_starting "${myservice}" ; then
    ewarn "WARNING: \"${myservice}\" is already starting."
    return 0
fi

Stopping lock

if ! mark_service_stopping "${myservice}" ; then
    ewarn "WARNING:  \"${myservice}\" is already stopping."
    return 0
fi
# Lock service starting too ...
mark_service_starting "${myservice}"

As you can see, there is still the possibility to race, but it is now much smaller. I cannot see how to avoid this as we need to be able to stop a starting service though ......

Comment 10 Oldrich Jedlicka 2006-01-10 04:33:40 UTC

For service starting, the (very simple) schema can be as follows:

    acquire status lock
    check status of service
    if status is A or B or ... then
      release status lock
      write warning
      return
    end if

    set status to "starting"
    release status lock

    do some initialization job

    acquire status lock
    check status of service
    if status is "starting" then
      set status to "started" or "failed"
      # All other status change was made by someone else
    end if
    release status lock

As the lock is needed only for status changing, the rest (initialization job) can be done completely without locking. And you can use the same locks in mark_service_*/service_* - just to be sure that somebody didn't forget to acquire the status lock. My implementation allows multiple locks and as it is a named lock, the "acquire status lock" can be tied closely to one service.

I can look at it more closely in the evening (in 5-6 hours), this was just for a discussion.

Comment 11 Oldrich Jedlicka 2006-01-11 00:12:05 UTC

Created attachment 76813 [details]
Locking implementation, updated

Fixed and updated rc-locking.sh for use in rc scripts.

Comment 12 Oldrich Jedlicka 2006-01-11 00:17:50 UTC

Created attachment 76815 [details, diff]
Removes race conditions using locking

Removes race conditions from /lib/rcscripts/sh/rc-services.sh and /sbin/runscript.sh. Changed files are also /etc/conf.d/rc and /sbin/rc.

Comment 13 Oldrich Jedlicka 2006-01-11 00:45:02 UTC

The last patch does not remove all race conditions. There is one situation:

1) runscript.sh:svc_stop() in dependency checking checks for !service_started and then calls stop_service (without any lock).
2) rc-services.sh:stop_service() checks, if the service is stopped/stopping. If not, begin_service is called (in lock) and "/etc/init.d/{service} stop" is started (without any lock).
3) runscript.sh:svc_stop() checks status and marks service as stopping (in lock, that's ok).

Notes for steps:

1) The lock has to be released before the call to stop_service, or stop_service has to release the lock itself. This will not fix the main thing - see note for step 2).
2) There is no status update before the "/etc/init.d/{service} stop" is called. Only lock is acquired to do it only once (with begin_service).

There is a race condition when stop_service starts "/etc/init.d/{service} stop". Automatical multiple starting (by rc script) is prohibited by begin_service lock, but there can be manual action stopping the same script just a while before. The warning "ERROR... is stopping" can still be reached. Same for starting.

Small note: begin_service() can be replaced by trylock() and wait_service() can be replaced by lock()+unlock(). end_service() can be omitted, if begin_service() and "/etc/init.d/{service} stop" are called in the same new subshell.

Comment 14 Oldrich Jedlicka 2006-01-11 01:26:18 UTC

Comment on attachment 76815 [details, diff]
Removes race conditions using locking

Patch is against baselayout 1.12.0_pre13-r1

Comment 15 Oldrich Jedlicka 2006-01-12 04:58:52 UTC

To get rid of the warning there can be solution: "/etc/init.d/{service} --quiet stop" (disable only the warnings that the service is already started/stopped/...). The script rc-services.sh can use this parameter to disable the warning. And everything else should work just fine :-)

Comment 16 Oldrich Jedlicka 2006-01-16 02:12:22 UTC

I think that the concept of begin_service is not completely ok. It should be moved into runscript.sh or not used at all:

1) It correctly prohibits simultaneous executions of the script only for dependencies (as they are started with start_service), not for the script itself
2) The lock does not distinguish between the script starting and stopping (!)
3) It looks as a workaround for the "already starting/started" warning to me

Notes for some points:
1) Fixed with my locking implementation
2) Not a big problem, but if somebody is experimenting with starting/stopping simultaneously, this can cause troubles. For serious starting/stopping this is not a big deal. I can imagine this - when you start something (with some dependencies not started) just before the computer starts to shut down.
3) I think that the option "--quiet" (or "--no-warning", or anything better) is cleaner way to get rid of the warning - the startup/stopping will be decided at one place in runscript.sh.

With the full locking and "--no-warning" option, there is no need to use begin_service any more.

I will have a look at the new baselayout version this evening. Should I present my results here? Is there some interrest?

Comment 17 Roy Marples (RETIRED) gentoo-dev

2006-01-16 02:26:38 UTC

begin_serice is runscript.sh now
It's also in rc-services.sh so that we avoid spamming the tty with messages. Yes, this could be regarded as a hack for already starting/started messages.

I think that we are locking as well as can be. I would rather any locking issues were addressed with code that we have as I think your locking code, whilst looks of good quality, is too overkill for our needs.

begin_service is needed so that wait_service can work - we need the fifos so can effectively "wait" for services started outside of our realm to complete which is essential for parallel startup.

Comment 18 Oldrich Jedlicka 2006-01-16 13:51:21 UTC

Only one point for now (as the description is a bit longer):

Situation: Service is started and stopped at the same time. Function svc_start() already marked the service starting, so svc_stop() passed over service_stopped() and calls mark_service_stopping(). After this there is begin_service() call without any check for return value (and there cannot be any check as it is normally executed before), but this is another story.

Yes, there is a question, why anybody would try to start and stop the service at the same time (the service stopping has to come from the command line, starting is not limited), but...

Solutions (?):
1) There can be a check for the return value of mark_service_starting() in svc_stop(), but there is a question what to do then.
2) Use mark_service_starting() first in both svc_start() and svc_stop() as an additional lock. But it is never good to set a wrong status.

Well, I think I will leave the discussion as all troubles cannot be solved cleanly for all possibilities (to become fool-proof), only for commonly used scenarios. And I love clean solutions.

Comment 19 Oldrich Jedlicka 2006-05-25 00:29:45 UTC

If somebody finds it necessary, I still have it. Closing.

Comment 20 Daniel Drake (RETIRED) gentoo-dev

2006-12-11 08:12:07 UTC

Hi Oldrich,

Are you still interested in this area? By demonstrating a very repeatable bug in baselayout I think I've convinced Roy that we do need a locking strategy. I'm interested in your earlier work and hope to find some time to look at it soon. See bug #154670

Comment 21 Oldrich Jedlicka 2006-12-16 04:46:10 UTC

(In reply to comment #20)
> Are you still interested in this area? By demonstrating a very repeatable bug
> in baselayout I think I've convinced Roy that we do need a locking strategy.
> I'm interested in your earlier work and hope to find some time to look at it
> soon. See bug #154670

If you want to use this kind of locking, I can help. The updated implementation (in attachment from January) should be ready for inclusion.

It needs write access to /tmp directory (for temporary files) and ${svclock} directory (for lock files, svclock variable needs to be defined, preferably in /etc/conf.d/rc).

Usage is as easy as writing

lock "name"
... do something ...
unlock "name"

or

trylock "name" || return
... do something ...
unlock "name"

More complicated scenarios are in the description of the file (I've found that trylock is not used in any example, but it can be used freely). It automatically unlocks, if you forget to unlock (or when you exit from bash unexpectedly).