Bug 501364

Summary:	supervisor plugin design and a runit example
Product:	Gentoo Hosted Projects	Reporter:	Benda Xu <heroxbd>
Component:	OpenRC	Assignee:	OpenRC Team <openrc>
Status:	RESOLVED FIXED
Severity:	enhancement	CC:	CasperVector, ccx, dlan, eivind, jakub, lu_zero, sorin.panca, tokiclover, xaionaro
Priority:	Normal
Version:	unspecified
Hardware:	All
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	runit.patch runit.patch runit.patch

Description Benda Xu gentoo-dev

2014-02-15 09:00:36 UTC

Dear fellows,

Here is a rebase of the supervisor plugin system onto the latest OpenRC git repo.

    1. separate background and foreground execution modes
    2. split start-stop-daemon template into an addon, add runit along with it
    3. init script can just set command{,_args}, arg_{foreground,background} and use the default start and stop defined in the addons.
    4. command_foreground does not need to be set when $command starts foreground by default. Same applies to command_background.
    5. $command defaults to the executable in PATH named after SVCNAME.

@Alexander, I think overriding start/stop/status via the addon mechanism is cleaner than introducing an array of _pres/pros. I'd like to hear your preference here. If it is positive, I'd like to implement monit in the addon scheme for you to test out. I can work on specifications and manpages afterwards.

@William, this is your long-awaited rebase. Sorry I've procrastinated for a whole year. Hope you'll like the patch.

Reproducible: Always

Comment 1 Benda Xu gentoo-dev

2014-02-15 09:01:09 UTC

Created attachment 370460 [details, diff]
runit.patch

Comment 2 Benda Xu gentoo-dev

2014-02-15 09:04:46 UTC

Created attachment 370462 [details, diff]
runit.patch

Comment 3 Benda Xu gentoo-dev

2014-02-15 09:05:49 UTC

Sorry, the patch comment should read as follows (without item 5):

    1. separate background and foreground execution modes
    2. split start-stop-daemon template into an addon, add runit along with it
    3. init script can just set command{,_args}, arg_{foreground,background} and use the default start and stop defined in the addons.
    4. command_foreground does not need to be set when $command starts foreground by default. Same applies to command_background.

Comment 4 Benda Xu gentoo-dev

2014-02-15 09:10:13 UTC

How to test it out on rsync daemon:

1. install runit
2. runsvchdir /var/runit
3. runsvdir-start (manually or via inittab)
4. rc-service rsyncd stop
5. add these two lines to /etc/conf.d/rsyncd

VIA=runit
arg_foreground="--no-detach"

6. rc-service rsyncd start

confirm rsyncd is started by runit.

Comment 5 Jan Pobrislo 2014-02-17 22:40:22 UTC

This changes the semantics of "started" to "scheduled to start"; scripts
relying on the old meaning, including dependencies, will break.
Sleeping for a set timeout is silly.

It's better to create the whole directory and move/symlink it to the watched
directory atomically, otherwise you have race conditions.
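
A minimal sketch of the atomic approach described above, assuming the staging
directory lives on the same filesystem as the watched directory (otherwise mv
falls back to a copy and is no longer atomic). The function name
publish_service, its arguments and all paths are hypothetical, not taken from
the patch.

```shell
#!/bin/sh
# Sketch: build a runit service directory out of sight, then publish it
# to the watched directory with a single rename.
publish_service() {
	name=$1      # service name, e.g. rsyncd
	cmd=$2       # foreground command line to put into ./run
	scandir=$3   # directory watched by runsvdir
	staging=$4   # scratch directory on the same filesystem as $scandir

	tmp="${staging}/${name}.$$"
	mkdir -p "${tmp}" || return 1
	# Write the complete ./run script before runsvdir can see it.
	printf '#!/bin/sh\nexec %s\n' "${cmd}" > "${tmp}/run" || return 1
	chmod +x "${tmp}/run" || return 1
	# A rename within one filesystem is atomic, so runsvdir sees either
	# no entry at all or the finished directory.
	# Note: if ${scandir}/${name} already exists, mv would move the new
	# directory inside it, so the caller must make sure it does not.
	mv "${tmp}" "${scandir}/${name}"
}
```

Usage would look like `publish_service rsyncd "rsync --daemon --no-detach" /var/runit /var/runit-staging`, with both paths being examples only.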

I don't think that having scripts and logs in the same place is the best idea.
Ideally the watched directory would be in /run, which is tmpfs, so we don't get
junk from a crashed boot.

You create a run script that executes a single command. That might or might not
be sufficient. We quite often have cleanup / prepare in start(), eg. postgresql
notoriously leaves behind a unix socket when killed and refuses to start if it
is present, so you need to remove that.

You could run a specific exported function (eg. start_foreground) from the
actual initscript so you could reuse the code for cleanup / prepare, but that'd
require a modification of runscript so you don't get a superfluous fork()
getting in the way, eg. like I did in http://bpaste.net/show/37565/.

Curiously, you force full shutdown on reload, instead of sending some kind of
reload signal (most frequently SIGHUP). Also you don't regenerate the run file
in that case.

Sleeping for 5s and hoping it will be enough really isn't the way. Runit
provides a mechanism for waiting for a daemon to start, including a custom
check script that you can use to ascertain whether the daemon is running. It's
slightly deficient though, so you really have to wait for the supervise/ok fifo
to appear (inotifywait can help where we have it, otherwise a sleep loop with
say a 0.5s interval will be way nicer than just a plain 5s) and then you issue
"sv check foobar" to run whatever test you have on the service and wait up to a
configurable time until it succeeds.
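
A rough sketch of that sequence: poll for supervise/ok, then hand over to
"sv check". The function names, the SV_TIMEOUT variable and the retry count are
made up for illustration; fractional "sleep 0.5" is not POSIX, but it works
with GNU coreutils and busybox.

```shell
#!/bin/sh
# wait_for_ok SVDIR [TRIES]: poll every 0.5s until runsv has created
# SVDIR/supervise/ok, giving up after TRIES attempts (default 20, ~10s).
wait_for_ok() {
	svdir=$1
	tries=${2:-20}
	while [ "${tries}" -gt 0 ]; do
		# runsv creates supervise/ok once it manages the directory.
		[ -e "${svdir}/supervise/ok" ] && return 0
		sleep 0.5
		tries=$((tries - 1))
	done
	return 1
}

# start_supervised SVDIR: wait for runsv, then let sv run ./check (if the
# service directory has one) for up to SV_TIMEOUT seconds.
start_supervised() {
	svdir=$1
	wait_for_ok "${svdir}" || return 1
	# -w sets how long sv waits; 7 seconds is sv's documented default.
	sv -w "${SV_TIMEOUT:-7}" check "${svdir}"
}
```

Where the generic waitfile helper or inotifywait is available, either could replace the polling loop.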

This would require writing check scripts for services we want supervised.  This
isn't as hard as it may sound, for a lot of daemons we can consider them up
when some socket is open. We can extract this either from configfile/conf.d
where we have this information already or we can (non-posix though!) check just
/proc/pid/fd whether any unix or inet sockets are open.
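
As an illustration of the /proc variant, a Linux-only sketch. The function name
has_open_socket and its optional second argument (an alternative /proc root,
handy for testing) are hypothetical. A ./check script could call it with the
PID that runsv records in supervise/pid.

```shell
#!/bin/sh
# has_open_socket PID [PROCROOT]: succeed if the process has at least one
# socket among its open file descriptors. Linux-specific, not POSIX.
has_open_socket() {
	pid=$1
	proc=${2:-/proc}
	for fd in "${proc}/${pid}/fd"/*; do
		# On Linux, socket descriptors are symlinks reading "socket:[inode]".
		case "$(readlink "${fd}" 2>/dev/null)" in
			socket:*) return 0 ;;
		esac
	done
	return 1
}

# A ./check script could then be as small as (assuming it runs inside the
# service directory):
#   has_open_socket "$(cat supervise/pid)"
```

This cannot tell a listening socket from, say, a syslog connection, so for many daemons a check against the configured port or socket path would be more precise.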

HTH

PS: sv start also waits for ./check so if we precreate all service directories
at once with ./down in place, runsv is going to be already running for all
except the earliest scripts and we can just use "sv start" in place of
"sv check".

Comment 6 Jan Pobrislo 2014-02-18 01:22:59 UTC

Ok, what I wrote above about requiring a patch to runscript itself is slightly
incorrect. I just find it to be a way cleaner approach. To make start_pre (aka
prepare in above post) work, you might need to extract current values out of
configuration file. If you do so you also need to have current $command and
$command_args. If parts of ./run are hardcoded and parts call into initscript
you will get into trouble whenever you change anything significant in
configuration file and the daemon gets autorestarted.

There are about three different ways you can call the initscript to avoid this
problem and reuse the code. The ./run script can be either:

rc-service foo start_pre && exec $(rc-service foo print_command)

where the exported commands set up the daemon for running and print the command
that is supposed to be executed respectively.

You can merge it to one function but you have to be sure to not print any
status/errors to stdout, which is what einfo/ebegin family of commands does.

Or you can do what I wrote patch for: have your ./run be just

exec rc-service foo start_foreground 

so you can have custom start_foreground command, which does actually handle
some corner cases better, eg. have arguments with whitespace, which is not
really possible with the way $command and $command_args are normally handled.
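
To illustrate the whitespace problem, a small self-contained demonstration.
count_args is a stand-in for the daemon, and this start_foreground is only a
toy version of the proposed command. It shows that a flat $command_args string
expanded unquoted gets split on whitespace, quotes included, while a function
can quote each argument itself.

```shell
#!/bin/sh
# count_args stands in for the daemon: it reports how many arguments it got.
count_args() { echo "$#"; }

# The conventional style: one flat string, expanded unquoted.
command_args='--name "My Daemon"'
flat=$(count_args ${command_args})

# A function can quote each argument individually.
start_foreground() { count_args --name "My Daemon"; }
quoted=$(start_foreground)

echo "flat string: ${flat} arguments, function: ${quoted} arguments"
# prints "flat string: 3 arguments, function: 2 arguments"
```

The flat string yields three words because the double quotes inside the variable are ordinary characters by the time word splitting happens.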

Comment 7 William Hubbs gentoo-dev

2014-02-18 01:40:13 UTC

Instead of sleeping 5 seconds, if you are waiting for a file to appear, you can use waitfile (see the openrc-run manpage if you are looking at openrc-git).

Also, I highly discourage using the addon code and making these addons. I would put the start-stop-daemon.sh and runit.sh files in the sh directory.
You import them into runscript.sh.in using some variant of the sourcex call.

Comment 8 Benda Xu gentoo-dev

2014-02-20 07:24:06 UTC

Hey Jan,

Long time no see!

exec rc-service foo start_foreground feels like a cool solution. If we can derive foo from an environment variable or the directory name, there will be a universal runit directory! (The cat >> run in my patch is really ugly)
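
A sketch of what such a universal ./run could look like, assuming the service
directory is named after the OpenRC service and that the start_foreground
command from Jan's patch exists (it is not part of stock OpenRC). The helper
name svc_from_dir is hypothetical.

```shell
#!/bin/sh
# svc_from_dir [DIR]: derive the OpenRC service name from the name of the
# service directory (defaults to the current directory, which is where
# runsv starts ./run).
svc_from_dir() {
	dir=${1:-$PWD}
	basename "${dir%/}"
}

# In a real ./run the last line would then be:
#   exec rc-service "$(svc_from_dir)" start_foreground
```

Every service could then share this one file, since nothing in it is specific to a single daemon.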

For the other parts, I am sure I have done something suboptimal (sleep 5s, state, etc.). But I could not fully follow you, as I am not that experienced with process supervision, and runit specifically. Could you please paste some code to support your argument? (A patch or git repo; perfect if based on my patch.) Thanks a lot.

@William, thanks for the input. I'll replace addons with sourcex in my next patch.

Benda

Comment 9 William Hubbs gentoo-dev

2014-03-16 04:46:18 UTC

Created attachment 372786 [details, diff]
runit.patch

All,

here is a slightly updated version of this patch.

This includes updates to the openrc-run man page for the new variables
as well.

I had to remove the command_env variable, because it could only hold one
environment variable. Also I removed start_wait because that is specific
to start-stop-daemon.

Benda, your name will be listed as the primary author; I just
made some modifications.

I need input on this version, in particular, I'm not following how to
make sure a service was successfully started.

Thanks,

William

Comment 10 Jan Pobrislo 2014-04-09 13:21:09 UTC

For the record, it seems we can signal to runsvdir that we want to reload the service directory by sending it the CONT signal.

http://comments.gmane.org/gmane.comp.sysutils.supervision.general/99

We will of course still need to waitfile on the supervise directory's content, but this should rid us of the unpleasant delay when adding a service.

Comment 11 William Hubbs gentoo-dev

2014-07-18 18:06:13 UTC

I have some questions about this that are still not clear to me.

1. How do we start runsvdir to begin with?
2. How do we make sure runsvdir is restarted if it dies?

I see only two ways we can do this. We can figure out a way to add this capability to the patch, or we can just write a document telling users how to set it up.

Comment 12 Benda Xu gentoo-dev

2014-07-22 08:43:02 UTC

(In reply to William Hubbs from comment #11)
> I have some questions about this that are still not clear to me.
> 
> 1. How do we start runsvdir to begin with?
> 2. How do we make sure runsvdir is restarted if it dies?

The present thinking is to start runsvdir from inittab, so that it persists.  The drawback is that the user has to add it manually, so we should document it.

> I see only two ways we can do this. We can figure out a way to add this
> capability to the patch, or we can just write a document telling users how
> to set it up.

Can we really achieve this inside OpenRC?  I prefer documenting for users how to start runsvdir from inittab.
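
For the documentation route, a sketch of the manual step. The inittab line is
the one suggested by upstream runit's sysvinit instructions. The
/sbin/runsvdir-start path may differ on a given install, and the helper name
add_runsvdir_to_inittab is made up.

```shell
#!/bin/sh
# add_runsvdir_to_inittab [FILE]: append a respawn entry for runsvdir-start
# unless the file already mentions it. FILE defaults to /etc/inittab.
add_runsvdir_to_inittab() {
	inittab=${1:-/etc/inittab}
	line='SV:123456:respawn:/sbin/runsvdir-start'
	# Idempotent: do nothing if an entry is already present.
	grep -qF 'runsvdir-start' "${inittab}" 2>/dev/null && return 0
	printf '%s\n' "${line}" >> "${inittab}"
}
```

After editing inittab, "telinit q" makes sysvinit re-read it. With "respawn", init restarts runsvdir-start whenever it exits.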

(In reply to William Hubbs from comment #9)
> Created attachment 372786 [details, diff]
> runit.patch
> 
> All,
> 
> here is a slightly updated version of this patch.
> 
> This includes updates to the openrc-run man page for the new variables
> as well.
> 
> I had to remove the command_env variable, because it could only hold one
> environment variable. Also I removed start_wait because that is specific
> to start-stop-daemon.

Well done, William. Thank you.  I will base the next patch on this.

> I need input on this version, in particular, I'm not following how to
> make sure a service was successfully started.

More on this later.

Comment 13 Benda Xu gentoo-dev

2014-07-23 07:33:35 UTC

One problem here: we don't know the pid of runsvdir to send the CONT signal.

Comment 14 Benda Xu gentoo-dev

2014-07-23 08:07:54 UTC

(In reply to Benda Xu from comment #13)
> One problem here: we don't know the pid of runsvdir to send the CONT signal.

One possibility is to start runsvdir with a separate runsv, and use the runsv interface such as supervise/pid to send signals to runsvdir reliably.  A bonus is that inittab need not be modified to respawn runsvdir.
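
A sketch of that idea, assuming runsvdir really does rescan on CONT as
reported in comment 10 (not verified here) and that it runs under its own
runsv, whose supervise/pid file then holds its PID. The function name
signal_runsvdir and the example path are hypothetical.

```shell
#!/bin/sh
# signal_runsvdir PIDFILE: send CONT to the process whose PID is stored in
# PIDFILE, e.g. the supervise/pid file of the runsv that supervises runsvdir.
signal_runsvdir() {
	pidfile=$1
	[ -r "${pidfile}" ] || return 1
	pid=$(cat "${pidfile}")
	# Refuse empty or non-numeric content rather than signalling blindly.
	case "${pid}" in
		''|*[!0-9]*) return 1 ;;
	esac
	kill -CONT "${pid}"
}
```

A call might look like `signal_runsvdir /etc/sv/runsvdir/supervise/pid`, where the path is an example only.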

Comment 15 William Hubbs gentoo-dev

2014-11-13 19:58:01 UTC

All,

I just added the most recent version of runit to the tree, so now we can
talk more about this bug.

Upstream runit documents how runit should be used with sysvinit [1].

This sounds like we should put runsvdir-start in inittab and actually
write runit scripts for the services we would want runit to handle
instead of having openrc generate them dynamically.

I would like some comments. What do people think?

[1] http://www.smarden.org/runit/useinit.html

Comment 16 Benda Xu gentoo-dev

2014-11-14 03:38:43 UTC

(In reply to William Hubbs from comment #15)

> This sounds like we should put runsvdir-start in inittab and actually
> write runit scripts for the services we would want runit to handle
> instead of having openrc generate them dynamically.

That means we use OpenRC and runit separately, doesn't it?

Then do we focus on a general interface for OpenRC to interact with the supervisors instead?

Comment 17 James L. Hammons 2014-12-02 15:15:56 UTC

After looking at runit and how it works, it seems to me that the domains that it and openrc work in are not exactly orthogonal, and it would be more work than it's worth to get them to play nicely together.

That said, I'm working on a small init replacement (inspired by https://felipec.wordpress.com/2013/11/04/init/) and have it where it can boot and shut down the system reliably using OpenRC to do all the heavy lifting. It should be fairly easy to add functionality to it, like process supervision and the like, but in a way that interfaces cleanly with OpenRC.

Just my 2¢.

Comment 18 Jakub Jirutka 2014-12-02 15:36:37 UTC

(In reply to James L. Hammons from comment #17)
> After looking at runit and how it works, it seems to me that the domain that
> it and openrc work in are not exactly orthogonal, and it would be more work
> than it's worth to get them to play nicely together.
> 
> That said, I'm working on a small init replacement (inspired by
> https://felipec.wordpress.com/2013/11/04/init/) and have it where it can
> boot and shut down the system reliably using OpenRC to do all the heavy
> lifting. It should be fairly easy to add functionality to it, like process
> supervision and the like, but in a way that interfaces cleanly with OpenRC.
> 
> Just my 2¢.

Great! Thanks for the link to very interesting article.

Comment 19 James L. Hammons 2014-12-08 13:17:38 UTC

The ball is now rolling! Check out http://forums.gentoo.org/viewtopic-p-7664146.html for more information.

Testing is needed and appreciated; advice & code are always welcome.

Comment 20 Benda Xu gentoo-dev

2014-12-08 23:38:07 UTC

Hello James,

(In reply to James L. Hammons from comment #19)
> The ball is now rolling! Check out
> http://forums.gentoo.org/viewtopic-p-7664146.html for more information.
> 
> Testing is needed and appreciated; advice & code are always welcome.

Very nice post! I am impressed by how short the Ruby code is.

Benda

Comment 21 tokiclover 2014-12-27 10:45:47 UTC

I've just discovered this bug... when trying to file a bug in the OpenRC repository. (I had already forgotten about that bug ;-)

I've filed this bug #533418 after AntP. bug #521918 (shutdown is broken), bug #522204 (login shells are broken - patch obsoleted by above fix) and bug #522786 (2.1.2 version bump) in order to have an out of the box "Just Works(TM)" replacement of SysVinit `init' by `runit-init'.

[ An aside on the above:
I ran into an issue, described in topic #998478 (http://forums.gentoo.org/viewtopic-t-998478-start-25.html), when using `runit-init' as PID 1. In short: a process or daemon (run in the foreground) hangs in stage 1. What happens afterwards? `runit-init' waits forever, with C-ALT-DEL inactive.]

To make a _short_ story short, that topic and other topics related to PID 1 and SystemD have interesting discussions on service supervision.

I think a sane approach is DO NOT SUPERVISE EVERY DAEMON/SERVICE, because this can be very dangerous, be it on a desktop, a server or anything else. Nobody would want every dead service/daemon restarted automatically. Supervision can be beneficial if used for particular daemons/services.

So the choice of supervision _should_ be available in the init service script but not set globally in `rc.conf', because this is potentially dangerous.

Another issue is with starting/stopping the process/daemon itself. Taking the runit case, the `runsvdir' supervisor scans the root service directory every 5 seconds or so. So making a symlink and waiting for `runsvdir' to pick up the service later is not practical at boot/shutdown, because a service _should_ be started/stopped right away and not after 5 seconds (or whatever the delay is).

runit provides the `sv' binary to do this. So, the current patch should implement a less flawed start/stop mechanism.

--
...
# This `sv_dir="${RC_SVCDIR}/sv/${RC_SVCNAME}"' should be replaced
# by the following, because keeping the runit service directories
# separate from the OpenRC scripts makes more sense than stuffing `init.d'.
# RC_SERVICE is the path of the init script itself, so strip the file
# name before going up one level.
sv_dir="${RC_SERVICE%/*}/../sv/${RC_SVCNAME}"

# Define this handy variable to avoid repeating the long path
sv_rundir="${RC_SVCDIR}/runit/${RC_SVCNAME}"

start()
{
	do_service
	ebegin "Starting runit supervised ${RC_SVCNAME}"
	# -n/-f: replace a stale link instead of creating a new one inside it
	ln -snf "${sv_dir}" "${sv_rundir}" &&
		sv start "${sv_rundir}"
	eend $?
}

stop()
{
	ebegin "Stopping runit supervised ${RC_SVCNAME}"
	sv stop "${sv_rundir}"
	eend $?
	# Is this really needed? (tmpfs)
	#rm -f "${sv_rundir}"
}

status()
{
	sv status "${sv_rundir}"
}

reload()
{
	ebegin "Reloading runit supervised ${RC_SVCNAME}"
	sv reload "${sv_rundir}"
	eend $?
}
--

Notice I renamed `make_sv_dir' to `do_service'. `make_sv_dir' sounds a little... hard to grasp. This is just a cosmetic change. Another cosmetic change is the start/stop message.

Thanks.

Comment 22 tokiclover 2014-12-28 14:30:17 UTC

Well, I played a little with a modified variant of William's patch and it is indeed impossible to get runit & OpenRC to play nicely together, solely because of the `start()' function. The other functions are no problem with `sv' around. It's just that `sv' is completely useless for starting a new service which wasn't running before its invocation. It will simply fail in this case.

[Note: I am actually using runit-init as PID 1, but other than getting getty supervised, I don't use it for other purpose at the moment.]

So, I included a line to start a new instance of runsvdir in the start function for testing.

Using `runsv' to start a service in a subshell (because it runs in the foreground) brings a great deal of race conditions, which can end up launching a dozen daemons. And stopping them with `sv' would not be enough without a kill-all command.

And indeed, the log directory is more of a bother than anything else, and I had to remove it quickly to avoid useless hassles.

The s6 developer released a new 2.0 version of the s6 suite, and he's going to simplify and clean up the code a little in the upcoming year. I took a look at it... and hit one issue after another trying to merge the package cleanly. Actually, 2.0 introduced a more standard build/installation (configure/Makefile), but it's not quite neat for now. So I will wait a little before experimenting with this patch and s6.

Comment 23 James L. Hammons 2014-12-31 16:41:18 UTC

I agree that not every service should be supervised; that way madness lies. But I can also see the use cases for having services be supervised, like agetty and sshd (if you are in a remote session and accidentally kill the root sshd process, for example).

The funny thing is that Sys V init already does process supervision by setting the appropriate lines in /etc/inittab (otherwise, your login shells wouldn't respawn after you logged out of them!), so the mechanism is already there--the problem is that it doesn't integrate *at all* with OpenRC.

So, basically what I'm doing with my small init replacement is adding the ability to monitor processes and relaunch them using OpenRC if they die (only if the user wants the service monitored). This requires a small patch to start-stop-daemon, but the impact on OpenRC overall is very minor and requires no changes to existing scripts. I have proof-of-concept code working right now, but need to refine it.

The nice thing about all of this is that even with my small patches in start-stop-daemon, you can still run regular Sys V init with it without any problems. The patches only do anything if my small init is running.

FWIW, my 2¢.

Comment 24 William Hubbs gentoo-dev

2015-05-09 16:02:47 UTC

Folks,

I am back to working on this bug again, and commit abef2fc adds the ability to override the start, stop and status functions.

There is another approach for making runit available to supervise services on an OpenRC system, without replacing init, which we haven't considered.

The theory is that runit itself will be an OpenRC service, then any services that are supervised by runit will have a need dependency for the runit service.
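
In init-script terms, the dependency could be sketched as below. "runit" as
the name of the OpenRC service that runs runsvdir is an assumption, as is the
description text. Only the depend/need mechanism itself is standard OpenRC.

```shell
#!/sbin/openrc-run
# Hypothetical init script fragment for a service supervised by runit.
# "runit" is an assumed name for the OpenRC service that runs runsvdir.
description="example daemon supervised by runit"

depend() {
	# Do not start this service until the runit service itself is up.
	need runit
}
```

With this in place, OpenRC would start runit first and would stop the supervised service before stopping runit.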

The downside of this is that runit will not be supervised. However, I think the runit process itself should be stable enough that we don't have to worry too much about it crashing.

Also, I don't think we should be trying to automatically generate runit services; I think building the services should be left to the service authors.

I will post a new patch shortly.

Comment 25 William Hubbs gentoo-dev

2015-05-09 19:04:41 UTC

The commit I just cited has an issue. It forces the supervisor to be set
in /etc/conf.d/* or /etc/rc.conf.

This is not really correct since we want script authors to control this,
not users.

I will make a change in git soon to deal with this issue.

Comment 26 William Hubbs gentoo-dev

2016-07-27 21:28:52 UTC

https://github.com/openrc/openrc/commit/f62253b

This adds runit support that is very similar to the s6 support. This
will be included in 0.22.