335398 – net-dns/bind: restart fails after a longer uptime

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Server
Hardware:	All Linux

Importance:	High normal
Assignee:	Christian Ruppert (idl0r)

Reported:	2010-08-31 12:11 UTC by Guido Jäkel
Modified:	2011-01-09 20:57 UTC
CC List:	3 users

Attachments
Patch against named.init-r7 (as per comment 11) (named.init-r7-bug-335398.patch, 1.60 KB, patch) 2010-12-04 17:54 UTC, kfm

Description Guido Jäkel 2010-08-31 12:11:43 UTC

When using the normal restart command of the init script, the daemon is stopped but not started again.

Reproducible: Sometimes

Steps to Reproduce:
Run /etc/init.d/named restart after named has been up for a longer time.

Actual Results:  
Either the start is reported as failed, or it is sometimes even reported as started after the stop although the process isn't actually running.

Expected Results:  
A newly started, running named process.

It's caused by a race condition in the init script:

If configured (i.e. if a key file is found), the "stop"-part of the script currently uses rndc to shut down the named. The return from rndc is asynchronous to the real termination of the named.

Therefore the following "start" part will be too fast; it may see the old PID file or run into resource conflicts with the terminating process.


I suggest dropping the use of rndc for the stop. Because start-stop-daemon is already the fallback when no communication key is defined, one just has to strip the if-then clause and keep the corresponding else clause.

stop() {
        ebegin "Stopping ${CHROOT:+chrooted }named"
        checkconfig || return 2

#       if [ -f $KEY ] ; then
#               rndc -k $KEY stop &>/dev/null
#       else
#               start-stop-daemon --stop --quiet --pidfile $PIDFILE \
#                       --exec /usr/sbin/named -- stop
        start-stop-daemon --stop --quiet --pidfile $PIDFILE \
                --exec /usr/sbin/named -R 10
#       fi

        eend $?
}



I also suggest modifying the stop command itself, because named doesn't accept a 'stop' parameter. Instead one may add shutdown monitoring with a timeout, as the -R 10 above does.
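
The shutdown monitoring suggested above can be sketched in plain shell as follows. This is only an illustration: wait_for_exit is a hypothetical helper, not part of the init script, and the 10-second default merely mirrors the -R 10 above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Gentoo init script): poll until the
# process with the given PID has gone away, or give up after a timeout.
wait_for_exit() {
    pid=$1
    timeout=${2:-10}    # seconds; the default mirrors "-R 10"
    elapsed=0
    # kill -0 sends no signal; it only tests whether the PID still exists
    while kill -0 "$pid" 2>/dev/null; do
        [ "$elapsed" -ge "$timeout" ] && return 1    # still alive: timed out
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0    # process is gone, a following start is safe
}
```

A restart would then only call start once wait_for_exit has returned 0 for the old named PID.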

Comment 1 Christian Ruppert (idl0r) gentoo-dev 2010-10-09 21:29:15 UTC

This is fixed in 9.4.3_p5-r1, 9.6.2_p2-r1, 9.7.1_p2-r1 and above. You can unset RNDC_KEY to avoid rndc. Thanks.

Comment 2 Guido Jäkel 2010-10-11 07:46:32 UTC

Thanks for fixing this; I'll upgrade to such a version.

Comment 3 Guido Jäkel 2010-10-18 10:46:20 UTC

I'm sorry, but I can't see that this is fixed (looking today at net-dns/bind-9.7.2_p2). There I still find:

        if [ -n "${RNDC_KEY}" ] && [ -f "${RNDC_KEY}" ]; then
                rndc $SERVER -k $RNDC_KEY stop 1>/dev/null
        else
                # -R 10, bug 335398
                start-stop-daemon --stop --retry 10 --pidfile $PIDFILE \
                        --exec /usr/sbin/named
        fi

I.e. if there is a RNDC_KEY, rndc is still used to send the stop command. The rndc tool is typically used if you want to control named from another host. I need this e.g. to get statistics. Because of that I *need* to set a RNDC_KEY.

But for stopping named locally, start-stop-daemon should be used, because this should be a synchronous action: the script has to *wait* for the shutdown or the next start will fail. And the shutdown *will* take some seconds if named has been up and in "real use" -- in my case as a member of an internal resolver farm -- for a longer time.

Thank you for revising this.

Comment 4 Christian Ruppert (idl0r) gentoo-dev 2010-10-18 17:36:34 UTC

You're now able to simply uncomment/comment the RNDC_KEY variable to enable/disable the use of rndc.

I might even comment it by default.

Comment 5 Guido Jäkel 2010-10-19 08:03:22 UTC

Dear Christian,

I'm very sorry for my bad English - it seems I can't point out the problem, but I'll try again:

I *need* to use remote commanding, therefore I *need* the mechanism to control bind from another host and from the local host, too. Therefore RNDC_KEY has to be set.

But in the framework of the init scripts, the stop command should not use rndc to shut down bind, because this is an asynchronous call: the init script continues after the rndc call without waiting for the real shutdown of named. That's no problem if you just want to stop the daemon, but in the case of a *restart*, the implicit start command will fail because the new named process can't start while some resources are still in use.

In my observation this only happens after a longer runtime of a really busy named, because in this case the shutdown of bind takes noticeable fractions of a second or even longer:

19-Oct-2010 09:48:43.161 shutting down: flushing changes
19-Oct-2010 09:48:43.167 stopping command channel on 127.0.0.1#953
19-Oct-2010 09:48:43.913 exiting

You see: even if, in the case of the stop command, rndc waits until the connection breaks (which must happen when bind logs "stopping command channel"), this is still "far away" from the real end of the shutdown.
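
To put a number on that gap: in the log excerpt above, the command channel is stopped at 09:48:43.167 and named exits at 09:48:43.913. A minimal sketch for computing the difference (elapsed_seconds is a hypothetical helper, and it assumes both timestamps fall on the same day):

```shell
#!/bin/sh
# Hypothetical helper: seconds between two "HH:MM:SS.mmm" timestamps
# taken on the same day.
elapsed_seconds() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, s, ":"); split(b, e, ":")
        printf "%.3f\n", (e[1]*3600 + e[2]*60 + e[3]) - (s[1]*3600 + s[2]*60 + s[3])
    }'
}

# Timestamps taken from the log lines above
elapsed_seconds 09:48:43.167 09:48:43.913
```

This prints 0.746, i.e. roughly three quarters of a second during which an immediately following start can collide with the old process.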

Using the 'normal' start-stop-daemon mechanism results in a much more synchronous stop based on the PID file (which seems to be released in the "exiting" phase) and should therefore be used even if a RNDC_KEY is defined. The use of rndc should therefore be stripped out.

In my opinion, even for the reload command there is no reason to use rndc when it is available: the script runs in the local root context, and sending a SIGHUP to the process is sufficient.

With respect to the KISS principle, the use of rndc should be stripped out here, too.


Thanks for the discussion

Guido

Comment 6 Christian Ruppert (idl0r) gentoo-dev 2010-11-09 21:45:44 UTC

Sorry but it seems I really don't get it...
I asked a few others but they seem to understand it like me.

So I have commented out the default RNDC variables to not use rndc by default.

> I *need* to use remote commanding, therefore I *need* the mechanism to control
> bind from another host and from the local host, too. Therefore RNDC_KEY has to
> be set.

You can still use it from CLI and init script. If you use the init script then either make a copy of the original init script and give it a different name to be able to use rndc for reload/stop for remote server or if you need more rndc commands you have to edit it yourself anyway.

And if I understood you correct then you want me to remove all rndc calls from the init script so it contradicts with the quoted sentence above and that's why I am a bit confused.

Comment 7 Guido Jäkel 2010-11-10 08:14:35 UTC

> Sorry but it seems I really don't get it...

Dear Christian,
first I would like to thank you for the discussion. I'll try again to get thin{g,k}s clear.


> And if I understood you correct then you want me to remove all rndc calls from
> the init script so it contradicts with the quoted sentence above and that's why
> I am a bit confused.

Yes, I want all usage of rndc, the *remote* name daemon control utility, removed from the *locally* used init scripts. I think we agree on the idea that the init script framework is intended to mechanize the services on a concrete host itself.

In the case of the bind named, everything you need to achieve this can be done by "communicating" with it in a very common way, e.g. by sending a SIGTERM to gracefully stop the daemon.

You *may* use rndc over a local host connection for this purpose, if both rndc *and* named are set up for it (remote control is enabled, a key is available to both). But there's no need for this.

On the other hand, if you have to *remotely* control some administrative task other than simple starting/stopping for a named running on another host, then you may use rndc. You might use an "ssh command forward", too. But rndc is the provided tool for this task.

Please consider, that you have to use the rndc-key of the *remote* host for this. If you really respect security, you have to use different keys for the named's on different hosts. This actually can only be managed in a reasonable way by using a configuration file and not by use of RNDC, please take a look to man rndc.conf



This is a big loop on the meta level. But if we can reach consensus on this, then the issue I reported will disappear automatically.

In fact, in the newest script it has disappeared *if* (but only if) you're using the new "automagic" chroot setup -- a good job! With this, some umounts happen after shutdown, and they too have to wait until the file systems are idle. There's a check for that, and of course it should be left there for stability. But I guess it was introduced to deal with the same problem I reported initially...

sincerely
Guido

Comment 8 Stefan Behte (RETIRED) gentoo-dev 2010-11-18 12:08:44 UTC

I am also experiencing the problem, even after a little bit of uptime with BIND 9.4.3_p5. It nearly always needs two restarts, which is a bit annoying.

Comment 9 Guido Jäkel 2010-11-18 12:32:21 UTC

Dear Stefan,
thank you for the "me, too" vote.

Could you test whether this disappears if you use a stop and a delayed start instead of a restart? Could you also test the suggested modification of the init script?

Guido

Comment 10 Stefan Behte (RETIRED) gentoo-dev 2010-11-24 12:20:21 UTC

named.init-r8 fixes the problem for me, thanks idl0r.

I understand Guido's point as: "remove the rndc restart based on the setting of 'RNDC_KEY', it's no good."

Comment 11 kfm 2010-12-04 17:52:16 UTC

I just strayed across this bug as I have been plagued by this bug on all of my production systems for a long time and I've been meaning to file a report ...

It seems to me that all of Guido's points are manifestly correct so I'm going to go over them, starting with the problem depicted in the bug summary. Firstly, at the point that "rndc stop" terminates, it cannot be guaranteed that bind has, in fact, stopped. Therefore, where a near-immediate invocation of start() follows - as is the case when restarting the service - it fails because the previous instance has _not_ fully shut down yet. Ergo, it's asynchronous - it returns before actually having completed its work. That's where the problem stems from.

Secondly, there is simply no need to use rndc for this purpose at all in the stop() function. The current script is terribly over-engineered; just let start-stop-daemon take care of terminating the process unconditionally by dispatching a SIGTERM signal. As soon as the invocation of start-stop-daemon has returned control back to the script, named should have been terminated. This is how things are normally done.

Thirdly, in the reload() function we can simply dispatch a SIGHUP. No explanation should be needed here but, just to be crystal clear, the BIND manual states that it "causes the server to read named.conf and reload the database" which is exactly what we want and is the same approach adopted by many other daemons and their associated runscripts (see below).

Fourthly, it does not seem rational to use rndc *whatsoever* in the runscript. As Guido says, the "... init script framework is intended to mechanize the services on a concrete host itself." In other words, when one calls upon the named runscript on host A (for whatever purpose), one does *not* expect it to interfere with the operation of a named instance running on host B by way of a remote procedure call. To put it more succinctly, stop trying to babysit the sysadmin! If the sysadmin needs to utilise rndc to interact with a remote instance of named, then that is what he or she will do. The runscript is overstepping its territory here.

So, let's step back for a moment and consider where we were with "named.init-r7" originally. The patch I am about to attach applies against this version of the init script and fixes this bug as well as all of the above mentioned concerns, along with a few QA fixes:

1) Use start-stop-daemon in stop() unconditionally.
2) Ensure a separation of concerns in stop() by using independent ebegin/eend
   blocks for the termination and chroot dir cleanup phases. This also means
   that we really do check that bind has stopped.
3) Dispatch a SIGHUP signal in reload() rather than completely restarting.
4) Never try to start named from reload(). Instead, just report that it isn't
   running, if necessary.
5) If checkconfig returns a non-zero value then honour it. If we don't want it
   to be 1 for whatever reason then it can be addressed in one place, rather
   than having hard-coded values scattered hither and thither.
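
To illustrate points (3) and (4) without reproducing the attached patch, here is a minimal stand-alone sketch. reload_by_pidfile is a hypothetical helper and plain echo stands in for the runscript's reporting functions, so this shows the idea rather than the actual reload() from the patch:

```shell
#!/bin/sh
# Sketch only: reload_by_pidfile is a hypothetical helper, not the patched
# reload(). It signals a running daemon and never tries to start one.
reload_by_pidfile() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "not running (no PID file: $pidfile)" >&2
        return 1
    fi
    read -r pid < "$pidfile"
    # kill -0 checks that the process exists without signalling it
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        echo "not running (stale PID file: $pidfile)" >&2
        return 1
    fi
    # SIGHUP asks named to re-read named.conf and reload the database
    kill -HUP "$pid"
}
```

Called with the path that the runscript keeps in $PIDFILE, it either delivers the signal or reports that nothing is running and returns non-zero.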

Regarding (4), this change brings it in line with other scripts in gentoo such as syslog-ng, among others. Looking at various other init scripts, it seems quite clear that the reload action is intended merely to instruct the running daemon to adapt to configuration changes. Furthermore, there are other examples of reload() warning about a daemon that has not yet been started including syslog-ng, tenshi and lighttpd although I have to say that there is an obvious lack of consistency.

I was tempted to change more because there are other things in this runscript that could be problematic but I didn't want to stray too far from the scope of this particular bug (further nitpicks can be the subject of other bugs). The point is that the least 'intrusive' fix for this bug is merely to stop using rndc in the stop() function but I think that the other changes make sense in their own right.

Guido, what do you think of this patch?

Comment 12 kfm 2010-12-04 17:54:40 UTC

Created attachment 256329
Patch against named.init-r7 (as per comment 11)

Comment 13 kfm 2010-12-04 18:09:02 UTC

> Please consider, that you have to use the rndc-key of the *remote* host for
> this. If you really respect security, you have to use different keys for the
> named's on different hosts. This actually can only be managed in a reasonable
> way by using a configuration file and not by use of RNDC, please take a look to
> man rndc.conf

Just to add also that this is spot on. This is a problem domain that can only be rationally addressed through the direct usage of /etc/rndc.conf and the rndc application itself, not the runscript and /etc/bind/named.conf. This wheel doesn't need to be re-invented.

Comment 14 kfm 2010-12-04 18:23:47 UTC

Sorry, I meant to say /etc/conf.d/named (not /etc/bind/named.conf) in the previous comment. Hopefully, that was clear :)

Comment 15 kfm 2010-12-04 22:08:52 UTC

The runscript is deficient in so many areas that I am now in the process of completely revamping it (using a 'back to basics' approach). I will probably file it as an enhancement bug when done and it will contain the changes already discussed. Just as an example of how it will be improved, the checkconfig function will employ this sort of logic:

checkconfig() {
    # No need for cheesy hacks - named has a formal tool for
    # validating the configuration file
    checkconf="$(named-checkconf)"
    [[ $? -eq 0 ]] || eerror "${checkconf}"
}

That means that if anything is wrong, then the user actually gets to see why ...

# /etc/init.d/named start
 * Starting named ...
 * /etc/bind/named.conf:33: unknown option 'badoption'     [ !! ]
 * ERROR: named failed to start
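
The capture-and-report pattern above can be tried outside of a runscript with a small sketch. run_check is a hypothetical stand-in for checkconfig(), printf to stderr stands in for eerror, and merging stderr into the captured text is an assumption made so that a diagnostic is kept whichever stream the tool writes to:

```shell
#!/bin/sh
# Sketch only: run_check is a hypothetical stand-in for checkconfig().
run_check() {
    # Capture the tool's stdout and stderr; on failure show them and
    # hand the tool's own exit status back to the caller.
    output=$("$@" 2>&1) || {
        status=$?
        printf '%s\n' "$output" >&2
        return "$status"
    }
}

# In a runscript this might be used as: run_check named-checkconf || return $?
```

Returning the original status instead of a hard-coded value lets the caller decide in one place how to treat a broken configuration.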

Comment 16 Christian Ruppert (idl0r) gentoo-dev 2010-12-04 22:26:48 UTC

Did you take a look at the new init script yet?

Comment 17 kfm 2010-12-04 23:39:22 UTC

I just skimmed it. It's an improvement on the prior iteration [1] but I still think it's convoluted, sub-optimal [2] and that a fresh look is very much merited. Most importantly, it doesn't fully resolve this bug because rndc can still potentially be used in stop() on a local instance. There's no way that rndc can be used in a deterministic fashion here.

I stand by mine and Guido's previous remarks; the runscript should control a locally running instance of named only, in which case there's no need for rndc at all. The whole point of rndc is that it is a tool for remote administration. That it may also be usefully employed for administrative tasks on a local instance has no bearing on the remit of a runscript which is generally expected to start and stop processes locally via direct process control. Besides, even if that were not true, the implementation is not even particularly useful.

Further, the start() function can only ever start a local instance by definition, yet we have a situation where stop() may instead try to stop an instance on a remote system, depending on how conf.d/named is populated. This disparate behaviour is madness to me. If a user wants to start an instance on a remote system, there is ssh. If the user wants to stop an instance on a remote system there is ssh and also rndc but it's surely not the job of a runscript to effectively act as a (fundamentally flawed) proxy for controlling remote instances. It's a solution in search of a problem.

Anyway, I've said my piece. I'll file my revised script in a separate bug before the weekend is through and very much hope that you will agree that it makes sense.

[1] I regret that I did not read it fully first as I can now see that some of my criticisms had been addressed, although not necessarily in what I would say is the best way (e.g. I think my notion of how to use named-checkconf is more user-friendly as it conveys the error rather than displaying a generic "your config is broken" type message; named-checkconf only reports one error at a time anyway).

[2] I'm even partly to blame as I contributed that horrendous pid-file detection command years ago! Assuming it is even needed, it could be simplified to: named-checkconf -p | grep 'pid-file' | cut -d\" -f2
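
The pipeline from footnote [2] can be tried without a running BIND by feeding it sample text. In this sketch extract_pidfile is a hypothetical wrapper, and the configuration snippet (including its paths) is invented for illustration; on a real system the input would come from named-checkconf -p:

```shell
#!/bin/sh
# Sketch only: extract_pidfile is a hypothetical wrapper around the
# grep/cut pipeline; it reads configuration text on stdin.
extract_pidfile() {
    # Keep the pid-file line, then take the text between the first
    # pair of double quotes
    grep 'pid-file' | cut -d'"' -f2
}

# Invented sample input standing in for the output of: named-checkconf -p
printf 'options {\n\tdirectory "/var/bind";\n\tpid-file "/var/run/named/named.pid";\n};\n' |
    extract_pidfile
```

With the sample input this prints /var/run/named/named.pid.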

Comment 18 Guido Jäkel 2010-12-06 08:24:27 UTC

>Guido, what do you think of this patch?

Thank you, Kerin, I agree with you. I haven't looked at the technical details yet, but your points and solutions look clear and well substantiated to me.

I'm looking forward to your next script version.

Comment 19 Christian Ruppert (idl0r) gentoo-dev 2010-12-13 22:12:01 UTC

Should be fixed in bind-9.7.2_p3-r2 and bind-9.6.2_p3-r2.