Summary: | sys-apps/openrc stop-start-daemon --pid check is broken when starting services | ||
---|---|---|---|
Product: | Gentoo Hosted Projects | Reporter: | j4Hu <bugs> |
Component: | OpenRC | Assignee: | OpenRC Team <openrc> |
Status: | RESOLVED FIXED | ||
Severity: | major | CC: | chutzpah, gentoo, nikoli, non7top |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | stale-pidfiles.patch |
Description
j4Hu
2008-07-15 10:43:34 UTC
baselayout-1 was buggy in this regard because it checked that the process command was running matched the name of the pidfile (not always the case). It didn't actually check if the pid matched either. OpenRC just checks that the pid is still valid, so it is a better check. But as you have pointed out, it's not enirely foolproof either. Still, it's better than baselayout-1. Maybe we should try and test uptime vs pidfile age - but that relies on a working clock and ntp would not have started at this point so it's a bit of a catch 22. I'm not sure if we can portably work out if it's a kernel process or not which I think is the only other thing we can do. Another thing we could do is always clean /var/run when booting for the first time. Any word here Roy? (In reply to comment #2) > Any word here Roy? Nope, not yet. I don't regard this as a critical issue to fix. What openrc does now is better (imo) than baselayout-1, but is still not the Best We Can Do. Not exactly the same issue, but bootmisc (from openrc-0.3.0-r1) is removing .keep* files from /var/run/ subdirectories. For example: $ qlist dbus | grep /var/run /var/run/dbus/.keep_sys-apps_dbus-0 $ ls -a /var/run/dbus/ . .. system_bus_socket init.d scripts should be handling missing dirs in /var/run ... in other words, you should be able to do `rm -rf /var/run/*` and have any init.d script still work properly. if it doesnt, fix the init.d script to create any dirs it needs. So you would consider "keepdir /var/run/foo" an error? i wouldnt go that far (yet?). init.d scripts have to be able to recover from /var scrubbing as users themselves tend to do this. telling someone they need to re-emerge a package just so it'll do `mkdir /var/run/.../` is awful. Re-assigning to openrc since this is an openrc issue. I agree with Roy in that this is not critical so it should not block stabilization. Why do we not mount /var/run and /var/lock as tmpfs by default? Or maybe at least officially support a volatile /var/run? More and more other distributions goes towards this for a number of reasons including: * it does not left stalled files behind between reboots (the original reason for this bug) * it removes the need for small reads/writes from/to harddrive for information that does not need to survive between reboots (possible power-saving and faster disk respons). To be honest, currently the only thing we have in /var/run that are not supposed to survive between reboots are directories made by ebuilds, and the initscripts could just as easily do those themselves... i already said that init.d scripts should be fixed to run `checkpath` on any dirs they require in /var/{run,lib,cache}/.../ For fixing the individual packages, the tracker bug is Bug 332633 Once we get baselayout-2 with /run support, /var/run will be linked to /run and /var/lock will be linked to /run/lock. This will be a tmpfs, so it will be empty every time we reboot. that doesn't really have any bearing on this. ssd's --pid check is broken when starting services. # /etc/init.d/sshd stop # echo 1 > /var/run/sshd.pid # /etc/init.d/sshd start * Starting sshd ... * start-stop-daemon: /usr/sbin/sshd is already running [ !! ] * ERROR: sshd failed to start *** Bug 369217 has been marked as a duplicate of this bug. *** *** Bug 359061 has been marked as a duplicate of this bug. *** Note cleaning /var/run at startup doesn't really fix anything. If you need to restart a crashed daemon while the machine is running then you have a stale pid in /var/run. Consequence is that you have an unclean /var/run and the init file needs to deal with this. For sure zapping /var/run probably gets rid of 90% of visible cases of stale pid files, but it's not a solution to zapping 100% of them. The correct solution is to ensure that the init file can deal with stale pid files lying around - anything less just means that you eventually paint yourself into a corner where the init script won't restart (unless you reboot or manually clean /var/run) The daemon itself should normally be capable of dealing with stale pid files. If it can't then the correct idiom is: if ! start-stop-daemon --test --quiet --stop --pidfile ${pidfile}; then rm -f ${pidfile} fi I have opened one outstanding bug on this for dbus, probably there are others that need such a fixup Created attachment 282349 [details]
stale-pidfiles.patch
Would this patch to openrc's template start function be desirable also?
Hmm, I haven't tested, but I guess that seems preferable to me? Any developer want to comment? I think we need *some fix* here, so please comment? On further reflection the suggested fix has some aspects worth bearing in mind: - Relies on an undocumented variable $pidfile to work - Surprise side-effect if someone sets $pidfile to something that isn't usable correctly to start-stop-daemon - init.d files still need to be updated to set $pidfile correctly for affected ebuilds (dbus for example) It still seems like progress, but worth bearing in mind the above? (In reply to comment #19) > I think we need *some fix* here, so please comment? No other devs are objecting, so I'll probably go ahead with this fix in the next 12/24 hours. > On further reflection the suggested fix has some aspects worth bearing in mind: > > - Relies on an undocumented variable $pidfile to work This variable is documented, see man runscript. > - Surprise side-effect if someone sets $pidfile to something that isn't usable > correctly to start-stop-daemon Again, this variable is documented in man runscript, so folks should set it as stated there. > - init.d files still need to be updated to set $pidfile correctly for affected > ebuilds (dbus for example) There are several updates to init.d files that need to happen to make them fully utilize openrc, but that doesn't really affect this fix. Hi all, (In reply to comment #16) > if ! start-stop-daemon --test --quiet --stop --pidfile ${pidfile}; then > rm -f ${pidfile} > fi I looked over the documentation again and found that this will never be false, so the pid file will never be removed. Here is the description of the --test option from the start-stop-daemon man page. -t, --test Print the action(s) that would be taken, but don't actually do anything. The return value is set as if the command was taken and worked. That means that your if condition will always be successful, so the pid will not be deleted. This also applies to my proposed fix, so I don't think we can use either one. > I have opened one outstanding bug on this for dbus, probably there are others > that need such a fixup Given this info, this fix will not repair dbus. I hope it's just the man page which is wrong? The debian start-stop-daemon sets the result based as we might expect, eg: http://man.cx/start-stop-daemon%288%29 http://www.busybox.net/downloads/BusyBox.html It certainly currently works as expected and it's an idiom used in a number of init scripts already, eg see /etc/init.d/bootmisc itself... As such I think the solution is valid? (In reply to comment #22) > It certainly currently works as expected and it's an idiom used in a number of > init scripts already, eg see /etc/init.d/bootmisc itself... > > As such I think the solution is valid? All --quiet --test tests for is whether there is a process in the system with the pid that is in the pid file. It does not attempt to verify that the process is the one we think it is. So, technically I'm not sure that test is complete. Sure, but it's not clear that we can do better? If you see my bug 359061 this is an example of where it's needed today. Spanky, not unreasonably doesn't want individual init.d entries each customised to check for stale pids, yet we have an example here of a broken daemon that doesn't delete it's pid when stopping, yet won't start if there is a stale pid present (durr..). What to do? It's hard to see that your change makes things *worse* than they are today: - If a daemon crashes, some other process is lucky enough to grab the same pid, then we try and restart. Pid file is now not deleted and we might have a problem. Doesn't seem worse than the status quo though - If we try and "stop", stop claims we are stopped, but daemon is slow and still shutting down. When we try and start we should still see our process is alive, therefore don't delete the pid, therefore hopefully start fails in some sensible way later because stop didn't really work correctly. No change to status quo - I'm struggling to see where a) openrc thinks the daemon is down, b) daemon is not really down, c) pidfile doesn't list a running process id, d) deleting it then causes a problem? I'm not claiming it's fault free, but I can't see how it will cause a problem right now? Someone smarter than me want to suggest a problem? *** Bug 359061 has been marked as a duplicate of this bug. *** We can even make /var/run a tmpfs without the /run stuff. And we should do that soon. i don't think we need to as we're in the process of transitioning to /run now This will be obsoleted by /run soon. *** Bug 402837 has been marked as a duplicate of this bug. *** Reopening. While we aren't doing the /var/run wiping, we should still maybe be validating what is in pidfiles at service start time, if only to check if they are completely out of date. 1. Is pidfile older than last boot time? Remove 2. Valid what's running at given pids? Similar problem with mixing process from inside vz containers. In my case most contaners have /usr/sbin/crond process running and check against them succeeds while it shouldn't. s linux # /etc/init.d/vixie-cron start * Starting vixie-cron ... * start-stop-daemon: /usr/sbin/cron is already running [ !! ] * ERROR: vixie-cron failed to start [1] s linux # /etc/init.d/vz stop * Shutting down CT 126 ... * Shutting down CT 125 ... * Shutting down CT 124 ... * Shutting down CT 123 ... * Shutting down CT 120 ... * Shutting down CT 115 ... * Shutting down CT 114 ... * Shutting down CT 112 ... * Shutting down CT 111 ... * Bringing down interface venet0 ... [ ok ] s linux # /etc/init.d/vixie-cron start * Starting vixie-cron ... [ ok ] (In reply to comment #30) > 1. Is pidfile older than last boot time? Remove > 2. Valid what's running at given pids? These sound like reasonable checks to me, I would be willing to write a patch for this if there is a chance of it being accepted. (In reply to comment #32) > (In reply to comment #30) > > 1. Is pidfile older than last boot time? Remove > > 2. Valid what's running at given pids? > > These sound like reasonable checks to me, I would be willing to write a > patch for this if there is a chance of it being accepted. I am pretty sure we would like that :) *** Bug 431180 has been marked as a duplicate of this bug. *** Tracking a background process with a pid file is intrinsically unreliable. Refer to e.g. http://mywiki.wooledge.org/ProcessManagement FYI There is an ongoing effort to let OpenRC talk to another service supervisor: http://wpr.cz/ccx/wobsite/article/openrc-supervisor.html http://www.awa.tohoku.ac.jp/~benda/projects/runit.html resolved |