Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 231854 - sys-apps/openrc stop-start-daemon --pid check is broken when starting services
Summary: sys-apps/openrc stop-start-daemon --pid check is broken when starting services
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Hosted Projects
Classification: Unclassified
Component: OpenRC (show other bugs)
Hardware: All Linux
: High major (vote)
Assignee: OpenRC Team
URL:
Whiteboard:
Keywords:
: 359061 369217 402837 431180 (view as bug list)
Depends on:
Blocks:
 
Reported: 2008-07-15 10:43 UTC by Jan Hudoba
Modified: 2021-09-23 09:06 UTC (History)
4 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
stale-pidfiles.patch (stale-pidfiles.patch,446 bytes, text/plain)
2011-08-06 18:05 UTC, William Hubbs
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jan Hudoba 2008-07-15 10:43:34 UTC
bootmisc in conjunction with start-stop-daemon is not removing pid files of really not running services

Reproducible: Sometimes

Steps to Reproduce:
1. start system
2. system crash / power outage
3. start system

Actual Results:  
some services does not start:
- because of existing pid file OR
- because of bad running state detection

Expected Results:  
remove pid files of realy not running services



bootmisc in baselayut-1 is working without these problems

*********
baselayout-1: /etc/init.d/bootmisc:55

  # Do not remove pidfiles of already running daemons
                        if [[ -z $(ps --no-heading -C "${daemon}") ]] ; then
                                if [[ -f ${x} || -L ${x} ]] ; then
                                        rm -f "${x}"
                                fi
                        fi


*********
baselayout-2: /etc/init.d/bootmisc:91

                do
                        [ ! -f "${x}" ] && continue
                        # Do not remove pidfiles of already running daemons
                        case "${x}" in
                                *.pid)
                                        start-stop-daemon --test --quiet \
                                        --stop --pidfile "${x}" && continue
                                ;;
                        esac
                        rm -f -- "${x}"
                done


*************************************************
examples of bad baselayout-2 results:

script: /etc/init.d/sshd (but it is random service, so every init script)
output: bash - console commands, trying to solve problem
time: after system crash and startup

host ~ # /etc/init.d/sshd start
 * Starting sshd ... * start-stop-daemon: /usr/sbin/sshd is already running
 [ !! ]
 * ERROR: sshd failed to start
host ~ # cat /var/run/sshd.pid
3682

# find running process with pid 3682 (it is kernel process)
host ~ # ps auxwf |grep 3682 |grep -v grep
root      3682  0.0  0.0      0     0 ?        S<   22:50   0:00  \_ [user_dlm]

# start-stop-daemon is still thinking that he can not run ssh 
# because it is running
host ~ # /etc/init.d/sshd zap start
 * Manually resetting sshd to stopped state
 * Starting sshd ... * start-stop-daemon: /usr/sbin/sshd is already running
 [ !! ]
 * ERROR: sshd failed to start

host~ # rm /var/run/sshd.pid
host~ # /etc/init.d/sshd zap start
 * Manually resetting sshd to stopped state
 * Starting sshd ... [ ok ]


*************************************************
example of script that is doing pid checking by itself (not with start-stop-daemon), but can not run too, because it dont remove pid for itself:

script: /etc/init.d/amavisd 
output: syslog-ng log file
time: after system crash and startup

Jul 14 13:59:46 host amavis[3224]: Ignoring stale PID file /var/amavis/amavisd.pid, older than system uptime 0 0:00:00
Jul 14 13:59:46 host amavis[3224]: starting.  /usr/sbin/amavisd at mail.autoolymp.sk amavisd-new-2.5.2 (20070627), Unicode aware
Jul 14 13:59:47 host amavis[3224]: (!)Net::Server: 2008/07/14-13:59:47 Pid_file already exists for running process (3223)... aborting\n\n  at line 277 in file /usr/lib64/perl5/vendor_perl/5.8.8/Net/Server.pm
Jul 14 13:59:47 host amavis[3224]: Net::Server: 2008/07/14-13:59:47 Server closing!
Jul 14 13:59:47 host /etc/init.d/amavisd[3223]: start-stop-daemon: /usr/sbin/amavisd died
Jul 14 13:59:47 host /etc/init.d/amavisd[3215]: ERROR: amavisd failed to start
Comment 1 Roy Marples 2008-08-20 07:51:14 UTC
baselayout-1 was buggy in this regard because it checked that the process command was running matched the name of the pidfile (not always the case). It didn't actually check if the pid matched either.

OpenRC just checks that the pid is still valid, so it is a better check. But as you have pointed out, it's not enirely foolproof either. Still, it's better than baselayout-1.

Maybe we should try and test uptime vs pidfile age - but that relies on a working clock and ntp would not have started at this point so it's a bit of a catch 22. I'm not sure if we can portably work out if it's a kernel process or not which I think is the only other thing we can do.

Another thing we could do is always clean /var/run when booting for the first time.
Comment 2 Doug Goldstein (RETIRED) gentoo-dev 2008-10-06 19:06:17 UTC
Any word here Roy?
Comment 3 Roy Marples 2008-10-06 20:14:44 UTC
(In reply to comment #2)
> Any word here Roy?

Nope, not yet.
I don't regard this as a critical issue to fix. What openrc does now is better (imo) than baselayout-1, but is still not the Best We Can Do.
Comment 4 Ulrich Müller gentoo-dev 2008-11-02 23:56:26 UTC
Not exactly the same issue, but bootmisc (from openrc-0.3.0-r1) is removing .keep* files from /var/run/ subdirectories. For example:

   $ qlist dbus | grep /var/run
   /var/run/dbus/.keep_sys-apps_dbus-0
   $ ls -a /var/run/dbus/
   .  ..  system_bus_socket
Comment 5 SpanKY gentoo-dev 2008-11-16 13:28:59 UTC
init.d scripts should be handling missing dirs in /var/run ... in other words, you should be able to do `rm -rf /var/run/*` and have any init.d script still work properly.  if it doesnt, fix the init.d script to create any dirs it needs.
Comment 6 Ulrich Müller gentoo-dev 2008-11-16 13:43:28 UTC
So you would consider "keepdir /var/run/foo" an error?
Comment 7 SpanKY gentoo-dev 2008-11-16 14:17:36 UTC
i wouldnt go that far (yet?).  init.d scripts have to be able to recover from /var scrubbing as users themselves tend to do this.  telling someone they need to re-emerge a package just so it'll do `mkdir /var/run/.../` is awful.
Comment 8 William Hubbs gentoo-dev 2011-01-24 16:25:09 UTC
Re-assigning to openrc since this is an openrc issue.

I agree with Roy in that this is not critical so it should not block stabilization.
Comment 9 Xake 2011-04-11 09:56:02 UTC
Why do we not mount /var/run and /var/lock as tmpfs by default?
Or maybe at least officially support a volatile /var/run?
More and more other distributions goes towards this for a number of reasons including:
* it does not left stalled files behind between reboots (the original reason for this bug)
* it removes the need for small reads/writes from/to harddrive for information that does not need to survive between reboots (possible power-saving and faster disk respons).

To be honest, currently the only thing we have in /var/run that are not supposed to survive between reboots are directories made by ebuilds, and the initscripts could just as easily do those themselves...
Comment 10 SpanKY gentoo-dev 2011-04-12 16:24:26 UTC
i already said that init.d scripts should be fixed to run `checkpath` on any dirs they require in /var/{run,lib,cache}/.../
Comment 11 Jeremy Olexa (darkside) (RETIRED) archtester gentoo-dev Security 2011-04-12 19:19:06 UTC
For fixing the individual packages, the tracker bug is Bug 332633
Comment 12 William Hubbs gentoo-dev 2011-04-26 21:44:37 UTC
Once we get baselayout-2 with /run support, /var/run will be linked to /run and /var/lock will be linked to /run/lock.

This will be a tmpfs, so it will be empty every time we reboot.
Comment 13 SpanKY gentoo-dev 2011-05-16 21:57:59 UTC
that doesn't really have any bearing on this.  ssd's --pid check is broken when starting services.

# /etc/init.d/sshd stop
# echo 1 > /var/run/sshd.pid
# /etc/init.d/sshd start
 * Starting sshd ...
 * start-stop-daemon: /usr/sbin/sshd is already running                   [ !! ]
 * ERROR: sshd failed to start
Comment 14 SpanKY gentoo-dev 2011-05-30 23:04:22 UTC
*** Bug 369217 has been marked as a duplicate of this bug. ***
Comment 15 SpanKY gentoo-dev 2011-08-06 10:11:44 UTC
*** Bug 359061 has been marked as a duplicate of this bug. ***
Comment 16 Ed Wildgoose 2011-08-06 16:57:07 UTC
Note cleaning /var/run at startup doesn't really fix anything.  If you need to restart a crashed daemon while the machine is running then you have a stale pid in /var/run.  Consequence is that you have an unclean /var/run and the init file needs to deal with this.

For sure zapping /var/run probably gets rid of 90% of visible cases of stale pid files, but it's not a solution to zapping 100% of them.  The correct solution is to ensure that the init file can deal with stale pid files lying around - anything less just means that you eventually paint yourself into a corner where the init script won't restart (unless you reboot or manually clean /var/run)

The daemon itself should normally be capable of dealing with stale pid files.  If it can't then the correct idiom is:

    if ! start-stop-daemon --test --quiet --stop --pidfile ${pidfile}; then
        rm -f ${pidfile}
    fi

I have opened one outstanding bug on this for dbus, probably there are others that need such a fixup
Comment 17 William Hubbs gentoo-dev 2011-08-06 18:05:49 UTC
Created attachment 282349 [details]
stale-pidfiles.patch

Would this patch to openrc's template start function be desirable also?
Comment 18 Ed Wildgoose 2011-08-07 07:52:56 UTC
Hmm, I haven't tested, but I guess that seems preferable to me?  Any developer want to comment?
Comment 19 Ed Wildgoose 2011-08-08 12:30:48 UTC
I think we need *some fix* here, so please comment?

On further reflection the suggested fix has some aspects worth bearing in mind:

- Relies on an undocumented variable $pidfile to work
- Surprise side-effect if someone sets $pidfile to something that isn't usable correctly to start-stop-daemon
- init.d files still need to be updated to set $pidfile correctly for affected ebuilds (dbus for example)

It still seems like progress, but worth bearing in mind the above?
Comment 20 William Hubbs gentoo-dev 2011-08-08 17:15:11 UTC
(In reply to comment #19)
> I think we need *some fix* here, so please comment?

No other devs are objecting, so I'll probably go ahead with this fix in the next 12/24 hours.

> On further reflection the suggested fix has some aspects worth bearing in mind:
> 
> - Relies on an undocumented variable $pidfile to work

This variable is documented, see man runscript.

> - Surprise side-effect if someone sets $pidfile to something that isn't usable
> correctly to start-stop-daemon

Again, this variable is documented in man runscript, so folks should set it as stated there.

> - init.d files still need to be updated to set $pidfile correctly for affected
> ebuilds (dbus for example)

There are several updates to init.d files that need to happen to make them fully utilize openrc, but that doesn't really affect this fix.
Comment 21 William Hubbs gentoo-dev 2011-08-09 21:43:18 UTC
Hi all,
(In reply to comment #16)
>     if ! start-stop-daemon --test --quiet --stop --pidfile ${pidfile}; then
>         rm -f ${pidfile}
>     fi

I looked over the documentation again and found that this will never be false, so the pid file will never be removed. Here is the description of the --test option from the start-stop-daemon man page.

-t, --test
Print the action(s) that would be taken, but don't actually
do anything.  The return value is set as if the command was taken and worked.

That means that your if condition will always be successful, so the pid
will not be deleted. This also applies to my proposed fix, so I don't
think we can use either one.

> I have opened one outstanding bug on this for dbus, probably there are others
> that need such a fixup

Given this info, this fix will not repair dbus.
Comment 22 Ed Wildgoose 2011-08-10 22:38:40 UTC
I hope it's just the man page which is wrong?  The debian start-stop-daemon sets the result based as we might expect, eg:
    http://man.cx/start-stop-daemon%288%29
    http://www.busybox.net/downloads/BusyBox.html

It certainly currently works as expected and it's an idiom used in a number of init scripts already, eg see /etc/init.d/bootmisc itself... 

As such I think the solution is valid?
Comment 23 William Hubbs gentoo-dev 2011-08-11 01:00:25 UTC
(In reply to comment #22)
> It certainly currently works as expected and it's an idiom used in a number of
> init scripts already, eg see /etc/init.d/bootmisc itself... 
> 
> As such I think the solution is valid?

All --quiet --test tests for is whether there is a process in the system with the pid that is in the pid file. It does not attempt to verify that the process is the one we think it is. So, technically I'm not sure that test is complete.
Comment 24 Ed Wildgoose 2011-08-11 08:11:47 UTC
Sure, but it's not clear that we can do better?  If you see my bug 359061 this is an example of where it's needed today. Spanky, not unreasonably doesn't want individual init.d entries each customised to check for stale pids, yet we have an example here of a broken daemon that doesn't delete it's pid when stopping, yet won't start if there is a stale pid present (durr..).  What to do?

It's hard to see that your change makes things *worse* than they are today:

- If a daemon crashes, some other process is lucky enough to grab the same pid, then we try and restart. Pid file is now not deleted and we might have a problem. Doesn't seem worse than the status quo though

- If we try and "stop", stop claims we are stopped, but daemon is slow and still shutting down. When we try and start we should still see our process is alive, therefore don't delete the pid, therefore hopefully start fails in some sensible way later because stop didn't really work correctly. No change to status quo

- I'm struggling to see where a) openrc thinks the daemon is down, b) daemon is not really down, c) pidfile doesn't list a running process id, d) deleting it then causes a problem?

I'm not claiming it's fault free, but I  can't see how it will cause a problem right now? Someone smarter than me want to suggest a problem?
Comment 25 SpanKY gentoo-dev 2011-08-20 05:16:27 UTC
*** Bug 359061 has been marked as a duplicate of this bug. ***
Comment 26 Christian Ruppert (idl0r) gentoo-dev 2011-11-27 17:22:25 UTC
We can even make /var/run a tmpfs without the /run stuff.
And we should do that soon.
Comment 27 SpanKY gentoo-dev 2011-11-27 18:05:27 UTC
i don't think we need to as we're in the process of transitioning to /run now
Comment 28 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2012-02-09 10:05:52 UTC
This will be obsoleted by /run soon.
Comment 29 William Hubbs gentoo-dev 2012-02-09 22:28:29 UTC
*** Bug 402837 has been marked as a duplicate of this bug. ***
Comment 30 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2012-02-10 18:12:34 UTC
Reopening. While we aren't doing the /var/run wiping, we should still maybe be validating what is in pidfiles at service start time, if only to check if they are completely out of date.

1. Is pidfile older than last boot time? Remove
2. Valid what's running at given pids?
Comment 31 Vladimir Berezhnoy 2012-02-11 13:28:57 UTC
Similar problem with mixing process from inside vz containers. In my case most contaners have /usr/sbin/crond process running and check against them succeeds while it shouldn't.


s linux # /etc/init.d/vixie-cron start
 * Starting vixie-cron ...
 * start-stop-daemon: /usr/sbin/cron is already running                                                                                                                                           [ !! ]
 * ERROR: vixie-cron failed to start
[1] s linux # /etc/init.d/vz stop
 * Shutting down CT 126 ...
 * Shutting down CT 125 ...
 * Shutting down CT 124 ...
 * Shutting down CT 123 ...
 * Shutting down CT 120 ...
 * Shutting down CT 115 ...
 * Shutting down CT 114 ...
 * Shutting down CT 112 ...
 * Shutting down CT 111 ...
 * Bringing down interface venet0 ...                                                                                                                                                             [ ok ]
s linux # /etc/init.d/vixie-cron start
 * Starting vixie-cron ...                                                                                                                                                                        [ ok ]
Comment 32 Patrick McLean gentoo-dev 2012-04-25 21:21:50 UTC
(In reply to comment #30) 
> 1. Is pidfile older than last boot time? Remove
> 2. Valid what's running at given pids?

These sound like reasonable checks to me, I would be willing to write a patch for this if there is a chance of it being accepted.
Comment 33 Christian Ruppert (idl0r) gentoo-dev 2012-04-26 06:20:03 UTC
(In reply to comment #32)
> (In reply to comment #30) 
> > 1. Is pidfile older than last boot time? Remove
> > 2. Valid what's running at given pids?
> 
> These sound like reasonable checks to me, I would be willing to write a
> patch for this if there is a chance of it being accepted.

I am pretty sure we would like that :)
Comment 34 SpanKY gentoo-dev 2012-08-16 17:49:29 UTC
*** Bug 431180 has been marked as a duplicate of this bug. ***
Comment 35 Benda Xu gentoo-dev 2012-08-17 05:23:13 UTC
Tracking a background process with a pid file is intrinsically unreliable. Refer to e.g. 

http://mywiki.wooledge.org/ProcessManagement

FYI
There is an ongoing effort to let OpenRC talk to another service supervisor:

http://wpr.cz/ccx/wobsite/article/openrc-supervisor.html
http://www.awa.tohoku.ac.jp/~benda/projects/runit.html
Comment 36 Jan Hudoba 2021-09-23 09:06:26 UTC
resolved