Bug 172442 - baselayout-1.12.9 and apache-2.0.59-r2: start-stop-daemon won't stop process, maybe due to prelink
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] baselayout
Hardware: All Linux
Importance: High normal
Assignee: Gentoo's Team for Core System packages
Reported: 2007-03-27 18:34 UTC by Martin von Gagern
Modified: 2007-05-16 14:04 UTC

Description Martin von Gagern 2007-03-27 18:34:13 UTC
Several times now I have been unable to stop apache using the init script. I've finally taken the time to investigate, and here is what I found. I'm not sure where the bug actually lies.

[1]# /etc/init.d/apache2 stop
 * Stopping apache2 ...
No /usr/sbin/apache2 found running; none killed.  [ !! ]

[2]# grep KILL /etc/init.d/apache2
 /sbin/start-stop-daemon --stop --retry -TERM/5/-TERM/5/-KILL/5 --exec ${APACHE2} --pidfile /var/run/apache2.pid

[3]# cat /var/run/apache2.pid
6067

[4]# ps -C apache2 -o pid,command
  PID COMMAND
 6067 /usr/sbin/apache2 -D SSL ...
 6068 /usr/sbin/apache2 -D SSL ...
 6205 /usr/sbin/apache2 -D SSL ...
 6207 /usr/sbin/apache2 -D SSL ...

[5]# /sbin/start-stop-daemon --stop --retry -TERM/5/-TERM/5/-KILL/5 \
     --exec /usr/sbin/apache2 --pidfile /var/run/apache2.pid
No /usr/sbin/apache2 found running; none killed.

[6]# ls -l /proc/6067/exe
lrwxrwxrwx 1 root root 0 Mar 27 19:11 /proc/6067/exe -> /usr/sbin/apache2

[7]# ltrace /sbin/start-stop-daemon --stop --retry -TERM/5/-TERM/5/-KILL/5 \
     --exec /usr/sbin/apache2 --pidfile /var/run/apache2.pid
...
fopen("/var/run/apache2.pid", "r")                                 = 0x804d040
fscanf(0x804d040, 0x804acdc, 0xbfa3116c, 0, 0)                     = 1
sprintf("/proc/6067/exe", "/proc/%d/exe", 6067)                    = 14
__xstat(3, "/proc/6067/exe", 0xbfa310c4)                           = 0
fclose(0x804d040)                                                  = 0
printf("No %s found running; none killed"..., "/usr/sbin/apache2"
No /usr/sbin/apache2 found running; none killed.
       )                                                           = 49
exit(1 <unfinished ...>
+++ exited (status 1) +++

[8]# less baselayout-1.12.9/src/start-stop-daemon.c
...
pid_is_exec(pid_t pid, const struct stat *esb)
{
        struct stat sb;
        char buf[32];

        sprintf(buf, "/proc/%d/exe", pid);
        if (stat(buf, &sb) != 0)
                return 0;
        return (sb.st_dev == esb->st_dev && sb.st_ino == esb->st_ino);
}
...

[9]# stat -L /usr/sbin/apache2 /proc/6067/exe
  File: `/usr/sbin/apache2'
  Size: 336640     Blocks: 672        IO Block: 4096   regular file
Device: 801h/2049d Inode: 1053039     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-03-27 20:00:33.000000000 +0200
Modify: 2006-11-06 16:01:10.000000000 +0100
Change: 2007-03-26 03:21:14.000000000 +0200
  File: `/proc/6067/exe'
  Size: 336640     Blocks: 672        IO Block: 4096   regular file
Device: 801h/2049d Inode: 1048946     Links: 0
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-03-27 19:34:33.000000000 +0200
Modify: 2006-11-06 16:01:10.000000000 +0100
Change: 2007-03-26 03:21:14.000000000 +0200

So what do these steps tell us?
[1] The described symptom: I'm unable to stop apache the usual way
[2] This is the command the apache init script actually executes, so we can reproduce that
[3] This is the pid of the first apache process, used by start-stop-daemon
[4] This list shows that apache is indeed still running
[5] Reproducing the same command line as the init script gives the same result
[6] On the file system level the binary seems to match
[7] Almost the last thing start-stop-daemon does is stat /proc/PID/exe
[8] start-stop-daemon compares device and inode numbers not path names
[9] Mysteriously the two files have different inode numbers
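
The comparison from steps [8] and [9] can be approximated from the shell. This is only a sketch: `same_file_as_exe` is a hypothetical helper name, and it assumes GNU `stat` with `-c` and a Linux `/proc`.

```shell
# Hypothetical helper mirroring the device/inode comparison in pid_is_exec().
# Usage: same_file_as_exe PID /path/to/binary
same_file_as_exe() {
    pid=$1
    binary=$2
    # -L follows symlinks (stat() rather than lstat()); %d:%i is device:inode.
    running=$(stat -L -c '%d:%i' "/proc/$pid/exe" 2>/dev/null) || return 1
    on_disk=$(stat -L -c '%d:%i' "$binary" 2>/dev/null) || return 1
    [ "$running" = "$on_disk" ]
}
```

With the values from this report, `same_file_as_exe 6067 /usr/sbin/apache2` would be expected to fail, since the inode numbers in step [9] differ.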

Now I'm really puzzled. How can it be that a file and a symlink to that same file have different inode numbers? Can it be that the corresponding binary has been modified in the meantime and by some magic the proc file system still points at the old binary from which the process was started?

If that were the case, then maybe prelinking has something to do with that; after all prelinking regularly randomizes the layout of libraries in memory, so it might be possible that this file (and many many others) change inode number every week. I don't know enough to be certain, but that seems the most likely reason to me, as the file modification time is way in the past, and that agrees with the portage logs that mention no update of apache since last year. My system has been up now for about 5 days, and the last full prelink run was yesterday.

If my assumptions are correct so far, what should be done about it?
1. Don't check binary, only PID
2. Modify start-stop-daemon to use readlink and compare paths
3. Some huge effort involving inode number logging in prelink
I'm not sure which approach is the best in terms of security and feasibility.
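
Option 2 could look roughly like this when sketched in shell rather than in start-stop-daemon's C code. `exe_path_is` is a hypothetical name, and the sketch assumes a Linux `/proc`.

```shell
# Hypothetical helper for option 2: compare path names instead of inodes.
# Usage: exe_path_is PID /path/to/binary
exe_path_is() {
    pid=$1
    expected=$2
    # readlink prints the path recorded for the binary the process was started from.
    actual=$(readlink "/proc/$pid/exe" 2>/dev/null) || return 1
    [ "$actual" = "$expected" ]
}
```

A plain string comparison like this only works if the expected path is written exactly as the link reports it, so a path given via a symlinked directory would not match.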

I now remember that my squid used to take ages to shut down if it had been up for some time, although it terminated almost immediately if it had been running only a short time. The reason might be the same: it had to reach some timeout if there was a full prelink run between starting and stopping the daemon.
Comment 1 Martin von Gagern 2007-03-27 19:00:32 UTC
I just replaced "--exec ${APACHE2}" with "--name ${APACHE2##*/}" in stop and reload, and with this stopping apache works, and after starting it again the inodes do match.
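
For readers unfamiliar with the expansion used here, a minimal sketch of what `${APACHE2##*/}` evaluates to. The helper name `daemon_name_of` is made up for illustration. The value of `APACHE2` is taken from the transcript above.

```shell
# Hypothetical helper: derive the process name that --name would be given.
daemon_name_of() {
    # ${1##*/} removes the longest prefix ending in "/", leaving the file name.
    printf '%s\n' "${1##*/}"
}

APACHE2=/usr/sbin/apache2
daemon_name_of "$APACHE2"    # prints: apache2
```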

"grep -lEe '--stop.*--exec' /etc/init.d/*" lists several other scripts that would probably need similar adjustments, if that should become the official solution.
Comment 2 SpanKY gentoo-dev 2007-03-28 02:32:48 UTC
umm, symlinks are contained in inodes all by themselves, so it's not unusual at all for a file and a symlink to that file to have different inodes

the point of the stat() is to dereference the symlink ... that's why it is stat() and not lstat()

giving --exec the full path should work just fine ...
Comment 3 Roy Marples (RETIRED) gentoo-dev 2007-03-28 07:52:37 UTC
Does the pid in the pidfile match the pid of the processes?
Comment 4 Martin von Gagern 2007-03-28 08:57:10 UTC
(In reply to comment #2)
> umm, symlinks are contained in inodes all by themselves, so it's not unusual
> at all for a file and a symlink to that file to have different inodes

Yes, but it's the file and the "symlink" target that have different inodes!

> the point of the stat() is to dereference the symlink ... that's why it is
> stat() and not lstat()

That's why I provided -L to stat. I even checked the coreutils sources, and this switch causes the stat tool to use the stat() call instead of lstat() - just the way start-stop-daemon does.
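
The difference between the two calls can be seen with an ordinary symlink. This is a small sketch assuming GNU `stat`. The file names are arbitrary.

```shell
demo=$(mktemp -d)
echo data > "$demo/target"
ln -s target "$demo/link"

target_ino=$(stat -c '%i' "$demo/target")
link_ino=$(stat -c '%i' "$demo/link")          # lstat(): the symlink's own inode
followed_ino=$(stat -L -c '%i' "$demo/link")   # stat(): the inode of the target

echo "target=$target_ino link=$link_ino followed=$followed_ino"
rm -rf "$demo"
```

Here the symlink's own inode differs from the target's, but the value obtained with -L equals the target's inode. That is the situation comment #2 describes, and not what step [9] shows.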

> giving --exec the full path should work just fine ...

Should, but does not.

Steps to reproduce, provided your sleep binary is prelinked:
1. cp /bin/sleep .
2. /sbin/start-stop-daemon -Svbx $PWD/sleep 5m -mp $PWD/sleep.pid
3. /usr/sbin/prelink -u ./sleep
4. ls -l /proc/$(<sleep.pid)/exe
5. stat -L $PWD/sleep /proc/$(<sleep.pid)/exe
6. /sbin/start-stop-daemon -Kvx $PWD/sleep 5m -p $PWD/sleep.pid
7. /sbin/start-stop-daemon -Kvn sleep 5m -p $PWD/sleep.pid

(In reply to comment #3)
> Does the pid in the pidfile match the pid of the processes?

Yes, that's pid 6067 in steps 3 and 4 of my original post.
Comment 5 Roy Marples (RETIRED) gentoo-dev 2007-03-28 09:30:17 UTC
I cannot reproduce this with a prelinked sleep or a normal sleep with start-stop-daemon from baselayout-1.12.9, 1.13.0_alpha12 and the completely re-written baselayout-2 version.

FWIW, baselayout-2 s-s-d no longer checks inodes, but for other reasons than this bug. If you want to test it to see if it fixes your issue, you're more than welcome.
http://dev.gentoo.org/~uberlord/baselayout-1.13.99.tar.bz2

Extract tarball, cd into its src dir and just type make. Do NOT install anything, just try the built start-stop-daemon in there.
Comment 6 Martin von Gagern 2007-03-28 09:55:11 UTC
(In reply to comment #5)
> I cannot reproduce this with a prelinked sleep or a normal sleep with
> start-stop-daemon from baselayout-1.12.9, 1.13.0_alpha12 and the completely
> re-written baselayout-2 version.

Strange! Can you at least reproduce the different inode numbers?

The problem can also be reproduced by replacing the binary using mv, a reproduction that does not rely on the binary being prelinked:

1. cp /bin/sleep .
2. /sbin/start-stop-daemon -Svmp $PWD/sleep.pid -bx $PWD/sleep 5m
3. stat -L $PWD/sleep /proc/$(<sleep.pid)/exe
4. cp /bin/sleep sleep.new
5. mv sleep.new sleep
6. ls -l /proc/$(<sleep.pid)/exe
7. stat -L $PWD/sleep /proc/$(<sleep.pid)/exe
8. /sbin/start-stop-daemon -Kvp $PWD/sleep.pid -x $PWD/sleep
9. /sbin/start-stop-daemon -Kvp $PWD/sleep.pid -n sleep

If you were to delete the file explicitly before creating a new one of the same name, the symlink would read ".../sleep (deleted)" but otherwise behave the same, i.e. stat would still resolve it to the same (deleted) file.
I therefore take it that proc symlinks behave a bit differently from most other symlinks, in that they carry more information than just the target name.
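
The inode mismatch from the mv scenario can also be shown without start-stop-daemon. The following is a sketch under a few assumptions: a Linux `/proc`, GNU `stat`, a `sleep` that accepts fractional seconds, and a temporary directory that allows executing files. The variable names are arbitrary.

```shell
demo=$(mktemp -d)
sleep_bin=$(command -v sleep)
cp "$sleep_bin" "$demo/sleep"
started=$(stat -L -c '%d:%i' "$demo/sleep")

"$demo/sleep" 60 &
pid=$!

# Wait (at most about 5 s) until the child has actually exec'd our copy.
tries=0
while [ "$(stat -L -c '%d:%i' "/proc/$pid/exe" 2>/dev/null)" != "$started" ] \
        && [ "$tries" -lt 50 ]; do
    sleep 0.1
    tries=$((tries + 1))
done

# Replace the binary by rename, as in steps 4 and 5.
cp "$sleep_bin" "$demo/sleep.new"
mv "$demo/sleep.new" "$demo/sleep"

on_disk=$(stat -L -c '%d:%i' "$demo/sleep")
running=$(stat -L -c '%d:%i' "/proc/$pid/exe")
echo "started=$started running=$running on_disk=$on_disk"

kill "$pid"
wait "$pid" 2>/dev/null || true
rm -rf "$demo"
```

The running process keeps referring to the file it was started from, so `running` equals `started`, while `on_disk` belongs to the new file.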

> FWIW, baselayout-2 s-s-d no longer checks inodes, but for other reasons than
> this bug. If you want to test it to see if it fixes your issue, you're more
> than welcome.
> http://dev.gentoo.org/~uberlord/baselayout-1.13.99.tar.bz2
> 
> Extract tarball, cd into it's src dir and just type make. Do NOT install
> anything, just try the build start-stop-daemon in there.

OK, after also setting LD_LIBRARY_PATH to that src dir, things worked.
They even worked for all three scenarios: prelink, mv and rm. Nice work!
Just out of curiosity, what's the reason the inode checks were dropped?
Comment 7 Roy Marples (RETIRED) gentoo-dev 2007-03-28 10:30:18 UTC
(In reply to comment #6)
> OK, after also setting LD_LIBRARY_PATH to that src dir, things worked.
> They even worked for all three scenarios: prelink, mv and rm. Nice work!
> Just out of curiosity, what's the reason the inode checks were dropped?

inodes would be different when upgrading the binary - so for us it's a useless and harmful check :)
Instead we do our best to verify that the path is how the daemon was started.
Comment 8 Martin von Gagern 2007-03-28 13:07:37 UTC
(In reply to comment #7)
> inodes would be different when upgrading the binary - so for us it's a useless
> and harmful check :)

Not very different from the issue at hand here.
When will a baselayout including this fix be released?
Is there a bug report for this change, on which this one here could depend?

> Instead we do our best to verify that the path is how the daemon was started.

If you want to do your very best, you might even consider the exotic case where the binary was deleted instead of overwritten (so that readlink yields a trailing string " (deleted)") and the process chose to change its title by modifying its argv data (resulting in a modified cmdline). In this case you might want to first strip the trailing " (deleted)".
Maybe not worth the effort, though. Just came to my mind.
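
A minimal sketch of the stripping step, written in shell for illustration. `strip_deleted_suffix` is a hypothetical helper and does not show how baselayout-2 implements it.

```shell
# Hypothetical helper: drop one trailing " (deleted)" from a /proc/PID/exe link text.
strip_deleted_suffix() {
    suffix=' (deleted)'
    # ${1%pattern} removes the shortest matching suffix; quoting keeps it literal.
    printf '%s\n' "${1%"$suffix"}"
}

strip_deleted_suffix '/usr/sbin/apache2 (deleted)'    # prints: /usr/sbin/apache2
```

One caveat of this approach: a binary whose real file name ends in " (deleted)" cannot be told apart from a deleted one.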
Comment 9 Roy Marples (RETIRED) gentoo-dev 2007-03-28 13:37:56 UTC
(In reply to comment #8)
> Not very different from the issue at hand here.
> When will a baselayout including this fix be released?
> Is there a bug report to this change, on which this one here could depend?

Hopefully soon. I'd love to give a date, but the silly thing keeps moving :/
And no, there's not a bug report on this change that I know of.

> > Instead we do our best to verify that the path is how the daemon was started.
> 
> If you want to do your very best, you might even consider the exotic case where
> the binary was deleted instead of overwritten (so that readlink yields a
> trailing string " (deleted)") and the process chose to change its title by
> modifying its argv data (resulting in a modified cmdline). In this case you
> might want to first strip the trailing " (deleted)".
> Maybe not worth the effort, though. Just came to my mind.

It's a good idea. I've just put code in for this :)
Comment 10 Roy Marples (RETIRED) gentoo-dev 2007-04-14 09:48:51 UTC
well, baselayout-2 is in portage now. A few small issues, but nothing major so far.
Comment 11 Roy Marples (RETIRED) gentoo-dev 2007-05-16 14:04:10 UTC
Closed as fixed as baselayout-2 is in the tree.
However, apache2 creates its pidfile very very late, which gives the current init script some issues.

Open a new bug and assign to apache team if it troubles you.
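
One possible way for a script to cope with a late pidfile is to poll for it. This is only an illustrative sketch and not what the apache2 init script does. `wait_for_pidfile` and its default timeout are assumptions.

```shell
# Hypothetical helper: wait up to TIMEOUT seconds for a non-empty pidfile.
# Usage: wait_for_pidfile /var/run/apache2.pid [TIMEOUT]
wait_for_pidfile() {
    pidfile=$1
    timeout=${2:-10}
    while [ "$timeout" -gt 0 ]; do
        [ -s "$pidfile" ] && return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    # Final check after the last sleep.
    [ -s "$pidfile" ]
}
```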