Summary: | runit as process 1 doesn't reap zombies | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Alex Efros <powerman-asdf> |
Component: | [OLD] Core system | Assignee: | Gentoo's Team for Core System packages <base-system> |
Status: | RESOLVED OBSOLETE | ||
Severity: | critical | CC: | dschridde+gentoobugs, radek |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | x86 | ||
OS: | Linux | ||
URL: | http://thread.gmane.org/gmane.comp.sysutils.supervision.general/1416 | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- |
Description
Alex Efros
2007-08-26 03:52:59 UTC
I should probably add that this issue has existed for about 3 months. It first happened on kernel 2.6.16-hardened-r11 with runit 1.5.0; I then upgraded to 2.6.20-hardened-r2 and runit 1.7.2 (with and without custom patches from the author). The first occurrence was on 2007-05-26. After that I got it every ~7 days, so the cause is probably something that changed 1-2 weeks before that date. Here is the list of kernel, glibc and runit upgrades for that period:

    2.6.16-hardened-r11 was used from 2006-09-10
    2.6.20-hardened-r2 was used from 2007-06-16
    2.6.20-hardened-r6 was used from 2007-08-23

    Mon Dec 18 02:25:38 2006 >>> sys-libs/glibc-2.3.6-r5
    Thu Jun 14 23:12:38 2007 >>> sys-libs/glibc-2.5-r3
    Fri Jun 15 00:52:36 2007 >>> sys-libs/glibc-2.5-r3
    Fri Jun 15 02:24:19 2007 >>> sys-libs/glibc-2.5-r3
    Fri Jun 15 12:41:12 2007 >>> sys-libs/glibc-2.5-r3
    Thu Jul 12 19:49:46 2007 >>> sys-libs/glibc-2.5-r4

    Mon Apr 10 01:21:32 2006 >>> sys-process/runit-1.4.1
    Fri Apr 21 19:18:39 2006 >>> sys-process/runit-1.5.0
    Mon Jun 11 13:00:46 2007 >>> sys-process/runit-1.7.2
    Fri Jun 15 04:31:12 2007 >>> sys-process/runit-1.7.2
    Wed Jun 20 17:14:15 2007 >>> sys-process/runit-1.7.2
    Wed Jun 20 18:41:15 2007 >>> sys-process/runit-1.7.2
    Sat Jul 7 07:20:10 2007 >>> sys-process/runit-1.7.2

and here is the list of all upgrades for that period (May 2007):

    Sun May 6 19:05:48 2007 >>> sys-apps/debianutils-2.17.5
    Sun May 6 19:08:07 2007 >>> dev-libs/apr-0.9.12
    Sun May 6 19:11:34 2007 >>> dev-util/pkgconfig-0.21-r1
    Sun May 6 19:11:54 2007 >>> sys-libs/timezone-data-2007d
    Sun May 6 19:12:48 2007 >>> dev-lang/spidermonkey-1.5-r2
    Sun May 6 19:13:17 2007 >>> sys-devel/patch-2.5.9-r1
    Sun May 6 19:13:24 2007 >>> sys-apps/hdparm-6.9-r1
    Sun May 6 19:14:28 2007 >>> net-misc/rsync-2.6.9-r2
    Sun May 6 19:15:34 2007 >>> dev-libs/pth-2.0.6
    Sun May 6 19:15:37 2007 >>> sys-devel/binutils-config-1.9-r4
    Sun May 6 19:19:57 2007 >>> app-shells/bash-3.2_p15-r1
    Sun May 6 19:20:31 2007 >>> dev-util/dialog-1.1.20070227
    Sun May 6 19:20:59 2007 >>> sys-apps/man-1.6e-r3
    Sun May 6 19:22:14 2007 >>> media-libs/libpng-1.2.16
    Sun May 6 19:23:48 2007 >>> media-libs/freetype-2.1.10-r3
    Sun May 6 19:23:58 2007 >>> app-misc/ca-certificates-20070303-r1
    Sun May 6 19:26:04 2007 >>> sys-libs/readline-5.2_p2
    Sun May 6 19:27:49 2007 >>> dev-libs/libgpg-error-1.5
    Sun May 6 19:28:43 2007 >>> sys-devel/m4-1.4.9
    Sun May 6 19:30:40 2007 >>> sys-fs/e2fsprogs-1.39-r2
    Sun May 6 19:31:19 2007 >>> app-editors/nano-2.0.4
    Sun May 6 19:32:19 2007 >>> net-mail/fetchmail-6.3.8
    Sun May 6 19:32:55 2007 >>> sys-devel/flex-2.5.33-r2
    Sun May 6 19:33:25 2007 >>> sys-apps/baselayout-1.12.9-r2
    Sun May 6 19:36:02 2007 >>> sys-apps/util-linux-2.12r-r6
    Sun May 6 19:37:14 2007 >>> app-editors/vim-core-7.0.235
    Sun May 6 19:38:27 2007 >>> dev-libs/libksba-1.0.0
    Sun May 6 19:40:22 2007 >>> dev-libs/libxslt-1.1.20
    Sun May 6 19:41:04 2007 >>> sys-apps/module-init-tools-3.2.2-r3
    Sun May 6 19:47:27 2007 >>> app-editors/vim-7.0.235
    Sun May 6 19:53:14 2007 >>> sys-kernel/hardened-sources-2.6.20-r2
    Sun May 6 19:56:14 2007 >>> net-misc/curl-7.15.1-r1
    Sun May 6 20:36:14 2007 >>> dev-db/mysql-5.0.38
    Sun May 6 20:53:01 2007 >>> media-gfx/imagemagick-6.3.3
    Sun May 6 20:53:19 2007 >>> sys-devel/gcc-config-1.3.16
    Sun May 6 21:11:35 2007 >>> sys-libs/libstdc++-v3-3.3.6
    Wed May 9 15:53:39 2007 >>> media-libs/freetype-2.3.3
    Wed May 9 15:59:54 2007 >>> dev-lang/python-2.4.4
    Wed May 9 16:04:33 2007 >>> mail-mta/netqmail-1.05-r8
    Wed May 23 13:51:19 2007 >>> sys-apps/portage-2.1.2.7
    Wed May 23 13:51:45 2007 >>> sys-libs/timezone-data-2007e
    Wed May 23 13:51:55 2007 >>> app-forensics/chkrootkit-0.47
    Wed May 23 13:52:28 2007 >>> sys-libs/zlib-1.2.3-r1
    Wed May 23 13:53:50 2007 >>> media-libs/libpng-1.2.18
    Wed May 23 13:56:07 2007 >>> media-libs/freetype-2.3.4-r2
    Wed May 23 14:44:45 2007 >>> dev-db/mysql-5.0.40
    Wed May 23 14:50:22 2007 >>> dev-lang/python-2.4.4-r4
    Wed May 23 14:50:24 2007 >>> app-admin/python-updater-0.2
    Wed May 23 14:51:36 2007 >>> sys-apps/util-linux-2.12r-r7
    Wed May 23 14:52:04 2007 >>> sys-apps/gradm-2.1.10.200702231759
    Wed May 23 14:59:26 2007 >>> app-crypt/gnupg-1.4.7-r1

You state here [1] that the C test script does cause you trouble. I ran the same program on my hardened x86 server, which has 13 days of uptime, and I got this:

    Nebuchadnezzar ~ # date; ps a | grep Z | wc
    Mon Aug 27 19:01:48 WEST 2007
    1 7 48
    Nebuchadnezzar ~ # ./a.out | grep f | wc -l
    5804
    Nebuchadnezzar ~ # date; ps a | grep Z | wc
    Mon Aug 27 19:02:36 WEST 2007
    1 7 48

So this isn't reproducible on a stable x86 hardened system (with sysvinit), which makes me believe this is runit's fault. Please confirm that the script does indeed cause zombies on your systems. If that is the case, this bug should be moved to the runit guys :/

[1] http://article.gmane.org/gmane.comp.sysutils.supervision.general/1447

(btw, why are you still using gcc-3.4??)

> Please confirm that that script indeed causes zombies on your systems.

As I wrote to the runit mailing list, this script causes zombies only __after__ this bug has already happened (after several days of uptime). Until I have a few dozen unreaped zombies for some unknown reason, this script doesn't increase the number of unreaped zombies.

> (btw, why are you still using gcc-3.4??)

Because newer gcc is still hard-masked for hardened:

    # grep gcc /usr/portage/profiles/hardened/package.mask
    # The following packages need =gcc-4*
    # Mask off gcc-4 for all hardened arches until SSP is sorted out (i.e.
    # backport for gcc-4.0 and 4.0/4.1 rigged for SSP support in the C
    # but be prepared to rebuild anything you build with gcc-4, later.
    =sys-devel/gcc-4*

P.S. I have one more server which uses the same runit, glibc and kernel versions, and it doesn't have this issue. What I mean is that this issue doesn't happen on every server that uses runit. I've spent 3 months trying to catch this bug in runit, because that's much easier than in the kernel. But now it looks like runit is OK and the issue lies outside it.
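A note on the `ps a | grep Z | wc` check used above: it counts every line containing a capital Z, which can include the grep process itself, and `ps a` lists only processes attached to a terminal. As a rough alternative, the sketch below reads /proc/<pid>/stat directly and lists only zombies whose parent is PID 1. It is not part of the original report; the names `parse_stat` and `init_zombies` are made up for illustration, and it assumes a Linux /proc filesystem.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state, ppid) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    return int(left), comm, fields[0], int(fields[1])


def init_zombies(proc="/proc"):
    """List (pid, comm) for every zombie whose parent is PID 1."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        path = os.path.join(proc, entry, "stat")
        try:
            with open(path, errors="replace") as stat_file:
                pid, comm, state, ppid = parse_stat(stat_file.read())
        except OSError:
            continue  # process vanished or is not readable
        if state == "Z" and ppid == 1:
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in init_zombies():
        print(pid, comm)
```

Running it periodically (for example from cron) and logging the count would show whether unreaped zombies accumulate over several days of uptime.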
(In reply to comment #3)
> > (btw, why are you still using gcc-3.4??)
>
> because newer gcc is still hard-masked for hardened:

Yeah, you're right, sorry...

> P.S. I have one more server which uses the same runit, glibc and kernel
> versions, and it doesn't have this issue. What I mean is that this issue
> doesn't happen on every server that uses runit. I've spent 3 months
> trying to catch this bug in runit, because that's much easier than in the
> kernel. But now it looks like runit is OK and the issue lies outside it.

Well, from what I can tell, and from what you can tell, it's not the kernel's fault either, since you have 2 servers with the same kernel/glibc/gcc/runit and one works while the other doesn't. The question is: "what's different between them?" That's where the problem is...

> the question is: "what's different between them?" that's where the problem is...

It's not that easy. :( I've spent a lot of time trying to figure it out. That server isn't mine, and several months ago its admin made one mistake: he emerged gcc-4.1.1, which doesn't support hardened yet. This is the only significant difference I've found. Probably all packages on that server were compiled with that non-hardened gcc-4.1.1 (including kernel, glibc and runit).
I dislike the idea of installing the same non-hardened gcc on my other servers, recompiling the whole system with it, testing for the zombie issue, and then downgrading gcc and recompiling everything again. These are production servers, and I'm not sure they would survive this experiment.
Maybe it makes sense to try switching to non-hardened gcc while staying on the current gcc-3.4.6, and then recompile world; I'm not sure. That at least shouldn't break the system, unlike the risky gcc upgrade/downgrade procedure. What do you think, does it make sense?
Well, I'd do it the other way around: reselect a hardened gcc on the box where gcc was upgraded and recompile everything with "emerge -e world", including the kernel, just to make sure... Anyway, I really think this is not a kernel problem, since I have a home server with months of uptime and I don't see any problem with zombies, and, as you said, you also have servers that work fine. If you can, downgrade gcc and recompile everything on that box. I'll wait for your results.

More on-topic information is available now.

1) There is another guy who has this issue. He also uses Gentoo, and he got this issue 1 day before me (25 May). So it looks like this bug is related to the previous `emerge -uDNa world` we did; nothing else was changed on our servers.
2) That guy doesn't use hardened at all, and he uses a newer gcc (but he didn't recompile the whole system after upgrading gcc). So this issue isn't related to hardened, huh!
3) I've caught clean strace output from ssh, which created an unreaped zombie. To do this I tried to connect to my server as user mysql.

    # date ; ps -ef axf | tail -n 1
    Sat Sep 15 13:51:38 GMT 2007
    sshd 14804 1 0 13:50 ? Z 0:00 [sshd] <defunct>
    # date ; ps -ef axf | tail -n 1
    Sat Sep 15 13:51:53 GMT 2007
    sshd 14804 1 0 13:50 ? Z 0:00 [sshd] <defunct>
    # tail -n 1 /var/log/syslog/all/current
    auth.info: Sep 15 13:50:43 sshd[14803]: User mysql not allowed because account is locked

Strace output with all details about PIDs 939 (ssh server), 14803 and 14804 (unreaped zombie) is here: http://powerman.asdfgroup.com/tmp/ssh_strace.txt
You can also look at the `ps -ef axf` output for this server (just before the last zombie was created): http://powerman.asdfgroup.com/tmp/ps.txt
and at the syslog output for Sep 15 (the kernel log is empty for Sep 15): http://powerman.asdfgroup.com/tmp/syslog.txt

Better to attach information to this bug rather than using URLs. By attaching, we can guarantee that the information will be available for future reference.
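In the `ps -ef axf` output above, the defunct sshd has parent PID 1. A process gets that parent when its original parent exits first: the kernel then hands the orphan to process 1 (or, on newer kernels, to a designated sub-reaper), which is expected to call wait() on it once it exits. The sketch below demonstrates that hand-over. It is only an illustration, not the C test script discussed in this bug; the function name `orphan_new_parent` is hypothetical.

```python
import os
import time


def orphan_new_parent(timeout=5.0):
    """Create an orphan; return (intermediate PID, orphan's new parent PID)."""
    read_fd, write_fd = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Intermediate process: fork the future orphan, then exit at once.
        os.close(read_fd)
        me = os.getpid()
        if os.fork() == 0:
            # Orphan: wait until the kernel has re-parented us.
            deadline = time.monotonic() + timeout
            while os.getppid() == me and time.monotonic() < deadline:
                time.sleep(0.01)
            os.write(write_fd, str(os.getppid()).encode())
            # After this exit the orphan is a zombie until its new
            # parent (normally process 1) reaps it.
            os._exit(0)
        os._exit(0)
    os.close(write_fd)
    os.waitpid(middle, 0)  # reap the intermediate process ourselves
    with os.fdopen(read_fd) as pipe:
        new_parent = int(pipe.read())  # blocks until the orphan has written
    return middle, new_parent


if __name__ == "__main__":
    middle, new_parent = orphan_new_parent()
    print("intermediate:", middle, "orphan re-parented to:", new_parent)
```

On a system where process 1 reaps correctly, the orphan disappears from the process table almost immediately; on an affected system it would stay listed as `<defunct>`.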
Alex, have you done what Carlos asked in comment #6? I.e., on the server that doesn't have this problem, install the same version of gcc and the kernel as you have on the 'problem servers', compile the kernel with that gcc, then see if that server also exhibits this behavior. If not, I would have to agree with Carlos that this is not a kernel issue.

No, I didn't recompile the system without hardened; these are production boxes, after all. And, as I noted in comment #7, somebody else has the same issue without hardened. I did other things.

First, I've replaced runit-init with sysvinit on half of my servers using this /etc/inittab:
---cut---
id:3:initdefault:
rc::bootwait:/etc/runit/1
l0:0:wait:/bin/sh -c '/etc/runit/3; exec /sbin/halt'
l3:3:once:/etc/runit/2
l6:6:wait:/bin/sh -c '/etc/runit/3; exec /sbin/reboot'
ca:12345:ctrlaltdel:/sbin/shutdown -r now
---cut---
It's too soon to draw conclusions, but it looks like systems with sysvinit don't have this zombie issue.

Second, the runit author asked me today to test these commands on a server with runit as process 1 which already has unreaped zombies:
---cut---
# chmod 0 /etc/runit/stopit
# kill -CONT 1
---cut---
and this helped: all zombies were reaped! New zombies which I generated with the test script were reaped too. It looks like he found something, and I hope he'll release a patch for runit soon.

Now it looks like it is a bug in runit, so I'll change the subject to reflect this.

Reassigning to base-system, as this appears to be a runit issue. It's not a kernel bug.

It looks like the latest versions of runit don't have this issue. I've been testing this since Feb 2011, starting with kernel 2.6.36-hardened-r9 and runit 2.0.0. So this bug can be closed.
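For context on what "reaping" means here: a zombie disappears only when its parent collects its exit status with wait()/waitpid(). An init-style process usually does this in a non-blocking loop whenever it is woken up, so that one wake-up clears every child that has exited so far. The sketch below illustrates that loop under stated assumptions; it is not runit's actual code, and the names `spawn_children`, `reap_available` and `reap_all` are hypothetical.

```python
import os
import time


def spawn_children(count):
    """Fork `count` children that exit immediately; return their PIDs."""
    pids = set()
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once, a zombie until reaped
        pids.add(pid)
    return pids


def reap_available():
    """Reap every child that has already exited, without blocking."""
    reaped = set()
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.add(pid)
    return reaped


def reap_all(expected, timeout=5.0):
    """Keep reaping until every PID in `expected` is collected or time is up."""
    remaining = set(expected)
    deadline = time.monotonic() + timeout
    while remaining and time.monotonic() < deadline:
        remaining -= reap_available()
        if remaining:
            time.sleep(0.01)
    return set(expected) - remaining


if __name__ == "__main__":
    children = spawn_children(3)
    print("reaped:", sorted(reap_all(children)))
```

If such a loop stops running, or waits only for specific PIDs, exited children pile up as `<defunct>` entries, which matches the symptom described in this bug.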