Detailed description available on url. In short, I have a problem with 3 servers and a home workstation: after 2-10 days of uptime I get ~8000 unreaped zombies and the system fails to start new processes (fork returns an error).
All these systems use a hardened kernel, are all up to date, and all use 'runit' as process 1 (i.e. they don't use sysvinit). The author of runit provided me with several patches (available on url), including a patch which forces runit to call waitpid() every 5 seconds (i.e. even if the kernel doesn't send SIGCHLD). None of these patches helped.
So this looks like some sort of race condition bug in the kernel, or in the Gentoo patches for the kernel, which prevents some zombies from being reaped by waitpid() in process 1, even when the PPID of these zombie processes is set to 1.
All my servers work very hard 24x7 and run a lot of short-lived processes, and after a few days they start accumulating unreaped zombies. Most of these zombies are [sshd] processes (the result of a huge number of failed login attempts by ssh worms).
Right now I'm forced to manually reboot all servers every few days, which is unacceptable. I'm ready to run any test, provide any information, compile anything in debug mode, etc. - just say what you need (please read the mailing list thread on url first, because a lot of questions have already been answered there).
P.S. A simple perl script which generates thousands of zombies doesn't trigger this bug - i.e. all zombies are reaped very quickly. But after several days, once the bug has occurred and I already have hundreds of unreaped zombies, the same perl script increases the number of unreaped zombies very quickly.
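For readers unfamiliar with the reaping mechanism discussed above, here is a minimal Python sketch of a polling reaper: it calls waitpid() on a timer instead of relying only on SIGCHLD. This is not runit's code (runit is written in C, and the actual patch is not shown in this report); the function names `reap_children`, `spawn_short_lived` and `reap_until`, and the polling interval, are assumptions made for illustration.

```python
import os
import time


def reap_children():
    """Reap every child that has already exited; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def spawn_short_lived(count):
    """Fork `count` children that exit immediately; return their PIDs."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once and become a zombie until reaped
        pids.append(pid)
    return pids


def reap_until(expected, interval=0.05, timeout=5.0):
    """Poll reap_children() every `interval` seconds.

    Returns the set of PIDs from `expected` that were still unreaped
    when the timeout expired (empty set on success).
    """
    remaining = set(expected)
    deadline = time.monotonic() + timeout
    while remaining and time.monotonic() < deadline:
        remaining -= set(reap_children())
        if remaining:
            time.sleep(interval)
    return remaining


if __name__ == "__main__":
    children = spawn_short_lived(5)
    leftover = reap_until(children)
    print("unreaped:", sorted(leftover))
```

The inner loop keeps calling waitpid() until nothing more is ready, because several children can exit between two polls (or between two SIGCHLD deliveries) and a single call reaps only one of them.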
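The perl script itself is not included in this report. As a rough illustration of how such a generator can hand zombies to process 1, the Python sketch below double-forks: the intermediate process exits first, so the grandchild is orphaned and must be reaped by whatever process adopts it (normally PID 1). The name `spawn_orphan` and the `lifetime` parameter are hypothetical, and this is an assumption about the technique, not a port of the original script.

```python
import os
import time


def spawn_orphan(lifetime=0.1):
    """Double-fork and return the PID of a grandchild that will be orphaned.

    The grandchild outlives the intermediate process by `lifetime` seconds,
    so when it exits it is no longer our descendant and only its adoptive
    parent (normally PID 1) can reap it.
    """
    read_fd, write_fd = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Intermediate process: fork the grandchild, report its PID, exit.
        try:
            os.close(read_fd)
            grandchild = os.fork()
            if grandchild == 0:
                os.close(write_fd)
                time.sleep(lifetime)  # outlive the intermediate process
                os._exit(0)
            os.write(write_fd, str(grandchild).encode())
            os._exit(0)
        finally:
            os._exit(1)  # only reached if something above raised
    # Original process: read the grandchild PID, then reap the intermediate.
    os.close(write_fd)
    data = b""
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:
            break
        data += chunk
    os.close(read_fd)
    os.waitpid(middle, 0)
    return int(data)


if __name__ == "__main__":
    orphans = [spawn_orphan() for _ in range(10)]
    print("orphaned PIDs:", orphans)
```

On a healthy system these orphans disappear from the process table almost immediately after they exit; on an affected system they would be expected to stay in state Z with PPID 1.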
Portage 126.96.36.199 (hardened/x86/2.6, gcc-3.4.6, glibc-2.5-r4, 2.6.20-hardened-r6 i686)
System uname: 2.6.20-hardened-r6 i686 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Gentoo Base System release 1.12.9
Timestamp of tree: Fri, 24 Aug 2007 17:30:01 +0000
dev-java/java-config: 1.3.7, 2.0.33-r1
dev-lang/python: 2.3.5-r3, 2.4.4-r4
sys-devel/autoconf: 2.13, 2.61-r1
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10
CFLAGS="-march=pentium-m -O2 -pipe"
CONFIG_PROTECT="/etc /service /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/X11/xkb /usr/share/config /var/qmail/alias /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-march=pentium-m -O2 -pipe"
FEATURES="distlocks metadata-transfer sandbox sfperms strict userpriv usersandbox"
GENTOO_MIRRORS="http://pandemonium.tiscali.de/pub/gentoo/ http://188.8.131.52/mirror/gentoo/ http://gentoo.zie.pg.gda.pl"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --filter=H_**/files/digest-*"
PORTDIR_OVERLAY="/usr/portage/local/layman/musicbrainz /usr/portage/local/layman/berkano /usr/portage/local/layman/vmware /usr/local/portage /usr/local/portage-power /usr/local/portage-rusxmms"
Unset: CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, LDFLAGS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
I should probably add that this issue has existed for about 3 months. It first happened on kernel 2.6.16-hardened-r11 and runit 1.5.0; then I upgraded to 2.6.20-hardened-r2 and runit 1.7.2 (with and without custom patches from the author).
The issue first happened on 2007-05-26. Since then I get it every ~7 days, so probably something that changed 1-2 weeks before this date is the source of the problem. Here is the list of kernel, glibc and runit upgrades for that period:
2.6.16-hardened-r11 was used from 2006-09-10
2.6.20-hardened-r2 was used from 2007-06-16
2.6.20-hardened-r6 was used from 2007-08-23
Mon Dec 18 02:25:38 2006 >>> sys-libs/glibc-2.3.6-r5
Thu Jun 14 23:12:38 2007 >>> sys-libs/glibc-2.5-r3
Fri Jun 15 00:52:36 2007 >>> sys-libs/glibc-2.5-r3
Fri Jun 15 02:24:19 2007 >>> sys-libs/glibc-2.5-r3
Fri Jun 15 12:41:12 2007 >>> sys-libs/glibc-2.5-r3
Thu Jul 12 19:49:46 2007 >>> sys-libs/glibc-2.5-r4
Mon Apr 10 01:21:32 2006 >>> sys-process/runit-1.4.1
Fri Apr 21 19:18:39 2006 >>> sys-process/runit-1.5.0
Mon Jun 11 13:00:46 2007 >>> sys-process/runit-1.7.2
Fri Jun 15 04:31:12 2007 >>> sys-process/runit-1.7.2
Wed Jun 20 17:14:15 2007 >>> sys-process/runit-1.7.2
Wed Jun 20 18:41:15 2007 >>> sys-process/runit-1.7.2
Sat Jul 7 07:20:10 2007 >>> sys-process/runit-1.7.2
and here is the list of all upgrades for that period (May 2007):
Sun May 6 19:05:48 2007 >>> sys-apps/debianutils-2.17.5
Sun May 6 19:08:07 2007 >>> dev-libs/apr-0.9.12
Sun May 6 19:11:34 2007 >>> dev-util/pkgconfig-0.21-r1
Sun May 6 19:11:54 2007 >>> sys-libs/timezone-data-2007d
Sun May 6 19:12:48 2007 >>> dev-lang/spidermonkey-1.5-r2
Sun May 6 19:13:17 2007 >>> sys-devel/patch-2.5.9-r1
Sun May 6 19:13:24 2007 >>> sys-apps/hdparm-6.9-r1
Sun May 6 19:14:28 2007 >>> net-misc/rsync-2.6.9-r2
Sun May 6 19:15:34 2007 >>> dev-libs/pth-2.0.6
Sun May 6 19:15:37 2007 >>> sys-devel/binutils-config-1.9-r4
Sun May 6 19:19:57 2007 >>> app-shells/bash-3.2_p15-r1
Sun May 6 19:20:31 2007 >>> dev-util/dialog-1.1.20070227
Sun May 6 19:20:59 2007 >>> sys-apps/man-1.6e-r3
Sun May 6 19:22:14 2007 >>> media-libs/libpng-1.2.16
Sun May 6 19:23:48 2007 >>> media-libs/freetype-2.1.10-r3
Sun May 6 19:23:58 2007 >>> app-misc/ca-certificates-20070303-r1
Sun May 6 19:26:04 2007 >>> sys-libs/readline-5.2_p2
Sun May 6 19:27:49 2007 >>> dev-libs/libgpg-error-1.5
Sun May 6 19:28:43 2007 >>> sys-devel/m4-1.4.9
Sun May 6 19:30:40 2007 >>> sys-fs/e2fsprogs-1.39-r2
Sun May 6 19:31:19 2007 >>> app-editors/nano-2.0.4
Sun May 6 19:32:19 2007 >>> net-mail/fetchmail-6.3.8
Sun May 6 19:32:55 2007 >>> sys-devel/flex-2.5.33-r2
Sun May 6 19:33:25 2007 >>> sys-apps/baselayout-1.12.9-r2
Sun May 6 19:36:02 2007 >>> sys-apps/util-linux-2.12r-r6
Sun May 6 19:37:14 2007 >>> app-editors/vim-core-7.0.235
Sun May 6 19:38:27 2007 >>> dev-libs/libksba-1.0.0
Sun May 6 19:40:22 2007 >>> dev-libs/libxslt-1.1.20
Sun May 6 19:41:04 2007 >>> sys-apps/module-init-tools-3.2.2-r3
Sun May 6 19:47:27 2007 >>> app-editors/vim-7.0.235
Sun May 6 19:53:14 2007 >>> sys-kernel/hardened-sources-2.6.20-r2
Sun May 6 19:56:14 2007 >>> net-misc/curl-7.15.1-r1
Sun May 6 20:36:14 2007 >>> dev-db/mysql-5.0.38
Sun May 6 20:53:01 2007 >>> media-gfx/imagemagick-6.3.3
Sun May 6 20:53:19 2007 >>> sys-devel/gcc-config-1.3.16
Sun May 6 21:11:35 2007 >>> sys-libs/libstdc++-v3-3.3.6
Wed May 9 15:53:39 2007 >>> media-libs/freetype-2.3.3
Wed May 9 15:59:54 2007 >>> dev-lang/python-2.4.4
Wed May 9 16:04:33 2007 >>> mail-mta/netqmail-1.05-r8
Wed May 23 13:51:19 2007 >>> sys-apps/portage-184.108.40.206
Wed May 23 13:51:45 2007 >>> sys-libs/timezone-data-2007e
Wed May 23 13:51:55 2007 >>> app-forensics/chkrootkit-0.47
Wed May 23 13:52:28 2007 >>> sys-libs/zlib-1.2.3-r1
Wed May 23 13:53:50 2007 >>> media-libs/libpng-1.2.18
Wed May 23 13:56:07 2007 >>> media-libs/freetype-2.3.4-r2
Wed May 23 14:44:45 2007 >>> dev-db/mysql-5.0.40
Wed May 23 14:50:22 2007 >>> dev-lang/python-2.4.4-r4
Wed May 23 14:50:24 2007 >>> app-admin/python-updater-0.2
Wed May 23 14:51:36 2007 >>> sys-apps/util-linux-2.12r-r7
Wed May 23 14:52:04 2007 >>> sys-apps/gradm-220.127.116.11702231759
Wed May 23 14:59:26 2007 >>> app-crypt/gnupg-1.4.7-r1
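To narrow a list like the one above down to the "1-2 weeks before the first occurrence" window, a small filter helps. The Python sketch below parses lines in the format shown here (timestamp, `>>>`, package atom) and keeps those that fall within a given number of days before an incident date. The function names and the 14-day default are assumptions for illustration; the raw emerge log may use a different timestamp format and would need its own parser.

```python
from datetime import datetime, timedelta

# Timestamp layout of the lines above, e.g. "Sun May 6 19:05:48 2007".
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_upgrade(line):
    """Split 'Sun May 6 19:05:48 2007 >>> cat/pkg-ver' into (datetime, package).

    Returns None for lines that do not match this layout.
    """
    stamp, sep, package = line.partition(" >>> ")
    if not sep:
        return None
    try:
        when = datetime.strptime(stamp.strip(), LOG_TIME_FORMAT)
    except ValueError:
        return None
    return when, package.strip()


def upgrades_before(lines, incident, days=14):
    """Return (datetime, package) pairs from the `days` days before `incident`."""
    start = incident - timedelta(days=days)
    hits = []
    for line in lines:
        parsed = parse_upgrade(line)
        if parsed and start <= parsed[0] < incident:
            hits.append(parsed)
    return sorted(hits)


if __name__ == "__main__":
    sample = [
        "Sun May 6 19:05:48 2007 >>> sys-apps/debianutils-2.17.5",
        "Wed May 23 13:52:28 2007 >>> sys-libs/zlib-1.2.3-r1",
        "Thu Jun 14 23:12:38 2007 >>> sys-libs/glibc-2.5-r3",
    ]
    for when, package in upgrades_before(sample, datetime(2007, 5, 26)):
        print(when.isoformat(sep=" "), package)
```

Widening `days` to cover the whole month shows how many more candidates enter the picture once the 6 May batch is included.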
you state in here that the C test script indeed causes you trouble. I ran the same program on my hardened x86 server, which has 13 days of uptime, and i got this:
Nebuchadnezzar ~ # date; ps a | grep Z | wc
Mon Aug 27 19:01:48 WEST 2007
1 7 48
Nebuchadnezzar ~ # ./a.out | grep f | wc -l
Nebuchadnezzar ~ # date; ps a | grep Z | wc
Mon Aug 27 19:02:36 WEST 2007
1 7 48
So, this isn't reproducible on a stable x86 hardened system (with sysvinit), which makes me believe this is runit's fault. Please confirm that that script indeed causes zombies on your systems. If that is the case, this bug should be moved to the runit guys :/
(btw, why are you still using gcc-3.4??)
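A note on counting: a pipeline such as `ps a | grep Z | wc` can be misleading, because `ps a` lists only processes attached to a terminal (the zombies reported here show `?` as their tty) and the grep process can match its own command line. A more direct count reads the state field from /proc. The Python sketch below is one way to do that on Linux; the function names are hypothetical, and the parsing assumes the usual /proc/<pid>/stat layout of `pid (comm) state ppid ...`.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def list_zombies(proc_root="/proc"):
    """Return sorted (pid, ppid, comm) tuples for every process in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as fh:
                pid, comm, state, ppid = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished mid-scan, or the file is unreadable
        if state == "Z":
            zombies.append((pid, ppid, comm))
    return sorted(zombies)


def zombies_of_init(proc_root="/proc"):
    """Zombies whose parent is PID 1, i.e. the ones init should be reaping."""
    return [z for z in list_zombies(proc_root) if z[1] == 1]


if __name__ == "__main__":
    found = zombies_of_init()
    print(len(found), "zombie(s) with PPID 1")
    for pid, ppid, comm in found:
        print(pid, ppid, comm)
```

Running this before and after the test program gives a count that includes processes without a terminal and excludes the counting tool itself.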
> Please confirm that that script indeed causes zombies on your systems.
As I wrote to the runit mailing list, this script causes zombies only __after__ this bug has already happened (after several days of uptime). Until I have a few dozen unreaped zombies for some unknown reason, this script doesn't increase the number of unreaped zombies.
> (btw, why are you still using gcc-3.4??)
because newer gcc is still hardmasked for hardened:
# grep gcc /usr/portage/profiles/hardened/package.mask
# The following packages need =gcc-4*
# Mask off gcc-4 for all hardened arches until SSP is sorted out (i.e.
# backport for gcc-4.0 and 4.0/4.1 rigged for SSP support in the C
# but be prepared to rebuild anything you build with gcc-4, later.
P.S. I have one more server, which uses the same runit, glibc and kernel versions, and it doesn't have this issue. What I mean is that this issue doesn't happen on all servers which use runit. I've spent 3 months trying to catch this bug in runit, because that's much easier than in the kernel. But now it looks like runit is ok, and the issue is outside it.
(In reply to comment #3)
> > (btw, why are you still using gcc-3.4??)
> because newer gcc is still hardmasked for hardened:
yeah, you're right, sorry...
> P.S. I have one more server, which uses the same runit, glibc and kernel
> versions, and it doesn't have this issue. What I mean is that this issue
> doesn't happen on all servers which use runit. I've spent 3 months trying to
> catch this bug in runit, because that's much easier than in the kernel. But
> now it looks like runit is ok, and the issue is outside it.
well, from what i can tell, and what you can tell, it's not the kernel's fault either, since you have 2 servers with the same kernel/glibc/gcc/runit and one works and the other doesn't. the question is: "what's different between them?" that's where the problem is...
> the question is: "what's different between them?" that's where the problem is...
It's not that easy. :( I've spent a lot of time trying to figure it out. That server isn't mine, and several months ago its admin made one mistake: he emerged gcc-4.1.1, which doesn't support hardened yet. This is the only significant difference I've found. Probably all packages on that server were compiled with that non-hardened gcc-4.1.1 (including kernel, glibc and runit).
I dislike the idea of installing the same non-hardened gcc on my other servers, recompiling the whole system with it, testing the zombie issue, and then downgrading gcc and recompiling everything again. These are production servers, and I'm not sure they would survive this experiment.
Maybe it makes sense to switch to a non-hardened gcc while staying on the current gcc-3.4.6 and recompile world; I'm not sure. At least this shouldn't break the system, unlike the risky gcc upgrade/downgrade procedure. What do you think - does it make sense?
well, i'd do it the other way around: reselect a hardened gcc on the box where gcc was upgraded and recompile everything with "emerge -e world", including the kernel, just to make sure....
anyway, i really think this is not a kernel problem, since i have a home server with months of uptime and i don't see any problem with zombies, and as you said, you also have servers that work fine.
If you can, downgrade gcc and recompile everything on that box. i'll wait for your results.
More on-topic information is available now.
1) There is another guy who has this issue. He also uses Gentoo, and he got this issue 1 day before me (25 May). So it looks like this bug is related to the previous `emerge -uDNa world` we did - nothing else was changed on our servers.
2) That guy doesn't use hardened at all, and he uses a newer gcc (but he didn't recompile the whole system after upgrading gcc). So this issue isn't related to hardened, huh!
3) I've caught clean strace output from the ssh server, which created an unreaped zombie. To do this I tried to connect to my server as user mysql.
# date ; ps -ef axf | tail -n 1
Sat Sep 15 13:51:38 GMT 2007
sshd 14804 1 0 13:50 ? Z 0:00 [sshd] <defunct>
# date ; ps -ef axf | tail -n 1
Sat Sep 15 13:51:53 GMT 2007
sshd 14804 1 0 13:50 ? Z 0:00 [sshd] <defunct>
# tail -n 1 /var/log/syslog/all/current
auth.info: Sep 15 13:50:43 sshd: User mysql not allowed because account is locked
Strace output with all details about PIDs 939 (ssh server), 14803 and 14804
(unreaped zombie) is here:
You can also look at the `ps -ef axf` output for this server (just before the last zombie was created):
and syslog output for Sep 15 (kernel log is empty for Sep 15):
Better to attach information to this bug rather than using URLs. By attaching, we can guarantee that the information will be available for future reference.
Have you done what Carlos asked in comment #6? Ie, on the server that doesn't have this problem, install the same version of gcc and the kernel as you have on the 'problem servers', compile the kernel with that gcc, then see if that server also exhibits this behavior. If not, I would have to agree with Carlos that this is not a kernel issue.
No, I didn't recompile the system without hardened - these are production boxes, after all. And, as I noted in comment #7, somebody else has the same issue without hardened.
I did other things. First, I replaced runit-init with sysvinit on half of my servers using this /etc/inittab:
l0:0:wait:/bin/sh -c '/etc/runit/3; exec /sbin/halt'
l6:6:wait:/bin/sh -c '/etc/runit/3; exec /sbin/reboot'
ca:12345:ctrlaltdel:/sbin/shutdown -r now
It's too soon to draw a conclusion, but it looks like the systems with sysvinit don't have this zombie issue.
Second, the runit author today asked me to test these commands on a server with runit as process 1 which already had unreaped zombies:
# chmod 0 /etc/runit/stopit
# kill -CONT 1
and this helped - all zombies were reaped! New zombies which I generated with the test script were reaped too. Looks like he found something, and I hope he'll release a patch for runit soon.
Now it looks like this is a bug in runit, so I'll change the subject to reflect this.
Reassigning to base-system, as this appears to be a runit issue. It's not a kernel bug.
Looks like the latest versions of runit don't have this issue.
I've been testing since Feb 2011, starting with kernel 2.6.36-hardened-r9 and runit 2.0.0.
So, this bug can be closed.