OK, this is one of the weirdest bugs I've ever seen, so I'll give you a rather lengthy description as I'm not too sure about the connections myself. Actually I had problems to log into my system twice, both times I'm sure it was pam because multiple services were affected. I'm not sure if those two incidents are connected, but two unrelated PAM problems seem too much of a coincidence, so I file one single bug report. The first thing was, I came to my machine and was locked out: could log into no console, could not authenticate via ssh, could not even cancel my kde screen lock. All login atempts simply got stuck. Only Idea I had was to press reset. The Boot process got stuck while starting Postfix. So reset again, this time with init=/sbin/sulogin. That worked, luckily, but emerge did not work from this environment, I'll paste the trace somewhere down there. After several reboots I got my system to runlevel 1 with sulogin on the first vc. Now I think about it, I'm not entirely sure I even tried to log into one of the other, PAM based login vc's. I rsynced and remerged pam and pam-login. etc-update gave me some differences in system-auth (the path /lib/security/ got removed from modules), which is surprising because emerge labeld this a remerge, same version, and I'm pretty sure I did not throw away any new pam configs. After this login did work again. When going to runlevel 3, NFS got stuck, so I removed this service for the time being and could reboot successfully. Today, I suddenly experienced problems using su, and ssh and login as well. Luckily, I had a root shell open, so I could strace su and find out that it got stuck in the send call, writing to /dev/log. I checked, /dev/log was a socket all right. Unfortunately, I did not think to netstat this thing. After restarting metalog, su and all the other programs suddenly worked all right. The migration from sysklogd to metalog was a rather recent change to my system. Basically there are several issues in this: 1. Why does the send on /dev/log fail? 2. Why were there changes in my configs, as the package version kept te same? 3. PAM should probably respond more gracefully to bad configs, if this was indeed the problem I experienced. 4. PAM should recover from a logging problem. 5. emerge should work from a init=/sbin/sulogin shell, as should screen. Reproducible: Didn't try Steps to Reproduce: Actual Results: emerge from init=/sbin/sulogin shell: Traceback (most recent call last): File "/usr/bin/emerge", line 3200, in ? mydepgraph.merge(mydepgraph.altlist()) File "/usr/bin/emerge", line 1912, in merge retval=portage.doebuild(y,"merge",myroot,self.pkgsettings,edebug) File "/usr/lib/portage/pym/portage.py", line 2721, in doebuild retval=spawnebuild("install",actionmap,mysettings,debug,alwaysdep=1,logfile=logfile) File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile) File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile) File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile) File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile) File "/usr/lib/portage/pym/portage.py", line 2320, in spawnebuild droppriv=actionmap[mydo]["args"][1],logfile=logfile) File "/usr/lib/portage/pym/portage.py", line 1603, in spawn return portage_exec.spawn_bash(mystring,env=env,**keywords) File "/usr/lib/portage/pym/portage_exec.py", line 48, in spawn_bash return spawn(args,env=env,opt_name=opt_name,**keywords) File "/usr/lib/portage/pym/portage_exec.py", line 83, in spawn mypid.extend(spawn(('tee','-i','-a',logfile),returnpid=True,fd_pipes={0:pr,1:1,2:2})) TypeError: iteration over non-sequence Expected Results: emerge should work in minimalistic environments. [I--] [ ] app-admin/metalog-0.8_rc1-r1 (0) [I--] [ ] sys-apps/pam-login-3.17 (0) [I--] [ ] sys-fs/udev-058 (0) [I--] [ ] sys-libs/pam-0.78-r2 (0) Portage 2.0.51.22-r1 (default-linux/x86/2005.0, gcc-3.4.4, glibc-2.3.5-r0, 2.6.11.10 i686) ================================================================= System uname: 2.6.11.10 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz Gentoo Base System version 1.6.12 distcc 2.18.3 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled] dev-lang/python: 2.3.5, 2.4.1 sys-apps/sandbox: 1.2.9 sys-devel/autoconf: 2.13, 2.59-r6 sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.5 sys-devel/binutils: 2.16-r1 sys-devel/libtool: 1.5.18 virtual/os-headers: 2.6.11-r1 ACCEPT_KEYWORDS="x86 ~x86" AUTOCLEAN="yes" CBUILD="i686-pc-linux-gnu" CFLAGS="-march=prescott -O2" CHOST="i686-pc-linux-gnu" CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3.3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3.4/env /usr/kde/3.4/share/config /usr/kde/3.4/shutdown /usr/kde/3/share/config /usr/lib/X11/xkb /usr/lib/mozilla/defaults/pref /usr/share/config /var/qmail/control" CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/texmf/web2c /etc/env.d" CXXFLAGS="-march=prescott -O2" DISTDIR="/usr/portage/distfiles" FEATURES="autoconfig distlocks sandbox sfperms strict" GENTOO_MIRRORS="http://gentoo.inode.at/ rsync://ftp.snt.utwente.nl/gentoo ftp://sunsite.informatik.rwth-aachen.de/pub/Linux/gentoo ftp://ftp.snt.utwente.nl/pub/os/linux/gentoo ftp://ftp.tu-clausthal.de/pub/linux/gentoo/ http://gentoo.oregonstate.edu http://www.ibiblio.org/pub/Linux/distributions/gentoo" LINGUAS="en de" MAKEOPTS="-j3" PKGDIR="/usr/portage/packages" PORTAGE_TMPDIR="/var/tmp" PORTDIR="/usr/portage" PORTDIR_OVERLAY="/usr/local/portage" SYNC="rsync://rsync.de.gentoo.org/gentoo-portage" USE="x86 X aac aalib acpi alsa apache2 apm arts auctex avi bash-completion bcmath berkdb bigger-fonts bitmap-fonts bzip2 bzlib c++ cairo cdparanoia cdr chroot crypt css cups curl dba dhcp divx4linux dnd doc dts dv dvd emboss encode escreen esd ethereal exif faad fam fastcgi fbcon ffmpeg fftw flac flatfile foomaticdb fortran ftp gcc-libffi gcj gd gdbm gif gimp gimpprint gnutls gpm graphviz gs gtk gtk2 guile imagemagick imlib ipv6 java jpeg junit kde latex ldap libg++ libwww lirc lzo lzw mad maildir mailwrapper mikmod mime mjpeg mmx motif mozilla moznocompose mozxmlterm mp3 mpeg mpeg2 mplayer mpm-worker mule mysql ncurses net network nls no-old-linux nptl odbc ogg oggvorbis opengl operanom2 oss pam pdf pdflib perl php pic pie plotutils png postgres povray procmail python qt quicktime rdesktop readline real recode samba sasl savedconfig sdl slang smime sndfile sockets sox spell sse sse2 ssl svga tcltk tcpd tetex threads tiff tokenizer transcode translator truetype truetype-fonts type1 type1-fonts unicode usb userlocales utf8 v4l v4l2 vorbis wmf xanim xchattext xemacs xine xml xml2 xmms xv xvid xvmc zlib linguas_en linguas_de userland_GNU kernel_linux elibc_glibc" Unset: ASFLAGS, CTARGET, LANG, LC_ALL, LDFLAGS
OK, as I got no reply to this bug report, I found out that some other services were stuck as well, namely fetchmail, cron and postfix. As I don't think they all use PAM during normal operation, I believe that they all suffered from the same problems when writing to /dev/log.
Created attachment 61205 [details] autopsy of a pair of metalog processes OK, this looks more and more like a metalog-only problem. I just had another such lockup, decided to reboot my system, and got stuck when starting postfix - again. So I guess that the inability to successfully start postfix is probably caused by metalog as well, and all this PAM related issues were more or less just Problems with metalog. I just booted into runlevel 1, logged into two consoles, and continued to runlevel 3. I got another lockup. I found out that "/etc/init.d/metalog restart" does not kill the previousely running instances, they stay alive. This gave me the opportunity to find out what happened to them. I attached some program outputs. First interesting thing is that both processes listen to /dev/log. It is my understanding that this is not what the kernel daemon should do, but I might be wrong there. I guess those other two sockets, 9778 and 9779, are just a pair of pipes connecting those two processes. The master is stuck in the syscall "futex", while the kernel daemon is stuck in the syscall "syslog".
Created attachment 61208 [details] maps and status of the stale processes I forgot to have a look at the maps and status info for attachment 61205 [details]. This tells me that the futex is somewhere in glibc code. And I found that the syslog call is infact to read messages from the kernel, so this looks like this process might still work as expected, and it dies respond to a simple kill, turning it into a zombie.
Possible to try an later kernel or maybe sysklogd or such ?
I did fall back to sysklogd. I'm not too keen to experiment with this system to see if a new kernel does change anything, since those errors were rather severe and I don't think this too lokely. Maybe when I find some more time, or if someone has a more specific idea what might have caused this problem.
Understood. I asked more to see if we can track it to metalog/kernel/glibc issue ... is it working with sysklogd btw ?
Yes, sysklogd never gave me any trouble. So by now I'm rather sure this problem is at least largely a metabug issue. Maybe I sould change the summary accordingly.
which version of metalog ?
This is metalog-0.8_rc1-r1, as stated in the original bug report.
I think I've run across this bug or something similar to it this morning -- after doing a rebuild of the system -- when I try to login, it prompts for password, I enter the correct password, and it just sits there for a minute and then claims timeout. Through SSH, it doesn't even do that, just sits there forever. The strange thing is that all of my services started fine, apache is running and responding -- it's just that nothing is able to log in. I'm not able to access the box right now to verify with the init=/sbin/sulogin, but I will do so asap (approx 1h 30m from now). Reproducable across 2 reboots (had to leave for school before I solved it..) Again this is after a complete toolchain/world rebuild with gcc-4.0.1 (which may be part of the problem) but I've got another system (this one I'm using, in fact) that has been running 100% gcc 4.0.1 for some time with no problems.. both with metalog. Again, not sure if this is related to this bug 100% but at least it sounds similar.
Alright folks, I have the answer to this bug I believe. Check out this section of /etc/metalog.conf: console logging : facility = "*" command = "/usr/sbin/consolelog.sh" This is our culprit. My guess is that upon boot, metalog tries to execute that same script for each line of dmesg and all the rest that it's missed so fast that init just refuses the calls to that script from metalog because it's basically bombing the system. The contents of the script called are not exactly complex, nor is it just a simple echo -- all I know is that disabling the console logging (which it is by default) fixes the problem for me.
Sorry for the double post here but it seems that some recent version of metalog changed that /usr/sbin/consolelog.sh script from a simple echo to what it is now -- which apparently init doesn't like. _Not only this_, but since this script may contain customized commands, shouldn't it be considered a protected config file? This is one of the reasons why no one caught this right away--because no one knew it had changed.
so you're saying the new dynamic one causes failure but the older one which was a simple `echo` worked ? why would the new one fail ? by the time metalog is launched, /dev should be populated and at the very least /dev/console should exist ... if you add 'set -x' to the beginning of the script, do you see useful output ?
metalog 8.0_rc1-r2 should address this
Mmm, nope. Still eventually denies login. For example: perl -e "while (true) { \`logger test\` }" Let it run for about a second, and you will no longer be able to log into your box, if you have console logging turned on in metalog.conf.
*** Bug 130497 has been marked as a duplicate of this bug. ***
I just ran into this nasty bug as well. I think it is related to these other bugs as well: http://bugs.gentoo.org/show_bug.cgi?id=31277 http://bugs.gentoo.org/show_bug.cgi?id=177424 Namely the last one fixed it for me. The path was wrong, typing in the correct one fixes my problems. It must be related to metalog, and I simply fail to see how a wrong path to a script should cause the system to be non-working, it is extremely critical!