96038 – metalog: processes hang when writing to /dev/log, e.g. PAM denies login

Bug 96038 - metalog: processes hang when writing to /dev/log, e.g. PAM denies login

Summary: metalog: processes hang when writing to /dev/log, e.g. PAM denies login

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system (show other bugs)
Hardware:	x86 Linux

Importance:	High critical (vote)
Assignee:	PAM Gentoo Team (OBSOLETE)

URL:
Whiteboard:
Keywords:

Duplicates (1):	130497 (view as bug list)
Depends on:
Blocks:

Reported:	2005-06-14 00:05 UTC by Martin von Gagern
Modified:	2007-05-19 21:11 UTC (History)
CC List:	4 users (show)

See Also:
Package list:
Runtime testing required:	---

Attachments
autopsy of a pair of metalog processes (metalog.txt,2.14 KB, text/plain) 2005-06-14 07:37 UTC, Martin von Gagern	Details
maps and status of the stale processes (metalog2.txt,3.39 KB, text/plain) 2005-06-14 08:25 UTC, Martin von Gagern	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin von Gagern 2005-06-14 00:05:55 UTC

OK, this is one of the weirdest bugs I've ever seen, so I'll give you a rather
lengthy description as I'm not too sure about the connections myself.
Actually I had problems to log into my system twice, both times I'm sure it was
pam because multiple services were affected. I'm not sure if those two incidents
are connected, but two unrelated PAM problems seem too much of a coincidence, so
I file one single bug report.

The first thing was, I came to my machine and was locked out: could log into no
console, could not authenticate via ssh, could not even cancel my kde screen
lock. All login atempts simply got stuck. Only Idea I had was to press reset.

The Boot process got stuck while starting Postfix. So reset again, this time
with init=/sbin/sulogin. That worked, luckily, but emerge did not work from this
environment, I'll paste the trace somewhere down there.
After several reboots I got my system to runlevel 1 with sulogin on the first
vc. Now I think about it, I'm not entirely sure I even tried to log into one of
the other, PAM based login vc's.

I rsynced and remerged pam and pam-login. etc-update gave me some differences in
system-auth (the path /lib/security/ got removed from modules), which is
surprising because emerge labeld this a remerge, same version, and I'm pretty
sure I did not throw away any new pam configs. After this login did work again.
When going to runlevel 3, NFS got stuck, so I removed this service for the time
being and could reboot successfully.

Today, I suddenly experienced problems using su, and ssh and login as well.
Luckily, I had a root shell open, so I could strace su and find out that it got
stuck in the send call, writing to /dev/log. I checked, /dev/log was a socket
all right. Unfortunately, I did not think to netstat this thing.

After restarting metalog, su and all the other programs suddenly worked all
right. The migration from sysklogd to metalog was a rather recent change to my
system.

Basically there are several issues in this:
1. Why does the send on /dev/log fail?
2. Why were there changes in my configs, as the package version kept te same?
3. PAM should probably respond more gracefully to bad configs, if this was
indeed the problem I experienced.
4. PAM should recover from a logging problem.
5. emerge should work from a init=/sbin/sulogin shell, as should screen.


Reproducible: Didn't try
Steps to Reproduce:

Actual Results:  
emerge from init=/sbin/sulogin shell:

Traceback (most recent call last):
  File "/usr/bin/emerge", line 3200, in ?
    mydepgraph.merge(mydepgraph.altlist())
  File "/usr/bin/emerge", line 1912, in merge
    retval=portage.doebuild(y,"merge",myroot,self.pkgsettings,edebug)
  File "/usr/lib/portage/pym/portage.py", line 2721, in doebuild
   
retval=spawnebuild("install",actionmap,mysettings,debug,alwaysdep=1,logfile=logfile)
  File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild
   
retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile)
  File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild
   
retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile)
  File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild
   
retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile)
  File "/usr/lib/portage/pym/portage.py", line 2304, in spawnebuild
   
retval=spawnebuild(actionmap[mydo]["dep"],actionmap,mysettings,debug,alwaysdep=alwaysdep,logfile=logfile)
  File "/usr/lib/portage/pym/portage.py", line 2320, in spawnebuild
    droppriv=actionmap[mydo]["args"][1],logfile=logfile)
  File "/usr/lib/portage/pym/portage.py", line 1603, in spawn
    return portage_exec.spawn_bash(mystring,env=env,**keywords)
  File "/usr/lib/portage/pym/portage_exec.py", line 48, in spawn_bash
    return spawn(args,env=env,opt_name=opt_name,**keywords)
  File "/usr/lib/portage/pym/portage_exec.py", line 83, in spawn
   
mypid.extend(spawn(('tee','-i','-a',logfile),returnpid=True,fd_pipes={0:pr,1:1,2:2}))
TypeError: iteration over non-sequence


Expected Results:  
emerge should work in minimalistic environments.

[I--] [  ] app-admin/metalog-0.8_rc1-r1 (0)
[I--] [  ] sys-apps/pam-login-3.17 (0)
[I--] [  ] sys-fs/udev-058 (0)
[I--] [  ] sys-libs/pam-0.78-r2 (0)

Portage 2.0.51.22-r1 (default-linux/x86/2005.0, gcc-3.4.4, glibc-2.3.5-r0,
2.6.11.10 i686)
=================================================================
System uname: 2.6.11.10 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
Gentoo Base System version 1.6.12
distcc 2.18.3 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
dev-lang/python:     2.3.5, 2.4.1
sys-apps/sandbox:    1.2.9
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.5
sys-devel/binutils:  2.16-r1
sys-devel/libtool:   1.5.18
virtual/os-headers:  2.6.11-r1
ACCEPT_KEYWORDS="x86 ~x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=prescott -O2"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3.3/env
/usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3.4/env
/usr/kde/3.4/share/config /usr/kde/3.4/shutdown /usr/kde/3/share/config
/usr/lib/X11/xkb /usr/lib/mozilla/defaults/pref /usr/share/config
/var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/texmf/web2c /etc/env.d"
CXXFLAGS="-march=prescott -O2"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://gentoo.inode.at/ rsync://ftp.snt.utwente.nl/gentoo
ftp://sunsite.informatik.rwth-aachen.de/pub/Linux/gentoo
ftp://ftp.snt.utwente.nl/pub/os/linux/gentoo
ftp://ftp.tu-clausthal.de/pub/linux/gentoo/ http://gentoo.oregonstate.edu
http://www.ibiblio.org/pub/Linux/distributions/gentoo"
LINGUAS="en de"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.de.gentoo.org/gentoo-portage"
USE="x86 X aac aalib acpi alsa apache2 apm arts auctex avi bash-completion
bcmath berkdb bigger-fonts bitmap-fonts bzip2 bzlib c++ cairo cdparanoia cdr
chroot crypt css cups curl dba dhcp divx4linux dnd doc dts dv dvd emboss encode
escreen esd ethereal exif faad fam fastcgi fbcon ffmpeg fftw flac flatfile
foomaticdb fortran ftp gcc-libffi gcj gd gdbm gif gimp gimpprint gnutls gpm
graphviz gs gtk gtk2 guile imagemagick imlib ipv6 java jpeg junit kde latex ldap
libg++ libwww lirc lzo lzw mad maildir mailwrapper mikmod mime mjpeg mmx motif
mozilla moznocompose mozxmlterm mp3 mpeg mpeg2 mplayer mpm-worker mule mysql
ncurses net network nls no-old-linux nptl odbc ogg oggvorbis opengl operanom2
oss pam pdf pdflib perl php pic pie plotutils png postgres povray procmail
python qt quicktime rdesktop readline real recode samba sasl savedconfig sdl
slang smime sndfile sockets sox spell sse sse2 ssl svga tcltk tcpd tetex threads
tiff tokenizer transcode translator truetype truetype-fonts type1 type1-fonts
unicode usb userlocales utf8 v4l v4l2 vorbis wmf xanim xchattext xemacs xine xml
xml2 xmms xv xvid xvmc zlib linguas_en linguas_de userland_GNU kernel_linux
elibc_glibc"
Unset:  ASFLAGS, CTARGET, LANG, LC_ALL, LDFLAGS

Comment 1 Martin von Gagern 2005-06-14 04:56:31 UTC

OK, as I got no reply to this bug report, I found out that some other services
were stuck as well, namely fetchmail, cron and postfix. As I don't think they
all use PAM during normal operation, I believe that they all suffered from the
same problems when writing to /dev/log.

Comment 2 Martin von Gagern 2005-06-14 07:37:34 UTC

Created attachment 61205 [details]
autopsy of a pair of metalog processes

OK, this looks more and more like a metalog-only problem. I just had another
such lockup, decided to reboot my system, and got stuck when starting postfix -
again. So I guess that the inability to successfully start postfix is probably
caused by metalog as well, and all this PAM related issues were more or less
just Problems with metalog.

I just booted into runlevel 1, logged into two consoles, and continued to
runlevel 3. I got another lockup. I found out that "/etc/init.d/metalog
restart" does not kill the previousely running instances, they stay alive. This
gave me the opportunity to find out what happened to them. I attached some
program outputs.

First interesting thing is that both processes listen to /dev/log. It is my
understanding that this is not what the kernel daemon should do, but I might be
wrong there. I guess those other two sockets, 9778 and 9779, are just a pair of
pipes connecting those two processes. The master is stuck in the syscall
"futex", while the kernel daemon is stuck in the syscall "syslog".

Comment 3 Martin von Gagern 2005-06-14 08:25:58 UTC

Created attachment 61208 [details]
maps and status of the stale processes

I forgot to have a look at the maps and status info for attachment 61205 [details].

This tells me that the futex is somewhere in glibc code. And I found that the
syslog call is infact to read messages from the kernel, so this looks like this
process might still work as expected, and it dies respond to a simple kill,
turning it into a zombie.

Comment 4 Martin Schlemmer (RETIRED) gentoo-dev

2005-07-05 13:15:46 UTC

Possible to try an later kernel or maybe sysklogd or such ?

Comment 5 Martin von Gagern 2005-07-05 13:29:22 UTC

I did fall back to sysklogd. I'm not too keen to experiment with this system to
see if a new kernel does change anything, since those errors were rather severe
and I don't think this too lokely. Maybe when I find some more time, or if
someone has a more specific idea what might have caused this problem.

Comment 6 Martin Schlemmer (RETIRED) gentoo-dev

2005-07-05 13:39:08 UTC

Understood.  I asked more to see if we can track it to metalog/kernel/glibc
issue ... is it working with sysklogd btw ?

Comment 7 Martin von Gagern 2005-07-05 13:51:06 UTC

Yes, sysklogd never gave me any trouble. So by now I'm rather sure this problem
is at least largely a metabug issue. Maybe I sould change the summary accordingly.

Comment 8 SpanKY gentoo-dev

2005-07-05 20:52:42 UTC

which version of metalog ?

Comment 9 Martin von Gagern 2005-07-06 01:14:16 UTC

This is metalog-0.8_rc1-r1, as stated in the original bug report.

Comment 10 Andrew D. Keyser (aka Legoguy) 2005-09-27 09:42:36 UTC

I think I've run across this bug or something similar to it this morning --
after doing a rebuild of the system -- when I try to login, it prompts for
password, I enter the correct password, and it just sits there for a minute and
then claims timeout. Through SSH, it doesn't even do that, just sits there
forever. The strange thing is that all of my services started fine, apache is
running and responding -- it's just that nothing is able to log in. I'm not able
to access the box right now to verify with the init=/sbin/sulogin, but I will do
so asap (approx 1h 30m from now). Reproducable across 2 reboots (had to leave
for school before I solved it..)

Again this is after a complete toolchain/world rebuild with gcc-4.0.1 (which may
be part of the problem) but I've got another system (this one I'm using, in
fact) that has been running 100% gcc 4.0.1 for some time with no problems.. both
with metalog.

Again, not sure if this is related to this bug 100% but at least it sounds similar.

Comment 11 Andrew D. Keyser (aka Legoguy) 2005-09-27 11:51:29 UTC

Alright folks, I have the answer to this bug I believe.

Check out this section of /etc/metalog.conf:

console logging :
 
 facility = "*"
 command = "/usr/sbin/consolelog.sh"

This is our culprit. My guess is that upon boot, metalog tries to execute that
same script for each line of dmesg and all the rest that it's missed so fast
that init just refuses the calls to that script from metalog because it's
basically bombing the system.

The contents of the script called are not exactly complex, nor is it just a
simple echo -- all I know is that disabling the console logging (which it is by
default) fixes the problem for me.

Comment 12 Andrew D. Keyser (aka Legoguy) 2005-09-27 11:56:48 UTC

Sorry for the double post here but it seems that some recent version of metalog
changed that /usr/sbin/consolelog.sh script from a simple echo to what it is now
-- which apparently init doesn't like. _Not only this_, but since this script
may contain customized commands, shouldn't it be considered a protected config
file? This is one of the reasons why no one caught this right away--because no
one knew it had changed.

Comment 13 SpanKY gentoo-dev

2005-09-27 12:01:30 UTC

so you're saying the new dynamic one causes failure but the older one which was
a simple `echo` worked ?

why would the new one fail ?  by the time metalog is launched, /dev should be
populated and at the very least /dev/console should exist ...

if you add 'set -x' to the beginning of the script, do you see useful output ?

Comment 14 SpanKY gentoo-dev

2006-02-07 17:04:05 UTC

metalog 8.0_rc1-r2 should address this

Comment 15 Andrew D. Keyser (aka Legoguy) 2006-02-12 21:00:21 UTC

Mmm, nope. Still eventually denies login.

For example:
perl -e "while (true) { \`logger test\` }"

Let it run for about a second, and you will no longer be able to log into your box, if you have console logging turned on in metalog.conf.

Comment 16 SpanKY gentoo-dev

2006-04-19 15:23:44 UTC

*** Bug 130497 has been marked as a duplicate of this bug. ***

Comment 17 Lasse Bigum 2007-05-19 21:11:04 UTC

I just ran into this nasty bug as well.

I think it is related to these other bugs as well:
http://bugs.gentoo.org/show_bug.cgi?id=31277
http://bugs.gentoo.org/show_bug.cgi?id=177424

Namely the last one fixed it for me. The path was wrong, typing in the correct one fixes my problems.

It must be related to metalog, and I simply fail to see how a wrong path to a script should cause the system to be non-working, it is extremely critical!