My Gentoo system crashes/hangs when trying to access the tapedrive. Anyway, I did the following to reproduce it: from a remote system I started 3 ssh sessions; in one I started a "tail -f /var/log/messages", in one I started "while true; do uptime; sleep 1 done" (this was to see if the system would crash immediately or first have a rising system load and hang then) and finally, in the third ssh session I ran "uptime; strace mt -f /dev/nst0 rewind; uptime". For the output of these commands see below, I have also included other command output of which I thought it would be useful. I have used the exact same configuration with Slackware Linux, running 2.2.x without problems, and 2.4.21, on the last one I did have some problems, sometimes I was unable to access the tapedrive until I rebooted the machine, but no crashes. If I look at the output of the strace, there was one thing I particulary noticed: it does execute the second "uptime" command, so it finishes the mt command and crashes only *after* that, but, if you look at the uptime commands in the loop, the last one is also at 21:17:30, so it does crash quickly after that. I am pretty convinced that this is not hardware related, but could do a swap with a similar drive if that is needed. I am always prepared to do some additional tests. Finally, it doesn't crash every time - I even think it is tape-related, it crashes on some tapes but does not crash on every one... but this may be a random pattern... (I prefer not to crash my server too often ;-) ) uname -a Linux linux 2.4.20-gentoo-r7 #2 SMP Mon Oct 13 17:52:27 CEST 2003 i586 Pentium 75 - 200 GenuineIntel GNU/Linux lspci [...] 00:0f.0 SCSI storage controller: Adaptec AHA-2940/2940W / AIC-7871 cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 01 Lun: 00 Vendor: IBM Model: XP31070W !x Rev: 81K1 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 05 Lun: 00 Vendor: ARCHIVE Model: 4326XX 27871-XXX Rev: 0322 Type: Sequential-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 06 Lun: 00 Vendor: IBM Model: CDRM00103 !I Rev: 1.0g Type: CD-ROM ANSI SCSI revision: 02 tail /var/log/messages [...] Nov 27 21:17:30 linux st0: Error 8 (sugg. bt 0x0, driver bt 0x0, host bt 0x0). while true; do uptime; sleep 1; done [...] 21:17:26 up 17 days, 4:32, 3 users, load average: 0.22, 0.35, 0.22 21:17:28 up 17 days, 4:32, 3 users, load average: 0.22, 0.35, 0.22 21:17:29 up 17 days, 4:32, 3 users, load average: 0.22, 0.35, 0.22 21:17:30 up 17 days, 4:32, 3 users, load average: 0.22, 0.35, 0.22 uptime; strace mt -f /dev/nst0 rewind; uptime 21:17:28 up 17 days, 4:32, 3 users, load average: 0.22, 0.35, 0.22 execve("/usr/sbin/mt", ["mt", "-f", "/dev/nst0", "rewind"], [/* 33 vars */]) = 0 uname({sys="Linux", node="linux", ...}) = 0 brk(0) = 0x804b8f4 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4 0013000 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=29035, ...}) = 0 mmap2(NULL, 29035, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40014000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200Y\1"..., 1024) = 10 24 fstat64(3, {st_mode=S_IFREG|0755, st_size=1432651, ...}) = 0 mmap2(NULL, 1193700, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4001c000 mprotect(0x4013a000, 22244, PROT_NONE) = 0 mmap2(0x4013a000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x11e) = 0x4013a000 mmap2(0x4013d000, 9956, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMO US, -1, 0) = 0x4013d000 close(3) = 0 munmap(0x40014000, 29035) = 0 open("/dev/urandom", O_RDONLY) = 3 read(3, "\243O\235\211\250\27\351,\200a\227R\250\20\205\342a\212"..., 32) = 32 close(3) = 0 open("/dev/nst0", O_RDONLY) = -1 EIO (Input/output error) dup(2) = 3 fcntl64(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE) brk(0) = 0x804b8f4 brk(0x804c8f4) = 0x804c8f4 brk(0) = 0x804c8f4 brk(0x804d000) = 0x804d000 fstat64(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4 0014000 _llseek(3, 0, 0xbffff7d0, SEEK_CUR) = -1 ESPIPE (Illegal seek) write(3, "/dev/nst0: Input/output error\n", 30/dev/nst0: Input/output error ) = 30 close(3) = 0 munmap(0x40014000, 4096) = 0 _exit(1) = ? 21:17:30 up 17 days, 4:32, 3 users, load average: 0.22, 0.35, 0.22
Can you see whether you get the same result with a 2.4.20 kernel?
I am running a 2.4.20 kernel. Do you mean an unpatched version (vanilla-sources)?
Yes.
Ok, I emerged vanilla-sources-2.4.20, copied my .config and did a make oldconfig and compiled, installed and booted the new kernel... mt -f /dev/nst0 rewind Error in /var/log/messages: Dec 3 22:35:53 linux st0: Error 8 (sugg. bt 0x0, driver bt 0x0, host bt 0x0). mt -f /dev/nst0 rewind /var/log/messages: Dec 3 22:36:10 linux st0: Block limits 1 - 16777215 bytes. mt -f /dev/nst0 fsf 1 mt -f /dev/nst0 rewind (no errors) mt -f /dev/nst0 fsf 5 /dev/nst0: Input/output error (error appears after some time) /var/log/messages: Dec 3 22:40:31 linux st0: Error with sense data: Info fld=0x5, Current st09:00: sense key Medium Error Dec 3 22:40:31 linux Additional sense indicates Sequential positioning error mt -f /dev/nst0 eof /dev/nst0: Input/output error (error appears immediately) linux root # mt -f /dev/nst0 rewind /dev/nst0: Input/output error (idem) /var/log/messages: Dec 3 22:41:08 linux st0: Error with sense data: Current st09:00: sense key Hardware Error Dec 3 22:41:08 linux Additional sense indicates Internal target failure (tried it a few extratimes, same result) Finally: strace mt -f /dev/nst0 rewind execve("/usr/sbin/mt", ["mt", "-f", "/dev/nst0", "rewind"], [/* 33 vars */]) = 0 uname({sys="Linux", node="linux", ...}) = 0 brk(0) = 0x804b8f4 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40013000 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=29035, ...}) = 0 mmap2(NULL, 29035, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40014000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200Y\1"..., 1024) = 1024 fstat64(3, {st_mode=S_IFREG|0755, st_size=1432651, ...}) = 0 mmap2(NULL, 1193700, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4001c000 mprotect(0x4013a000, 22244, PROT_NONE) = 0 mmap2(0x4013a000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x11e) = 0x4013a000 mmap2(0x4013d000, 9956, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4013d000 close(3) = 0 munmap(0x40014000, 29035) = 0 open("/dev/urandom", O_RDONLY) = 3 read(3, "2H\305t\221\33l=\316\203\245{\231\34$r~\351\1$\375X\365"..., 32) = 32 close(3) = 0 open("/dev/nst0", O_RDONLY) = 3 ioctl(3, SNDCTL_MIDI_MPUMODE, 0xbffff998) = -1 EIO (Input/output error) dup(2) = 4 fcntl64(4, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE) brk(0) = 0x804b8f4 brk(0x804c8f4) = 0x804c8f4 brk(0) = 0x804c8f4 brk(0x804d000) = 0x804d000 fstat64(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000 _llseek(4, 0, 0xbffff7a0, SEEK_CUR) = -1 ESPIPE (Illegal seek) write(4, "/dev/nst0: Input/output error\n", 30/dev/nst0: Input/output error ) = 30 close(4) = 0 munmap(0x40014000, 4096) = 0 ioctl(3, SNDCTL_MIDI_MPUCMD, 0xbffff980) = 0 close(3) = 0 _exit(2) = ? but... no complete system crashes... :-) This resembles the problems I had on Slackware, before I moved to Gentoo. Hope this helps. Let me know if you need any more. Thanks so far.
Created attachment 21662 [details] Diff file between the two straces For your convenience
... Can you change the CONFIG_PREEMPT [ i.e. on => off; off => on ] flag on your Gentoo kernel to check whether this is a scheduler bug?
I did a "genkernel --config" and changed CONFIG_PREEMPT from off to on, after the kernel had compiled I rebooted and faced a lot of errors (Segfaults), esp. modules were unable to load. I even tried to recompile the kernel (after booting to a vanilla-kernel, since the kernel-compilations segfaulted all over me to). The second try also gave me an unusable kernel. However, I was able to run an "mt -f /dev/nst0 rewind" and that also crashed the server. So it seems the PREEMPT setting is not the culprit.
Just FYI: I am now running linux 2.4.20-gentoo-r9 and the problem is still there...
Good news! It seems the problem has been resolved. I have installed linux-2.4.22-gentoo-r5 and I am not able to reproduce the problem anymore. I have also found the following: http://www.kolumbus.fi/kai.makisara/st-eot.html which seems like the same problem. I checked the diff between linux-2.4.20-gentoo-r9 and linux-2.4.22-gentoo-r5 and the changes mentionned on the page above have indeed been made on st.c I will check if my backup runs fine now and change the bug to resolved for now. I will report in a few days whether the problem has really gone away.
Thanks for the patch reference - I'll add this to -r11...
Patch is in gentoo-sources-2.4.20-r11. Thanks!
I am sorry... but I keep on running into this problem; it seems less frequent than it used to be, but it is still there... I have had the problem with 2.4.25-gentoo and 2.4.22-gentoo-r5.
Hm; can you confirm this on a vanilla-2.4.26 kernel for us?
I'd be happy to check this for you, however, compiling a kernel takes quite some time on my system (busy box, running on 2xP-90), so before I start I'd like to confirm that it is really necessary. I am running 2.4.25-gentoo now, and still having these crashes. And... linux root # diff /usr/src/linux-2.4.25-gentoo/drivers/scsi/st.c /usr/src/linux-2.4.26/drivers/scsi/st.c linux root # IOW, there are no changes to the driver between the version I am running now and vanilla-2.4.26. Again, I have no problems with checking this, but please confirm it is going to help in any way.
Hmmm - In which case can I suggest that you try filing an upstream bug [bugme.osdl.org] since if this is a persistent behaviour with all the various 2.4 kernels you've tried it looks like the bug is lurking upstream there... Sidenote: This does look a bit like flakey hardware - one of your comments says "Sequential positioning error" indicating that it's probably something with the hardware itself, the tape, or possibly something as silly as excess humidity in the tapes. Or even a totally chewed tape. Cleaning it more frequently might help with that. Although; if you're still getting total system hangs then this definitely has to be with either the kernel or with the SCSI controller hanging the whole system bus. Does the drive itself still operate as it usually does; namely do the green // amber LEDs work as they normally do when you get the hang and try to change the tape, for example?
I have decided to compile a 2.4.26 anyway, just to make sure. I have turned on debugging in st.c (as suggested on http://www.kolumbus.fi/kai.makisara/st-eot.html). I am pretty sure it is not a hardware problem, but I do have a spare drive available and can swap it - but it's probably going to be somewhere late next week before I can do that. About filing an upstream bug... feel free to assign the bug to me and let me handle the communication with Kai. If needed, I am even prepared to give him access to my machine, since this bug is driving me crazy ;-) If and when the problem is solved, I would need, however, someone to integrate a patch into the Gentoo-kernel tree. The new kernel is compiling right now. I will do some tests as soon as it is done - and have an opportunity to reboot the system.
I have just installed the 2.4.26 vanilla kernel. I was easily able to reproduce the problem. As said, I have turned DEBUG on and as long as things kept running I indeed saw some more debugging info in my /var/log/messages - but when things went wrong, nothing appeared. I was doing a tail -f /var/log/messages in a separate ssh session. Also monitored the console for messages, but nothing there either. I did, however, find out something interesting: the system keeps on responding to pings, even after it has frozen. I did some tests with ethereal on a different system, and in fact, it was even answering incoming TCP SYN requests with a SYN ACK on for example port 25, where qmail is running - but qmail was no longer answering. Any ideas on what to try next? Oh, about the lights: I've never really figured out what the lights mean, but I think they operate as normal, even after the system hangs. I can also normally retrieve the tape from the drive while the system hangs. I do not think it is caused by a "totally chewed tape" since the problem keeps on occuring on the same tape, but sometimes goes away after some time, after that I can use the tape normally. Also, the tape either hangs the entire machine, or works fine, nothing in between like read errors, write errors, etc. But as said, a tape can change from a "bad tape" to a "good tape" and v.v. I frequently clean my drive.
If the machine is still replying to TCP packets, then this looks like this is a kernel bug or deadlock of some sort. Since we don't maintain the SCSI / Tape code could you please file a bug for this at http://bugme.osdl.org so the upstream kernel developers can have a stab at this? Thanks!
I have sent a message to the developper of this driver, Kai Makisara, since he specifially asks for that on http://bugs.gentoo.org/show_bug.cgi?id=34545; depending on his answer I may log a bug at OSDL too. You can assign this bug to me, if that is convenient for you.
Oops, wrong URL pasted: ...since he specifically asks for that on http://www.kolumbus.fi/kai.makisara/st-eot.html ...
I assume this is in upstreams hands now.
I have contacted Kai Makisara, and received an answer from him within days. He advised me too look at the drivers for my SCSI card, since he didn't think it is related to his drivers. I have downgraded my SCSI drivers from aic7xxx to aic7xxx_old and I haven't seen the problem since I did that. So, this bug can be closed and will log a new bug for that problem as soon as I have time to gather some information...