Summary: | System crash on some tape access | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Jeroen Roos <bugzilla-gentoo> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | VERIFIED UPSTREAM | ||
Severity: | major | ||
Priority: | High | ||
Version: | 1.4 | ||
Hardware: | x86 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | Diff file between the two straces |
Description
Jeroen Roos
2003-11-27 13:12:20 UTC
Can you see whether you get the same result with a 2.4.20 kernel? I am running a 2.4.20 kernel. Do you mean an unpatched version (vanilla-sources)? Yes. Ok, I emerged vanilla-sources-2.4.20, copied my .config and did a make oldconfig and compiled, installed and booted the new kernel... mt -f /dev/nst0 rewind Error in /var/log/messages: Dec 3 22:35:53 linux st0: Error 8 (sugg. bt 0x0, driver bt 0x0, host bt 0x0). mt -f /dev/nst0 rewind /var/log/messages: Dec 3 22:36:10 linux st0: Block limits 1 - 16777215 bytes. mt -f /dev/nst0 fsf 1 mt -f /dev/nst0 rewind (no errors) mt -f /dev/nst0 fsf 5 /dev/nst0: Input/output error (error appears after some time) /var/log/messages: Dec 3 22:40:31 linux st0: Error with sense data: Info fld=0x5, Current st09:00: sense key Medium Error Dec 3 22:40:31 linux Additional sense indicates Sequential positioning error mt -f /dev/nst0 eof /dev/nst0: Input/output error (error appears immediately) linux root # mt -f /dev/nst0 rewind /dev/nst0: Input/output error (idem) /var/log/messages: Dec 3 22:41:08 linux st0: Error with sense data: Current st09:00: sense key Hardware Error Dec 3 22:41:08 linux Additional sense indicates Internal target failure (tried it a few extratimes, same result) Finally: strace mt -f /dev/nst0 rewind execve("/usr/sbin/mt", ["mt", "-f", "/dev/nst0", "rewind"], [/* 33 vars */]) = 0 uname({sys="Linux", node="linux", ...}) = 0 brk(0) = 0x804b8f4 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40013000 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=29035, ...}) = 0 mmap2(NULL, 29035, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40014000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200Y\1"..., 1024) = 1024 fstat64(3, {st_mode=S_IFREG|0755, st_size=1432651, ...}) = 0 mmap2(NULL, 1193700, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4001c000 mprotect(0x4013a000, 22244, PROT_NONE) = 0 mmap2(0x4013a000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x11e) = 0x4013a000 mmap2(0x4013d000, 9956, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4013d000 close(3) = 0 munmap(0x40014000, 29035) = 0 open("/dev/urandom", O_RDONLY) = 3 read(3, "2H\305t\221\33l=\316\203\245{\231\34$r~\351\1$\375X\365"..., 32) = 32 close(3) = 0 open("/dev/nst0", O_RDONLY) = 3 ioctl(3, SNDCTL_MIDI_MPUMODE, 0xbffff998) = -1 EIO (Input/output error) dup(2) = 4 fcntl64(4, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE) brk(0) = 0x804b8f4 brk(0x804c8f4) = 0x804c8f4 brk(0) = 0x804c8f4 brk(0x804d000) = 0x804d000 fstat64(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000 _llseek(4, 0, 0xbffff7a0, SEEK_CUR) = -1 ESPIPE (Illegal seek) write(4, "/dev/nst0: Input/output error\n", 30/dev/nst0: Input/output error ) = 30 close(4) = 0 munmap(0x40014000, 4096) = 0 ioctl(3, SNDCTL_MIDI_MPUCMD, 0xbffff980) = 0 close(3) = 0 _exit(2) = ? but... no complete system crashes... :-) This resembles the problems I had on Slackware, before I moved to Gentoo. Hope this helps. Let me know if you need any more. Thanks so far. Created attachment 21662 [details]
Diff file between the two straces
For your convenience
... Can you change the CONFIG_PREEMPT [ i.e. on => off; off => on ] flag on your Gentoo kernel to check whether this is a scheduler bug? I did a "genkernel --config" and changed CONFIG_PREEMPT from off to on, after the kernel had compiled I rebooted and faced a lot of errors (Segfaults), esp. modules were unable to load. I even tried to recompile the kernel (after booting to a vanilla-kernel, since the kernel-compilations segfaulted all over me to). The second try also gave me an unusable kernel. However, I was able to run an "mt -f /dev/nst0 rewind" and that also crashed the server. So it seems the PREEMPT setting is not the culprit. Just FYI: I am now running linux 2.4.20-gentoo-r9 and the problem is still there... Good news! It seems the problem has been resolved. I have installed linux-2.4.22-gentoo-r5 and I am not able to reproduce the problem anymore. I have also found the following: http://www.kolumbus.fi/kai.makisara/st-eot.html which seems like the same problem. I checked the diff between linux-2.4.20-gentoo-r9 and linux-2.4.22-gentoo-r5 and the changes mentionned on the page above have indeed been made on st.c I will check if my backup runs fine now and change the bug to resolved for now. I will report in a few days whether the problem has really gone away. Thanks for the patch reference - I'll add this to -r11... Patch is in gentoo-sources-2.4.20-r11. Thanks! I am sorry... but I keep on running into this problem; it seems less frequent than it used to be, but it is still there... I have had the problem with 2.4.25-gentoo and 2.4.22-gentoo-r5. Hm; can you confirm this on a vanilla-2.4.26 kernel for us? I'd be happy to check this for you, however, compiling a kernel takes quite some time on my system (busy box, running on 2xP-90), so before I start I'd like to confirm that it is really necessary. I am running 2.4.25-gentoo now, and still having these crashes. And... linux root # diff /usr/src/linux-2.4.25-gentoo/drivers/scsi/st.c /usr/src/linux-2.4.26/drivers/scsi/st.c linux root # IOW, there are no changes to the driver between the version I am running now and vanilla-2.4.26. Again, I have no problems with checking this, but please confirm it is going to help in any way. Hmmm - In which case can I suggest that you try filing an upstream bug [bugme.osdl.org] since if this is a persistent behaviour with all the various 2.4 kernels you've tried it looks like the bug is lurking upstream there... Sidenote: This does look a bit like flakey hardware - one of your comments says "Sequential positioning error" indicating that it's probably something with the hardware itself, the tape, or possibly something as silly as excess humidity in the tapes. Or even a totally chewed tape. Cleaning it more frequently might help with that. Although; if you're still getting total system hangs then this definitely has to be with either the kernel or with the SCSI controller hanging the whole system bus. Does the drive itself still operate as it usually does; namely do the green // amber LEDs work as they normally do when you get the hang and try to change the tape, for example? I have decided to compile a 2.4.26 anyway, just to make sure. I have turned on debugging in st.c (as suggested on http://www.kolumbus.fi/kai.makisara/st-eot.html). I am pretty sure it is not a hardware problem, but I do have a spare drive available and can swap it - but it's probably going to be somewhere late next week before I can do that. About filing an upstream bug... feel free to assign the bug to me and let me handle the communication with Kai. If needed, I am even prepared to give him access to my machine, since this bug is driving me crazy ;-) If and when the problem is solved, I would need, however, someone to integrate a patch into the Gentoo-kernel tree. The new kernel is compiling right now. I will do some tests as soon as it is done - and have an opportunity to reboot the system. I have just installed the 2.4.26 vanilla kernel. I was easily able to reproduce the problem. As said, I have turned DEBUG on and as long as things kept running I indeed saw some more debugging info in my /var/log/messages - but when things went wrong, nothing appeared. I was doing a tail -f /var/log/messages in a separate ssh session. Also monitored the console for messages, but nothing there either. I did, however, find out something interesting: the system keeps on responding to pings, even after it has frozen. I did some tests with ethereal on a different system, and in fact, it was even answering incoming TCP SYN requests with a SYN ACK on for example port 25, where qmail is running - but qmail was no longer answering. Any ideas on what to try next? Oh, about the lights: I've never really figured out what the lights mean, but I think they operate as normal, even after the system hangs. I can also normally retrieve the tape from the drive while the system hangs. I do not think it is caused by a "totally chewed tape" since the problem keeps on occuring on the same tape, but sometimes goes away after some time, after that I can use the tape normally. Also, the tape either hangs the entire machine, or works fine, nothing in between like read errors, write errors, etc. But as said, a tape can change from a "bad tape" to a "good tape" and v.v. I frequently clean my drive. If the machine is still replying to TCP packets, then this looks like this is a kernel bug or deadlock of some sort. Since we don't maintain the SCSI / Tape code could you please file a bug for this at http://bugme.osdl.org so the upstream kernel developers can have a stab at this? Thanks! I have sent a message to the developper of this driver, Kai Makisara, since he specifially asks for that on http://bugs.gentoo.org/show_bug.cgi?id=34545; depending on his answer I may log a bug at OSDL too. You can assign this bug to me, if that is convenient for you. Oops, wrong URL pasted: ...since he specifically asks for that on http://www.kolumbus.fi/kai.makisara/st-eot.html ... I assume this is in upstreams hands now. I have contacted Kai Makisara, and received an answer from him within days. He advised me too look at the drivers for my SCSI card, since he didn't think it is related to his drivers. I have downgraded my SCSI drivers from aic7xxx to aic7xxx_old and I haven't seen the problem since I did that. So, this bug can be closed and will log a new bug for that problem as soon as I have time to gather some information... |