34545 – System crash on some tape access

Bug 34545 - System crash on some tape access

Summary: System crash on some tape access

Status:	VERIFIED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system (show other bugs)
Hardware:	x86 Linux

Importance:	High major (vote)
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2003-11-27 13:12 UTC by Jeroen Roos
Modified:	2004-09-05 23:42 UTC (History)
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Attachments
Diff file between the two straces (diff.txt,1.45 KB, text/plain) 2003-12-03 14:42 UTC, Jeroen Roos	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jeroen Roos 2003-11-27 13:12:20 UTC

My Gentoo system crashes/hangs when trying to access the tapedrive. 

Anyway, I did the following to reproduce it: from a remote system I started 3 ssh sessions; in one I started a "tail -f /var/log/messages", in one I started "while true; do uptime; sleep 1 done" (this was to see if the system would crash immediately or first have a rising system load and hang then) and finally, in the third ssh session I ran "uptime; strace mt -f /dev/nst0 rewind; uptime". For the output of these commands see below, I have also included other command output of which I thought it would be useful.

I have used the exact same configuration with Slackware Linux, running 2.2.x without problems, and 2.4.21, on the last one I did have some problems, sometimes I was unable to access the tapedrive until I rebooted the machine, but no crashes.

If I look at the output of the strace, there was one thing I particulary noticed: it does execute the second "uptime" command, so it finishes the mt command and crashes only *after* that, but, if you look at the uptime commands in the loop, the last one is also at 21:17:30, so it does crash quickly after that.

I am pretty convinced that this is not hardware related, but could do a swap with a similar drive if that is needed. I am always prepared to do some additional tests.

Finally, it doesn't crash every time - I even think it is tape-related, it crashes on some tapes but does not crash on every one... but this may be a random pattern... (I prefer not to crash my server too often ;-) )



uname -a
Linux linux 2.4.20-gentoo-r7 #2 SMP Mon Oct 13 17:52:27 CEST 2003 i586 Pentium 75 - 200 GenuineIntel GNU/Linux

lspci
[...]
00:0f.0 SCSI storage controller: Adaptec AHA-2940/2940W / AIC-7871

cat /proc/scsi/scsi 
Attached devices: 
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: IBM      Model: XP31070W      !x Rev: 81K1
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 05 Lun: 00
  Vendor: ARCHIVE  Model: 4326XX 27871-XXX Rev: 0322
  Type:   Sequential-Access                ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: IBM      Model: CDRM00103     !I Rev: 1.0g
  Type:   CD-ROM                           ANSI SCSI revision: 02

tail /var/log/messages
[...]
Nov 27 21:17:30 linux st0: Error 8 (sugg. bt 0x0, driver bt 0x0, host bt 0x0).

while true; do uptime; sleep 1; done
[...]
 21:17:26 up 17 days,  4:32,  3 users,  load average: 0.22, 0.35, 0.22
 21:17:28 up 17 days,  4:32,  3 users,  load average: 0.22, 0.35, 0.22
 21:17:29 up 17 days,  4:32,  3 users,  load average: 0.22, 0.35, 0.22
 21:17:30 up 17 days,  4:32,  3 users,  load average: 0.22, 0.35, 0.22


uptime; strace mt -f /dev/nst0 rewind; uptime
 21:17:28 up 17 days,  4:32,  3 users,  load average: 0.22, 0.35, 0.22
 execve("/usr/sbin/mt", ["mt", "-f", "/dev/nst0", "rewind"], [/* 33 vars */]) = 
0
 uname({sys="Linux", node="linux", ...}) = 0
 brk(0)                                  = 0x804b8f4
 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4
0013000
 open("/etc/ld.so.preload", O_RDONLY)    = -1 ENOENT (No such file or directory)
 open("/etc/ld.so.cache", O_RDONLY)      = 3
 fstat64(3, {st_mode=S_IFREG|0644, st_size=29035, ...}) = 0
 mmap2(NULL, 29035, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40014000
 close(3)                                = 0
 open("/lib/libc.so.6", O_RDONLY)        = 3
 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200Y\1"..., 1024) = 10
24
 fstat64(3, {st_mode=S_IFREG|0755, st_size=1432651, ...}) = 0
 mmap2(NULL, 1193700, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4001c000
 mprotect(0x4013a000, 22244, PROT_NONE)  = 0
 mmap2(0x4013a000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x11e)
 = 0x4013a000
 mmap2(0x4013d000, 9956, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMO
US, -1, 0) = 0x4013d000
 close(3)                                = 0
 munmap(0x40014000, 29035)               = 0
 open("/dev/urandom", O_RDONLY)          = 3
 read(3, "\243O\235\211\250\27\351,\200a\227R\250\20\205\342a\212"..., 32) = 32
 close(3)                                = 0
 open("/dev/nst0", O_RDONLY)             = -1 EIO (Input/output error)
 dup(2)                                  = 3
 fcntl64(3, F_GETFL)                     = 0x8002 (flags O_RDWR|O_LARGEFILE)
 brk(0)                                  = 0x804b8f4
 brk(0x804c8f4)                          = 0x804c8f4

 brk(0)                                  = 0x804c8f4
 brk(0x804d000)                          = 0x804d000
 fstat64(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4
0014000
 _llseek(3, 0, 0xbffff7d0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
 write(3, "/dev/nst0: Input/output error\n", 30/dev/nst0: Input/output error
 ) = 30
 close(3)                                = 0
 munmap(0x40014000, 4096)                = 0
 _exit(1)                                = ?
  21:17:30 up 17 days,  4:32,  3 users,  load average: 0.22, 0.35, 0.22

Comment 1 Tim Yamin (RETIRED) gentoo-dev

2003-11-30 05:52:32 UTC

Can you see whether you get the same result with a 2.4.20 kernel?

Comment 2 Jeroen Roos 2003-12-02 01:53:34 UTC

I am running a 2.4.20 kernel. Do you mean an unpatched version (vanilla-sources)?

Comment 3 Tim Yamin (RETIRED) gentoo-dev

2003-12-02 08:56:27 UTC

Yes.

Comment 4 Jeroen Roos 2003-12-03 14:29:40 UTC

Ok, I emerged vanilla-sources-2.4.20, copied my .config and did a make oldconfig and compiled, installed and booted the new kernel...

mt -f /dev/nst0 rewind

Error in /var/log/messages:
Dec  3 22:35:53 linux st0: Error 8 (sugg. bt 0x0, driver bt 0x0, host bt 0x0).

mt -f /dev/nst0 rewind

/var/log/messages:
Dec  3 22:36:10 linux st0: Block limits 1 - 16777215 bytes.

mt -f /dev/nst0 fsf 1
mt -f /dev/nst0 rewind

(no errors)

mt -f /dev/nst0 fsf 5
 /dev/nst0: Input/output error
(error appears after some time)


/var/log/messages:
Dec  3 22:40:31 linux st0: Error with sense data: Info fld=0x5, Current
st09:00: sense key Medium Error
Dec  3 22:40:31 linux Additional sense indicates Sequential positioning error

mt -f /dev/nst0 eof
/dev/nst0: Input/output error
(error appears immediately)

linux root # mt -f /dev/nst0 rewind
/dev/nst0: Input/output error
(idem)

/var/log/messages:
Dec  3 22:41:08 linux st0: Error with sense data: Current st09:00: sense key
Hardware Error
Dec  3 22:41:08 linux Additional sense indicates Internal target failure

(tried it a few extratimes, same result)

Finally:
strace mt -f /dev/nst0 rewind
execve("/usr/sbin/mt", ["mt", "-f", "/dev/nst0", "rewind"], [/* 33 vars */]) =
0
uname({sys="Linux", node="linux", ...}) = 0
brk(0)                                  = 0x804b8f4
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x40013000
open("/etc/ld.so.preload", O_RDONLY)    = -1 ENOENT (No such file or
directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=29035, ...}) = 0
mmap2(NULL, 29035, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40014000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200Y\1"..., 1024) =
1024
fstat64(3, {st_mode=S_IFREG|0755, st_size=1432651, ...}) = 0
mmap2(NULL, 1193700, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4001c000
mprotect(0x4013a000, 22244, PROT_NONE)  = 0
mmap2(0x4013a000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3,
0x11e) = 0x4013a000
mmap2(0x4013d000, 9956, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4013d000
close(3)                                = 0
munmap(0x40014000, 29035)               = 0
open("/dev/urandom", O_RDONLY)          = 3
read(3, "2H\305t\221\33l=\316\203\245{\231\34$r~\351\1$\375X\365"..., 32) = 32
close(3)                                = 0
open("/dev/nst0", O_RDONLY)             = 3
ioctl(3, SNDCTL_MIDI_MPUMODE, 0xbffff998) = -1 EIO (Input/output error)
dup(2)                                  = 4
fcntl64(4, F_GETFL)                     = 0x8002 (flags O_RDWR|O_LARGEFILE)
brk(0)                                  = 0x804b8f4
brk(0x804c8f4)                          = 0x804c8f4
brk(0)                                  = 0x804c8f4
brk(0x804d000)                          = 0x804d000
fstat64(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x40
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x40014000
_llseek(4, 0, 0xbffff7a0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
write(4, "/dev/nst0: Input/output error\n", 30/dev/nst0: Input/output error
) = 30
close(4)                                = 0
munmap(0x40014000, 4096)                = 0
ioctl(3, SNDCTL_MIDI_MPUCMD, 0xbffff980) = 0
close(3)                                = 0
_exit(2)                                = ?


but... no complete system crashes... :-) This resembles the problems I had on
Slackware, before I moved to Gentoo.

Hope this helps.  Let me know if you need any more.

Thanks so far.

Comment 5 Jeroen Roos 2003-12-03 14:42:42 UTC

Created attachment 21662 [details]
Diff file between the two straces

For your convenience

Comment 6 Tim Yamin (RETIRED) gentoo-dev

2003-12-04 09:26:23 UTC

... Can you change the CONFIG_PREEMPT [ i.e. on => off; off => on ] flag on your Gentoo kernel to check whether this is a scheduler bug?

Comment 7 Jeroen Roos 2003-12-14 05:55:14 UTC

I did a "genkernel --config" and changed CONFIG_PREEMPT from off to on, after the kernel had compiled I rebooted and faced a lot of errors (Segfaults), esp. modules were unable to load. I even tried to recompile the kernel (after booting to a vanilla-kernel, since the kernel-compilations segfaulted all over me to). The second try also gave me an unusable kernel. However, I was able to run an "mt -f /dev/nst0 rewind" and that also crashed the server. So it seems the PREEMPT setting is not the culprit.

Comment 8 Jeroen Roos 2003-12-15 12:57:57 UTC

Just FYI: I am now running linux 2.4.20-gentoo-r9 and the problem is still there...

Comment 9 Jeroen Roos 2004-02-03 13:18:06 UTC

Good news! It seems the problem has been resolved. 

I have installed linux-2.4.22-gentoo-r5 and I am not able to reproduce the problem anymore.

I have also found the following: http://www.kolumbus.fi/kai.makisara/st-eot.html which seems like the same problem. I checked the diff between linux-2.4.20-gentoo-r9 and linux-2.4.22-gentoo-r5 and the changes mentionned on the page above have indeed been made on st.c

I will check if my backup runs fine now and change the bug to resolved for now. I will report in a few days whether the problem has really gone away.

Comment 10 Tim Yamin (RETIRED) gentoo-dev

2004-02-03 13:46:17 UTC

Thanks for the patch reference - I'll add this to -r11...

Comment 11 Tim Yamin (RETIRED) gentoo-dev

2004-02-08 08:08:06 UTC

Patch is in gentoo-sources-2.4.20-r11. Thanks!

Comment 12 Jeroen Roos 2004-04-29 16:16:04 UTC

I am sorry... but I keep on running into this problem; it seems less frequent than it used to be, but it is still there... I have had the problem with 2.4.25-gentoo and 2.4.22-gentoo-r5.

Comment 13 Tim Yamin (RETIRED) gentoo-dev

2004-06-21 14:15:24 UTC

Hm; can you confirm this on a vanilla-2.4.26 kernel for us?

Comment 14 Jeroen Roos 2004-06-22 11:51:55 UTC

I'd be happy to check this for you, however, compiling a kernel takes quite some time on my system (busy box, running on 2xP-90), so before I start I'd like to confirm that it is really necessary. I am running 2.4.25-gentoo now, and still having these crashes. And...
linux root # diff /usr/src/linux-2.4.25-gentoo/drivers/scsi/st.c /usr/src/linux-2.4.26/drivers/scsi/st.c 
linux root # 

IOW, there are no changes to the driver between the version I am running now and vanilla-2.4.26. Again, I have no problems with checking this, but please confirm it is going to help in any way.

Comment 15 Tim Yamin (RETIRED) gentoo-dev

2004-06-22 12:14:42 UTC

Hmmm - In which case can I suggest that you try filing an upstream bug [bugme.osdl.org] since if this is a persistent behaviour with all the various 2.4 kernels you've tried it looks like the bug is lurking upstream there...

Sidenote: This does look a bit like flakey hardware - one of your comments says "Sequential positioning error" indicating that it's probably something with the hardware itself, the tape, or possibly something as silly as excess humidity in the tapes. Or even a totally chewed tape. Cleaning it more frequently might help with that.

Although; if you're still getting total system hangs then this definitely has to be with either the kernel or with the SCSI controller hanging the whole system bus.

Does the drive itself still operate as it usually does; namely do the green // amber LEDs work as they normally do when you get the hang and try to change the tape, for example?

Comment 16 Jeroen Roos 2004-06-23 09:14:05 UTC

I have decided to compile a 2.4.26 anyway, just to make sure. I have turned on debugging in st.c (as suggested on http://www.kolumbus.fi/kai.makisara/st-eot.html).

I am pretty sure it is not a hardware problem, but I do have a spare drive available and can swap it - but it's probably going to be somewhere late next week before I can do that.

About filing an upstream bug... feel free to assign the bug to me and let me handle the communication with Kai. If needed, I am even prepared to give him access to my machine, since this bug is driving me crazy ;-) If and when the problem is solved, I would need, however, someone to integrate a patch into the Gentoo-kernel tree.

The new kernel is compiling right now. I will do some tests as soon as it is done - and have an opportunity to reboot the system.

Comment 17 Jeroen Roos 2004-06-23 14:26:31 UTC

I have just installed the 2.4.26 vanilla kernel. I was easily able to reproduce the problem. As said, I have turned DEBUG on and as long as things kept running I indeed saw some more debugging info in my /var/log/messages - but when things went wrong, nothing appeared. I was doing a tail -f /var/log/messages in a separate ssh session. Also monitored the console for messages, but nothing there either.

I did, however, find out something interesting: the system keeps on responding to pings, even after it has frozen. I did some tests with ethereal on a different system, and in fact, it was even answering incoming TCP SYN requests with a SYN ACK on for example port 25, where qmail is running - but qmail was no longer answering. 

Any ideas on what to try next?

Oh, about the lights: I've never really figured out what the lights mean, but I think they operate as normal, even after the system hangs. I can also normally retrieve the tape from the drive while the system hangs.

I do not think it is caused by a "totally chewed tape" since the problem keeps on occuring on the same tape, but sometimes goes away after some time, after that I can use the tape normally. Also, the tape either hangs the entire machine, or works fine, nothing in between like read errors, write errors, etc. But as said, a tape can change from a "bad tape" to a "good tape" and v.v.
I frequently clean my drive.

Comment 18 Tim Yamin (RETIRED) gentoo-dev

2004-07-03 13:05:32 UTC

If the machine is still replying to TCP packets, then this looks like this is a kernel bug or deadlock of some sort. Since we don't maintain the SCSI / Tape code could you please file a bug for this at http://bugme.osdl.org so the upstream kernel developers can have a stab at this? Thanks!

Comment 19 Jeroen Roos 2004-07-03 14:37:28 UTC

I have sent a message to the developper of this driver, Kai Makisara, since he specifially asks for that on http://bugs.gentoo.org/show_bug.cgi?id=34545; depending on his answer I may log a bug at OSDL too.
You can assign this bug to me, if that is convenient for you.

Comment 20 Jeroen Roos 2004-07-03 14:39:00 UTC

Oops, wrong URL pasted:
...since he specifically asks for that on http://www.kolumbus.fi/kai.makisara/st-eot.html ...

Comment 21 Daniel Drake (RETIRED) gentoo-dev

2004-09-05 16:08:37 UTC

I assume this is in upstreams hands now.

Comment 22 Jeroen Roos 2004-09-05 23:42:39 UTC

I have contacted Kai Makisara, and received an answer from him within days. He advised me too look at the drivers for my SCSI card, since he didn't think it is related to his drivers. I have downgraded my SCSI drivers from aic7xxx to aic7xxx_old and I haven't seen the problem since I did that. So, this bug can be closed and will log a new bug for that problem as soon as I have time to gather some information...