182606 – SCSI error: return code = 0x00040000 -> sata-drive isn't working, rootfs gone

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	x86 Linux

Importance:	High critical
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2007-06-19 19:48 UTC by sECuRE
Modified:	2007-09-07 23:19 UTC
CC List:	1 user

See Also:
Package list:
Runtime testing required:	---

Attachments
current dmesg-output (dmesg.txt,25.11 KB, text/plain) 2007-06-27 17:35 UTC, sECuRE
current kernel configuration (kernel_config.txt,9.06 KB, text/plain) 2007-06-27 17:35 UTC, sECuRE

Description sECuRE 2007-06-19 19:48:39 UTC

My system is an AMD XP 2400+ with 1 GB of RAM on an EPoX 8RDA3+ with a WD Raptor 74GB SATA HDD. Since switching to Gentoo (it could also be the newer kernel, I don't know; it ran Debian before) I have been encountering a strange problem. Once in a while (about once a month), the machine just times out. Today, I was there "live" when it happened. I managed to scp the dmesg executable and the libraries it needs onto the machine (yes, sshd and other running apps still work fine), and this is what the dmesg output looks like:

sd 3:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdc, sector 19220819
Buffer I/O error on device dm-0, logical block 1898306
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898307
lost page write due to I/O error on dm-0
sd 3:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdc, sector 19220835
Buffer I/O error on device dm-0, logical block 1898308
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898309
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898310
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898311
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898312
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898313
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898314
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1898315
lost page write due to I/O error on dm-0
sd 3:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdc, sector 19220947
sd 3:0:0:0: SCSI error: return code = 0x00040000

And so on. As you can imagine, the root filesystem was gone, so the only way to get this "fixed" was a reboot. After that, the system worked fine again, so I guess it's a software bug. To be sure, I swapped the SATA cable after it happened. The controller seems to work fine, as the other HDDs were still accessible, and they handle much more traffic.
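
For reference, the return code in the log can be split into its component bytes. The sketch below assumes the usual Linux SCSI mid-layer layout (status, message, host and driver bytes, from low to high) and the host-byte names from the kernel's include/scsi/scsi.h; the helper name decode_scsi_result is made up for illustration.

```python
# Host-byte names as defined in the Linux kernel's include/scsi/scsi.h.
# Assumed layout of the result word:
#   status | (msg << 8) | (host << 16) | (driver << 24)
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(code):
    """Split a 32-bit SCSI result word into its four bytes (hypothetical helper)."""
    host = (code >> 16) & 0xFF
    return {
        "status": code & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "host": host,
        "driver": (code >> 24) & 0xFF,
        "host_name": HOST_BYTES.get(host, "unknown"),
    }


print(decode_scsi_result(0x00040000))
```

Under that assumption, only the host byte is set in 0x00040000 (0x04, DID_BAD_TARGET), which would mean the error is raised on the host side rather than returned as a SCSI status by the drive. That alone does not say whether the root cause is hardware or software.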

Reproducible: Couldn't Reproduce

Steps to Reproduce:

Actual Results:  
/

Expected Results:  
/

18:51:17 up 32 days,  5:53,  4 users,  load average: 1.87, 1.98, 1.90
Load is so high because some running programs try accessing the gone HDD continuously

Running processes are all normal, nothing unusual.

Comment 1 Jakub Moc (RETIRED) gentoo-dev 2007-06-19 19:52:22 UTC

> lost page write due to I/O error

So; how exactly is something that appears as a clear hardware error a Gentoo-specific issue?

Comment 2 sECuRE 2007-06-19 20:04:07 UTC

(In reply to comment #1)
> So; how exactly is something that appears as a clear hardware error a
> Gentoo-specific issue?
> 
As I said, I think it's a software error, because the controller is OK and the hard disk as well. Can you rule out that it's a kernel bug in the SATA driver?

Comment 3 Duane Griffin 2007-06-27 16:54:00 UTC

Could you please post the entire dmesg following the failure, if you still have it. If not then a full dmesg from the system now would be better than nothing. Also, please post your kernel config.

It might also be useful to get a SMART report for the hard drive. If you haven't already, emerge sys-apps/smartmontools, then post the output from:

/usr/sbin/smartctl -data --all /dev/sdc

It might be worth doing a full SMART self-test on the drive while you're about it.

Comment 4 sECuRE 2007-06-27 17:35:01 UTC

(In reply to comment #3)
> Could you please post the entire dmesg following the failure, if you still have
> it. If not then a full dmesg from the system now would be better than nothing.
Unfortunately, I don't have the one from the crash, but here is the current one:

> Also, please post your kernel config.
I've attached it.

> /usr/sbin/smartctl -data --all /dev/sdc
# smartctl -data --all /dev/sdc
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Raptor family
Device Model:     WDC WD740GD-00FLA2
Serial Number:    WD-WMAKE1865870
Firmware Version: 31.08F31
User Capacity:    74,355,769,344 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jun 27 19:10:24 2007 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (1725) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   120   119   021    Pre-fail  Always       -       4533
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       474
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000b   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       16049
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       459
194 Temperature_Celsius     0x0022   117   095   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0009   200   179   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

> It might be worth doing a full SMART self-test on the drive while you're about
> it.
I'm not sure if "long" means full but it was closest to it (from all the options in smartctl -h for -t). Unfortunately, the device does not seem to support logging the selftests?!
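
A quick way to sanity-check the attribute table above is to parse it and compare each normalized VALUE with its THRESH. This is only a sketch: the function names are hypothetical, the sample rows are copied from the smartctl output in this comment, and the "value at or below a non-zero threshold means failing" rule is the commonly documented SMART convention, not something specific to this drive.

```python
import re

# Matches one row of the "Vendor Specific SMART Attributes" table
# as printed by smartctl (ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: {...}} for every table row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = {
                "id": int(m.group(1)),
                "value": int(m.group(4)),
                "worst": int(m.group(5)),
                "thresh": int(m.group(6)),
                "raw": int(m.group(10)),
            }
    return attrs


def below_threshold(attrs):
    """Names of attributes whose normalized value has reached the threshold."""
    return [n for n, a in attrs.items()
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
"""

attrs = parse_attributes(SAMPLE)
print(below_threshold(attrs))
print({n: a["raw"] for n, a in attrs.items() if a["raw"] > 0})
```

For the three sample rows nothing is at or below its threshold; the only non-zero raw values are the 3 reallocated sectors and the single UDMA CRC error.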

Comment 5 sECuRE 2007-06-27 17:35:35 UTC

Created attachment 123225 [details]
current dmesg-output

Comment 6 sECuRE 2007-06-27 17:35:55 UTC

Created attachment 123226 [details]
current kernel configuration

Comment 7 Duane Griffin 2007-06-28 10:32:53 UTC

Your SMART attributes all seem fine, nothing scary there. I'd recommend you look into getting the SMART daemon set up and running, just as a matter of general good practice (especially considering how many disks you have in that thing!), but that isn't relevant to our purpose here.

I see you are using an IDE driver for the nForce controller and an ATA driver for the Promise controller. There shouldn't be anything wrong with that, but you may want to consider switching to the new ATA drivers for your nForce as well. However, I doubt that this has anything to do with the bug you are seeing, and it will mean that all your drive device names change.

There isn't too much more that I can think of to try at this point, without the full dmesg when the disk fails. You may want to consider setting up netconsole so you can capture the error when it occurs. This is described in Documentation/networking/netconsole.txt under your kernel source tree.
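
As a rough illustration of the netconsole setup, the sketch below builds the boot parameter and provides a minimal UDP listener for the receiving machine. The parameter syntax is written from memory of netconsole.txt (netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]), so check it against the file in your own kernel tree; the addresses, ports, interface name and function names are all hypothetical.

```python
import socket


def netconsole_param(src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac=""):
    """Build a netconsole= boot parameter (syntax assumed from netconsole.txt)."""
    return "netconsole=%d@%s/%s,%d@%s/%s" % (
        src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac)


def listen(port, bind_ip="0.0.0.0"):
    """Print every UDP datagram received on port; run on the logging host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_ip, port))
    while True:
        data, _addr = sock.recvfrom(4096)
        print(data.decode("utf-8", "replace"), end="")


# Hypothetical addresses: the failing box is .10, the logging host is .20.
print(netconsole_param(6665, "192.168.0.10", "eth0", 6666, "192.168.0.20"))
```

On the logging host, calling listen(6666) and redirecting the output to a file would keep the kernel messages even after the root filesystem on the failing machine is gone.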

Another thing to check is whether you have a large enough power supply for your hardware. The reports I've found of similar errors were caused by the SATA port not properly handling a reset while trying to recover from an error. In one report the root cause was an under-powered power supply that occasionally couldn't keep up. This caused a transient disk failure, the error recovery failed in turn, and they ended up in the sort of situation that you are seeing.

Comment 8 sECuRE 2007-06-29 16:32:21 UTC

(In reply to comment #7)
> Your SMART attributes all seem fine, nothing scary there. I'd recommend you
> look into getting the SMART daemon setup and running, just as a matter of
> general good practice (especially considering how many disks you have in that
> thing!), but that isn't relevant to our purpose here.
Thanks, I've enabled it now.

> I see you are using an IDE driver for the nForce controller and an ATA driver
> for the promise controller. There shouldn't be anything wrong with that, but
> you may want to consider switching to using the new ATA drivers for your nForce
> as well. However I doubt that this has anything to do with the bug you are
> seeing, and it will mean your drive device names will all be changed.
Ah, I thought I'd need the nForce-driver anyway for other features of this board? Seems like I'm wrong. Which parameters would need to be changed for this exactly (I'll enable it when I have to do some maintenance in that area anyway)?

> There isn't too much more that I can think of to try at this point, without the
> full dmesg when the disk fails. You may want to consider setting up netconsole
> so you can capture the error when it occurs. This is described in
> Documentation/networking/netconsole.txt under your kernel source tree.
I've also enabled it, however it will only be active after the next reboot, as there's no other configuration possibility than using a boot-parameter (I've built it in as you can see in my kernel config), right?

> Another thing to check is whether you have a large enough power supply for your
> hardware. The reports I've found of similar errors were caused by the SATA port
> not properly handling a reset while trying to recover from an error. In one
> report the root cause was down to an under-powered power supply that
> occasionally couldn't keep up. This caused a transient disk failure, the error
> recovery failed in turn, and they end up in the sort of situation that you are
> seeing.
That's an interesting point. However, the system has a 550W BeQuiet! power supply and this should be enough. Also, the system worked fine for several months using Debian (no hardware changes).

Thank you for all the tips you gave :-).

Comment 9 Duane Griffin 2007-06-29 17:49:21 UTC

(In reply to comment #8)
> Ah, I thought I'd need the nForce-driver anyway for other features of this
> board? Seems like I'm wrong.

The non-[SP]ATA functions of the chipset shouldn't be affected by the choice of IDE or libata driver. As far as I know the libata nv driver is fully functional. Of course, libata PATA support is still marked as experimental. It *should* be safe to use, though.

You have backups, right? ;)

> Which parameters would need to be changed for this
> exactly (I'll enable it when I have to do some maintenance in that area
> anyway)?

You already have the libata drivers compiled in, along with the IDE drivers. If you simply deselect IDE (the "ATA/ATAPI/MFM/RLL support" config option) then the libata nv driver should take over driving your hardware.

This will mean that your disk device names will change. My guess would be that what is currently hd[abc] would be renamed to sd[abc] and what is currently sd[abc] would then be sd[def]. But it might be the other way around.

Anywhere you use the device name will have to be changed, e.g. /etc/fstab, your grub & SMART configs, etc. If you are using disk labels or IDs instead of the device name then you won't need to worry. You can also use udev rules to set specific disks to specific device names.
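
To make the renaming concrete, here is a small sketch that rewrites the device field of fstab-style lines according to a mapping. The mapping and the sample lines are purely hypothetical (they follow the hd[abc] to sd[abc] guess above, which may well be wrong for this board), and remap_fstab is an illustrative helper, not an existing tool. Entries using LABEL= or UUID= are left alone.

```python
def remap_fstab(text, mapping):
    """Rewrite the first (device) field of each fstab line using mapping.

    mapping maps whole-disk names, e.g. {"/dev/hda": "/dev/sda"};
    a trailing partition number is carried over unchanged.
    """
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)          # keep blank lines and comments as-is
            continue
        device = line.split(None, 1)[0]
        base = device.rstrip("0123456789")
        part = device[len(base):]
        if base in mapping:
            # Only the first occurrence (the device field) is replaced.
            line = line.replace(device, mapping[base] + part, 1)
        out.append(line)
    return "\n".join(out)


# Hypothetical mapping and fstab content, for illustration only.
MAPPING = {"/dev/hda": "/dev/sda", "/dev/sdc": "/dev/sdf"}
SAMPLE = """\
# <fs>      <mountpoint>  <type>  <opts>   <dump/pass>
/dev/sdc1   /boot         ext2    noatime  1 2
/dev/hda1   /data         ext3    noatime  0 2
LABEL=home  /home         ext3    noatime  0 2"""

print(remap_fstab(SAMPLE, MAPPING))
```

Each line is matched against the original names only once, so a disk renamed to sda is not renamed again by a later sda rule.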

It all sounds a bit scary but if you are comfortable booting off CD, mounting your system disk, and editing files until you get it right, then you should be fine. Oh, and did I mention backups already? ;)

Seriously, don't embark on this unless you are comfortable with it.

> I've also enabled it, however it will only be active after the next reboot, as
> there's no other configuration possibility than using a boot-parameter (I've
> built it in as you can see in my kernel config), right?

Yep, that's right.

> That's an interesting point. However, the system has a 550W BeQuiet! power
> supply and this should be enough. Also, the system worked fine for several
> months using Debian (no hardware changes).

The smooth operation under Debian does argue against a hardware issue, although these things can be rather subtle. For example, if all your disks spin up together they will draw a much higher load than if they spin up one-by-one (or stay spun up). Changing the OS/distro could change the disk access pattern and make this more likely. However, this is just idle speculation at this point.

If you wanted to investigate this sort of thing you could try stressing your disks. E.g. use a script to spin down then start a benchmark on all of them simultaneously (with hdparm), inside a loop, continuously for a few hours. I'm not sure it would be worth the effort though. A negative result wouldn't be definitive anyway.
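
A stress loop along those lines could look like the sketch below. It assumes hdparm's -y (put the drive into standby, which usually spins it down) and -t (timed device reads) options; the function names and device list are hypothetical, and by default nothing is executed, the commands are only returned for inspection.

```python
import subprocess


def stress_commands(devices):
    """Return (spin-down commands, benchmark commands) for the given disks."""
    spin_down = [["hdparm", "-y", dev] for dev in devices]   # standby / spin down
    benchmark = [["hdparm", "-t", dev] for dev in devices]   # timed device reads
    return spin_down, benchmark


def run_round(devices, dry_run=True):
    """Spin all disks down, then start a benchmark on all of them at once.

    With dry_run=True (the default) nothing is executed; the command list
    is only returned. Running for real needs root.
    """
    spin_down, benchmark = stress_commands(devices)
    if not dry_run:
        for cmd in spin_down:
            subprocess.run(cmd, check=False)
        # Start all benchmarks together so the disks spin up simultaneously.
        procs = [subprocess.Popen(cmd) for cmd in benchmark]
        for proc in procs:
            proc.wait()
    return spin_down + benchmark


# Hypothetical device names.
print(run_round(["/dev/sda", "/dev/sdb"]))
```

Calling run_round(devices, dry_run=False) repeatedly for a few hours would approximate the test described above. Do not include a disk you cannot afford to lose access to, and keep in mind that a clean run proves nothing either way.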

> Thank you for all the tips you gave :-).

No worries! If you manage to get a full dmesg out then update the bug with it. That might help us figure out the root cause.

Comment 10 Duane Griffin 2007-07-10 00:00:47 UTC

FYI: I've just noticed a post from Alan Cox on LKML where he mentioned that "there are some cases where trying to load both old and new IDE support for the same chip will do strange things."

I suggest you try disabling libata support for your nForce controller (SATA_NV=n). Note that this will NOT affect your device nodes or do anything else at all scary. I'd be interested in hearing whether this helped, although given how infrequent the problem was it might be hard to tell.
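
To see at a glance whether both driver stacks are enabled, a kernel .config can be scanned for the relevant options. The sketch below is illustrative only: the sample text is not the attached configuration, config_state is a made-up helper, and CONFIG_IDE as the symbol behind "ATA/ATAPI/MFM/RLL support" is an assumption to verify in menuconfig.

```python
def config_state(text, option):
    """Return the value of CONFIG_<option> in a kernel .config.

    Returns "y", "m" or another assigned value, "n" for an explicit
    "is not set" line, or None when the option does not appear at all.
    """
    name = "CONFIG_" + option
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
        if line == "# %s is not set" % name:
            return "n"
    return None


# Hypothetical excerpt, NOT the configuration attached to this bug.
SAMPLE = """\
CONFIG_IDE=y
CONFIG_SATA_NV=y
# CONFIG_SATA_SIS is not set
"""

for opt in ("IDE", "SATA_NV", "SATA_SIS"):
    print(opt, config_state(SAMPLE, opt))
```

If both the IDE option and SATA_NV come back as "y", the old and new drivers are both built in and may compete for the same chip, which is the situation the LKML remark warns about.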

Comment 11 sECuRE 2007-07-11 12:33:52 UTC

(In reply to comment #10)
> FYI: I've just noticed a post from Alan Cox on LKML where he mentioned that
> "there are some cases where trying to load both old and new IDE support for the
> same chip will do strange things."
> 
> I suggest you try disabling libata support for your nForce controller
> (SATA_NV=n). Note that this will NOT affect your device nodes or do anything
> else at all scary. I'd be interested in hearing whether this helped, although
> given how infrequent the problem was it might be hard to tell.

Thanks for the hint, I've disabled it and will let you know if it still happens after the next reboot (this may take some time..)

Comment 12 Daniel Drake (RETIRED) gentoo-dev 2007-07-22 04:43:08 UTC

Please also make sure you are running 2.6.22 for new libata EH. Next time this happens, please attach the full dmesg from that session, otherwise we have to guess...

Comment 13 Daniel Drake (RETIRED) gentoo-dev 2007-08-05 22:20:30 UTC

Please reopen if you see this again.

Comment 14 sECuRE 2007-09-07 18:43:26 UTC

(In reply to comment #10)
> (SATA_NV=n). Note that this will NOT affect your device nodes or do anything
> else at all scary.

Due to a power outage I rebooted today and upgraded the kernel to 2.6.22.6 without SATA_NV. This did, however, affect the device nodes! There are 2x2 connectors on the board; I'll just call them bus1 and bus2 (with 2 SATA ports each) even if they're not buses. Disabling SATA_NV swapped bus1 with bus2, so now I have to boot from /dev/sda instead of /dev/sdc.

Comment 15 Duane Griffin 2007-09-07 23:19:13 UTC

Hmm, that is interesting. At least we know it has an effect, then! Let's see if the problem recurs with that change.