Summary: | SCSI error: return code = 0x00040000 -> sata-drive isn't working, rootfs gone | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | sECuRE <sECuRE> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED NEEDINFO | ||
Severity: | critical | CC: | duaneg |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | x86 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: |
current dmesg-output
current kernel configuration |
Description
sECuRE
2007-06-19 19:48:39 UTC
> lost page write due to I/O error
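For anyone decoding the return code in the summary, here is a minimal sketch. It assumes the Linux SCSI mid-layer's usual packing of the 32-bit result word (status, message, host and driver bytes, from low to high) and the DID_* host-byte names from the kernel's include/scsi/scsi.h. The function name `decode_scsi_result` and the lookup table are illustrative only and are not part of any tool mentioned in this report.

```python
# Host-byte codes, assumed to match the DID_* constants in include/scsi/scsi.h.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
    0x0A: "DID_PASSTHROUGH",
    0x0B: "DID_SOFT_ERROR",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    host = (result >> 16) & 0xFF
    return {
        "scsi_status": result & 0xFF,         # raw low byte (SCSI status)
        "msg_byte": (result >> 8) & 0xFF,     # message byte
        "host_byte": host,                    # set by the host adapter / mid-layer
        "driver_byte": (result >> 24) & 0xFF, # set by the upper-level driver
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


decoded = decode_scsi_result(0x00040000)
```

Under these assumptions, 0x00040000 has only the host byte set (0x04, DID_BAD_TARGET). That would suggest the kernel had stopped treating the device as reachable rather than the drive itself returning a SCSI status, but it does not by itself say whether hardware or driver was at fault.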
So; how exactly is something that appears as a clear hardware error a Gentoo-specific issue?
(In reply to comment #1)
> So; how exactly is something that appears as a clear hardware error a
> Gentoo-specific issue?
>

As I said, I think it's a software error, because the controller is OK and the hard disk as well. Can you exclude that it's a kernel bug in the SATA driver?

Could you please post the entire dmesg following the failure, if you still have it. If not then a full dmesg from the system now would be better than nothing.

Also, please post your kernel config.

It might also be useful to get a SMART report for the hard drive. If you haven't already, emerge sys-apps/smartmontools, then post the output from:

/usr/sbin/smartctl -data --all /dev/sdc

It might be worth doing a full SMART self-test on the drive while you're about it.

(In reply to comment #3)
> Could you please post the entire dmesg following the failure, if you still have
> it. If not then a full dmesg from the system now would be better than nothing.

Unfortunately, I don't have the one from the crash, but here is the current one:

> Also, please post your kernel config.

I've attached it.

> /usr/sbin/smartctl -data --all /dev/sdc

# smartctl -data --all /dev/sdc
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Raptor family
Device Model:     WDC WD740GD-00FLA2
Serial Number:    WD-WMAKE1865870
Firmware Version: 31.08F31
User Capacity:    74,355,769,344 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jun 27 19:10:24 2007 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (1725) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   120   119   021    Pre-fail  Always       -       4533
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       474
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000b   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       16049
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       459
194 Temperature_Celsius     0x0022   117   095   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0009   200   179   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

> It might be worth doing a full SMART self-test on the drive while you're about
> it.

I'm not sure if "long" means full, but it was the closest to it (of all the options listed in smartctl -h for -t). Unfortunately, the device does not seem to support logging the self-tests?!

Created attachment 123225 [details]
current dmesg-output
Created attachment 123226 [details]
current kernel configuration
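The attribute table in the smartctl output above can also be checked mechanically. The following is a minimal sketch, assuming the ten-column layout that smartctl 5.36 printed for this drive (ID#, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw value) and the usual reading that a normalized VALUE at or below a non-zero THRESH marks a failing attribute. The function names and the three sample rows (copied from the report) are for illustration only.

```python
def parse_smart_attributes(text):
    """Extract attribute rows from 'smartctl --all' style output."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # headers and the selective self-test table are skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized value, e.g. "079" -> 79
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],         # Pre-fail or Old_age
            "raw": fields[9],          # first token of the raw value
        })
    return attrs


def failing_attributes(attrs):
    """Names of attributes whose normalized value has reached the threshold."""
    return [a["name"] for a in attrs
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       16049
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
"""

attrs = parse_smart_attributes(SAMPLE)
problems = failing_attributes(attrs)
```

Attributes with a threshold of 000 are informational and are never reported as failing by this check, so raw counters such as Reallocated_Sector_Ct and UDMA_CRC_Error_Count are still worth reading by eye.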
Your SMART attributes all seem fine, nothing scary there. I'd recommend you look into getting the SMART daemon set up and running, just as a matter of general good practice (especially considering how many disks you have in that thing!), but that isn't relevant to our purpose here.

I see you are using an IDE driver for the nForce controller and an ATA driver for the Promise controller. There shouldn't be anything wrong with that, but you may want to consider switching to using the new ATA drivers for your nForce as well. However I doubt that this has anything to do with the bug you are seeing, and it will mean your drive device names will all be changed.

There isn't too much more that I can think of to try at this point, without the full dmesg when the disk fails. You may want to consider setting up netconsole so you can capture the error when it occurs. This is described in Documentation/networking/netconsole.txt under your kernel source tree.

Another thing to check is whether you have a large enough power supply for your hardware. The reports I've found of similar errors were caused by the SATA port not properly handling a reset while trying to recover from an error. In one report the root cause was down to an under-powered power supply that occasionally couldn't keep up. This caused a transient disk failure, the error recovery failed in turn, and they ended up in the sort of situation that you are seeing.

(In reply to comment #7)
> Your SMART attributes all seem fine, nothing scary there. I'd recommend you
> look into getting the SMART daemon set up and running, just as a matter of
> general good practice (especially considering how many disks you have in that
> thing!), but that isn't relevant to our purpose here.

Thanks, I've enabled it now.

> I see you are using an IDE driver for the nForce controller and an ATA driver
> for the Promise controller. There shouldn't be anything wrong with that, but
> you may want to consider switching to using the new ATA drivers for your nForce
> as well. However I doubt that this has anything to do with the bug you are
> seeing, and it will mean your drive device names will all be changed.

Ah, I thought I'd need the nForce driver anyway for other features of this board? Seems like I'm wrong. Which parameters would need to be changed for this exactly (I'll enable it when I have to do some maintenance in that area anyway)?

> There isn't too much more that I can think of to try at this point, without the
> full dmesg when the disk fails. You may want to consider setting up netconsole
> so you can capture the error when it occurs. This is described in
> Documentation/networking/netconsole.txt under your kernel source tree.

I've also enabled it, however it will only be active after the next reboot, as there's no other configuration possibility than using a boot-parameter (I've built it in as you can see in my kernel config), right?

> Another thing to check is whether you have a large enough power supply for your
> hardware. The reports I've found of similar errors were caused by the SATA port
> not properly handling a reset while trying to recover from an error. In one
> report the root cause was down to an under-powered power supply that
> occasionally couldn't keep up. This caused a transient disk failure, the error
> recovery failed in turn, and they ended up in the sort of situation that you are
> seeing.

That's an interesting point. However, the system has a 550W BeQuiet! power supply and this should be enough. Also, the system worked fine for several months using Debian (no hardware changes).

Thank you for all the tips you gave :-).

(In reply to comment #8)
> Ah, I thought I'd need the nForce driver anyway for other features of this
> board? Seems like I'm wrong.

The non-[SP]ATA functions of the chipset shouldn't be affected by the choice of IDE or libata driver. As far as I know the libata nv driver is fully functional. Of course, libata PATA support is still marked as experimental. It *should* be safe to use, though. You have backups, right? ;)

> Which parameters would need to be changed for this
> exactly (I'll enable it when I have to do some maintenance in that area
> anyway)?

You already have the libata drivers compiled in, along with the IDE drivers. If you simply deselect IDE (the "ATA/ATAPI/MFM/RLL support" config option) then the libata nv driver should take over driving your hardware.

This will mean that your disk device names will change. My guess would be that what is currently hd[abc] would be renamed to sd[abc] and what is currently sd[abc] would then be sd[def]. But it might be the other way around. Anywhere you use the device name will have to be changed, e.g. /etc/fstab, your grub & SMART configs, etc. If you are using disk labels or IDs instead of the device name then you won't need to worry. You can also use udev rules to set specific disks to specific device names.

It all sounds a bit scary but if you are comfortable booting off CD, mounting your system disk, and editing files until you get it right, then you should be fine. Oh, and did I mention backups already? ;) Seriously, don't embark on this unless you are comfortable with it.

> I've also enabled it, however it will only be active after the next reboot, as
> there's no other configuration possibility than using a boot-parameter (I've
> built it in as you can see in my kernel config), right?

Yep, that's right.

> That's an interesting point. However, the system has a 550W BeQuiet! power
> supply and this should be enough. Also, the system worked fine for several
> months using Debian (no hardware changes).

The smooth operation under Debian does argue against a hardware issue, although these things can be rather subtle. For example, if all your disks spin up together they will draw a much higher load than if they spin up one-by-one (or stay spun up). Changing the OS/distro could change the disk access pattern and make this more likely. However, this is just idle speculation at this point.

If you wanted to investigate this sort of thing you could try stressing your disks. E.g. use a script to spin down then start a benchmark on all of them simultaneously (with hdparm), inside a loop, continuously for a few hours. I'm not sure it would be worth the effort though. A negative result wouldn't be definitive anyway.

> Thank you for all the tips you gave :-).

No worries! If you manage to get a full dmesg out then update the bug with it. That might help us figure out the root cause.

FYI: I've just noticed a post from Alan Cox on LKML where he mentioned that "there are some cases where trying to load both old and new IDE support for the same chip will do strange things."

I suggest you try disabling libata support for your nForce controller (SATA_NV=n). Note that this will NOT affect your device nodes or do anything else at all scary. I'd be interested in hearing whether this helped, although given how infrequent the problem was it might be hard to tell.

(In reply to comment #10)
> FYI: I've just noticed a post from Alan Cox on LKML where he mentioned that
> "there are some cases where trying to load both old and new IDE support for the
> same chip will do strange things."
>
> I suggest you try disabling libata support for your nForce controller
> (SATA_NV=n). Note that this will NOT affect your device nodes or do anything
> else at all scary. I'd be interested in hearing whether this helped, although
> given how infrequent the problem was it might be hard to tell.

Thanks for the hint, I've disabled it and will let you know if it still happens after the next reboot (this may take some time..)

Please also make sure you are running 2.6.22 for new libata EH. Next time this happens, please attach the full dmesg from that session, otherwise we have to guess...

Please reopen if you see this again.
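The disk stress test suggested in the reply to comment #8 (spin every disk down, then benchmark them all at once, in a loop for a few hours) could be scripted roughly as follows. This is only a sketch under stated assumptions: it relies on `hdparm -y` to put a drive into standby and `hdparm -t` to run a read-timing benchmark, it must be run as root, and the function names, the 10-second settle time and any device paths passed in are placeholders rather than values from this report.

```python
import subprocess
import time


def spin_down_cmds(devices):
    """hdparm -y asks each drive to enter standby (usually spinning it down)."""
    return [["hdparm", "-y", dev] for dev in devices]


def benchmark_cmds(devices):
    """hdparm -t performs a timed device read on each drive."""
    return [["hdparm", "-t", dev] for dev in devices]


def stress_once(devices, settle_seconds=10.0):
    """Spin all drives down, then wake them together with parallel reads."""
    for cmd in spin_down_cmds(devices):
        subprocess.run(cmd, check=True)
    time.sleep(settle_seconds)  # give the drives time to actually stop
    # Start every benchmark before waiting on any, so spin-up is simultaneous.
    procs = [subprocess.Popen(cmd) for cmd in benchmark_cmds(devices)]
    return [p.wait() for p in procs]


def stress(devices, hours=2.0):
    """Repeat the spin-down / parallel-read cycle until the time is up."""
    deadline = time.monotonic() + hours * 3600
    cycles = 0
    while time.monotonic() < deadline:
        stress_once(devices)
        cycles += 1
    return cycles
```

Calling `stress(["/dev/sda", "/dev/sdb", "/dev/sdc"], hours=3)` with the real device names would run the loop. As noted in that comment, a clean run would not rule out a marginal power supply.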
(In reply to comment #10)
> (SATA_NV=n). Note that this will NOT affect your device nodes or do anything
> else at all scary.

Due to a power outage I've rebooted today and upgraded the kernel to 2.6.22.6 without SATA_NV. This did however affect the device nodes! There are 2x2 connectors on the board, I'll just call them bus1 and bus2 (with 2 SATA ports each) even if they're not busses. Disabling SATA_NV switched bus1 with bus2, so now I have to boot from /dev/sda instead of /dev/sdc.

Hmm, that is interesting. At least we know it has an effect, then! Let's see if the problem recurs with that change.