After copying about 20 to 30 GB, I got messages in `dmesg` saying that the SATA link was lost and couldn't be re-established. About 15 GB of that were files transferred in one go; the rest was a gzip-compressed backup of another machine that streamed in over 52 minutes, so it was a fairly continuous load. Re-mounting didn't help, only rebooting. System uptime was 15 days. The dmesg output is attached.

I've had similar problems before, see http://bugs.gentoo.org/show_bug.cgi?id=182606

Hard disk was:

Device: ATA SAMSUNG HD400LJ Version: ZZ10
Serial number: S0H2J1WL815176
Device type: disk
Local Time is: Sat Sep 22 21:59:02 2007 CEST

Reproducible: Sometimes

Steps to Reproduce:
1. Sorry, no idea. Maybe copy much data or something.

`lspci`:

00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev c1)
00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev c1)
00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev c1)
00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev c1)
00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev c1)
00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a4)
00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
00:06.0 Multimedia audio controller: nVidia Corporation nForce2 AC97 Audio Controler (MCP) (rev a1)
00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev c1)
01:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
01:07.0 Ethernet controller: D-Link System Inc DGE-528T Gigabit Ethernet Adapter (rev 10)
01:08.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
01:09.0 Network controller: Cologne Chip Designs GmbH ISDN network controller [HFC-PCI] (rev 02)
02:00.0 VGA compatible controller: nVidia Corporation NV15DDR [GeForce2 Ti] (rev a4)
Created attachment 131637 [details] dmesg output
Can you please post your kernel .config?

> Reproducible: Sometimes
>
> Steps to Reproduce:
> 1. Sorry, no idea. Maybe copy much data or something.

So does this mean you are able to reproduce this behavior or not? Can you try to reproduce it by copying another 20-30 GB to the disk, to confirm this wasn't a one-time occurrence?
(In reply to comment #2)
> Can you please post your kernel .config?

Sure, I've attached it.

> > Reproducible: Sometimes
> >
> > Steps to Reproduce:
> > 1. Sorry, no idea. Maybe copy much data or something.
>
> So does this mean you are able to reproduce this behavior or not? Can you try
> to reproduce, by copying another 20-30GB to the disk, to confirm this wasn't a
> one-time occurrence?

As written, I've had it before with slightly different symptoms. I just re-ran the backup and this time it went through. I've also backed up all the files I had previously copied to that disk, to avoid losing data in case the disk itself is at fault (I hope not). It has not happened again yet, but this behaviour in general is really not what I'd call stable :-\.
Created attachment 131653 [details] kernel configuration
I see you're using both the old IDE subsystem (CONFIG_IDE=y) as well as the new libata (CONFIG_ATA=y). It's also mentioned on the earlier bug you linked to, that this may cause trouble. You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you have SCSI HDDs, or because you use something else that needs SCSI support, like USB mass storage devices? Can you try to compile a kernel without CONFIG_IDE, and if possible without CONFIG_SCSI, and then try to reproduce this bug?
(In reply to comment #5)
> I see you're using both the old IDE subsystem (CONFIG_IDE=y) as well as the new
> libata (CONFIG_ATA=y). It's also mentioned on the earlier bug you linked to,
> that this may cause trouble.

Ah, OK. I thought I'd need CONFIG_IDE for CDROM-support, is that wrong?

> You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you
> have SCSI HDDs, or because you use something else that needs SCSI support, like
> USB mass storage devices?

Yes, I'd like to use USB storage devices on that box, but I could leave it off for a while to test.

> Can you try to compile a kernel without CONFIG_IDE, and if possible without
> CONFIG_SCSI, and then try to reproduce this bug?

Will do. Thanks for your hints.
> Ah, OK. I thought I'd need CONFIG_IDE for CDROM-support, is that wrong?

The new libata can also handle IDE optical devices. These should be detected automatically; if not, try the "libata.atapi_enabled=1" kernel boot parameter. The first optical device will be named /dev/sr0.

> > You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you
> > have SCSI HDDs, or because you use something else that needs SCSI support, like
> > USB mass storage devices?
> Yes, I'd like to use USB storage devices on that box, but I could leave it off
> for a while to test.

Please do so.

> > Can you try to compile a kernel without CONFIG_IDE, and if possible without
> > CONFIG_SCSI, and then try to reproduce this bug?
> Will do. Thanks for your hints.

No problem, thank you for helping us determine the cause of this bug.
(In reply to comment #7)
> > > You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you
> > > have SCSI HDDs, or because you use something else that needs SCSI support, like
> > > USB mass storage devices?
> > Yes, I'd like to use USB storage devices on that box, but I could leave it off
> > for a while to test.
> Please do so.

OK, I cannot. It seems like CONFIG_ATA depends on CONFIG_SCSI; I've also noticed that on another box. Without CONFIG_CHR_DEV_SG, there are no /dev/sd* devices. Is this intended or did I misconfigure something? (New kernel config attached)
Created attachment 131684 [details] new kernel configuration (tried CONFIG_SCSI=n, but make did change it)
(In reply to comment #8)
> noticed that on another box. Without CONFIG_CHR_DEV_SG, there are no

Oops, I meant CONFIG_BLK_DEV_SD.
Indeed, you need both SCSI and SCSI disk support for libata to work, please ignore that suggestion.
(In reply to comment #11)
> Indeed, you need both SCSI and SCSI disk support for libata to work, please
> ignore that suggestion.

OK. I've now installed 2.6.22.7 and noticed that I do need CONFIG_IDE for my IDE hard disks. They aren't detected when I try to use libata. Is there an option specific to nForce 2 chipsets? I didn't find one. Current kernel config is attached.
Created attachment 131693 [details] current kernel configuration (with CONFIG_IDE and CONFIG_SCSI and CONFIG_ATA)
Forgot to mention this (twice! sorry): your disks will change to /dev/sda (rather than hda). You also need the PATA_AMD driver.
It happened again :-(. dmesg output attached. I had just copied 65 GB from this disk onto another one. System uptime was 10 days.
Created attachment 132509 [details] dmesg-output of the new crash
This could be a bug in the sata_promise driver, as that is the scsi_host which no longer shows its devices, even after a rescan. Any ideas what to try before I reboot?
As stated before, don't use CONFIG_IDE, it's not needed. Just take note that your HDDs will be called /dev/sdX instead of /dev/hdX. I see you use both Promise drivers (CONFIG_SATA_PROMISE and CONFIG_SATA_SX4), do you need them both? Please test with the latest development kernel, 2.6.23-rc8 as of this writing.
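The configuration advice given so far in this thread (do not combine CONFIG_IDE with CONFIG_ATA, keep CONFIG_SCSI and CONFIG_BLK_DEV_SD for libata disks, and avoid building both Promise drivers unless both are needed) can be checked mechanically against a .config. The following is only a rough Python sketch: the function names are hypothetical, and the rules are just the ones stated in the comments above, not a complete list of libata requirements.

```python
def parse_kconfig(text):
    """Return {option: value} for each 'CONFIG_X=value' line of a kernel .config.

    Lines such as '# CONFIG_X is not set' are comments, so unset options
    are simply absent from the result.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options[name] = value
    return options


def libata_config_warnings(options):
    """Apply the rules mentioned in this thread (hypothetical helper)."""
    def enabled(name):
        # Built-in (y) or module (m) both count as enabled.
        return options.get(name) in ("y", "m")

    warnings = []
    if enabled("CONFIG_IDE") and enabled("CONFIG_ATA"):
        warnings.append("CONFIG_IDE and CONFIG_ATA are both enabled")
    if enabled("CONFIG_ATA"):
        for required in ("CONFIG_SCSI", "CONFIG_BLK_DEV_SD"):
            if not enabled(required):
                warnings.append(required + " is needed for libata disks")
    if enabled("CONFIG_SATA_PROMISE") and enabled("CONFIG_SATA_SX4"):
        warnings.append("both Promise drivers are enabled; check whether both are needed")
    return warnings
```

Usage would be along the lines of `libata_config_warnings(parse_kconfig(open(".config").read()))`, where the path to the .config is whatever kernel tree is being built.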
(In reply to comment #18)
> As stated before, don't use CONFIG_IDE, it's not needed. Just take note that
> your HDDs will be called /dev/sdX instead of /dev/hdX.

OK, did that.

> I see you use both Promise drivers (CONFIG_SATA_PROMISE and CONFIG_SATA_SX4),
> do you need them both?

No. I've attached the new kernel configuration.

> Please test with the latest development kernel, 2.6.23-rc8 as of this writing.

2.6.23 is released now; I've installed it. Also, the machine crashed before I installed it (while running 2.6.22.7), stating "Journal commit error". Unfortunately, I don't have dmesg output or anything, because it just froze.
Created attachment 133753 [details] current kernel configuration (2.6.23)
So do the same crashes still occur when running kernel 2.6.23? If so, please post a new dmesg output.
Have you had a chance to perform the test requested in comment #21 ?
(In reply to comment #22)
> Have you had a chance to perform the test requested in comment #21 ?

No, not yet, but if my ordered hard drives arrive this week I'll surely have a chance (backing up everything then) ;-).
So, it just happened again. I was copying a file of about 1 GB via SCP and suddenly the machine just hung. I got log output from /var/log/messages after hard-resetting the machine. I'm especially wondering about the exception messages. Could anyone explain the meaning of those to me, please? See attachment msg_crash_24.11.07.txt
Created attachment 136832 [details] Output in /var/log/messages from the crash with 2.6.23 @ 2007/11/24
I could "reproduce" it quite often today. I got my new hard disks and another controller (3ware 9xxx, kinda good), which just works fine. I then tried to back up onto this device (> 100 GB, high I/O load on the PCI bus) and my root hard disk, which is on the Promise controller, was gone.

I have just ordered another cheap controller, this time with the SIL-chipset (sata_sil), as I'm just damn annoyed by this bug. Thanks for your time but it looks like you'll have to find someone else to test possible fixes.

Things I have tried:
- 2.6.24-rc3 (as 2.6.24-rc2 should have some fix for event handlers); didn't change anything
- Setting the controller into 1.5 Gbps mode instead of 3.0 Gbps via a patch; didn't change anything either
(In reply to comment #26)
> I have just ordered another cheap controller, this time with the SIL-chipset
> (sata_sil), as I'm just damn annoyed by this bug. Thanks for your time but it
> looks like you'll have to find someone else to test possible fixes.

I have now installed the controller and it works better than the Promise. I do get a lot of ext3 errors, though. Maybe this is a broken ext3 filesystem caused by the earlier unclean shutdowns (= freezes), or maybe the disk is dying. The interesting thing is that the SATA exception occurs here as well, but the SIL is able to recover:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:18:43:8f:ed/00:00:00:00:00/e0 tag 0 cdb 0x0 data 12288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: configured for UDMA/100
ata1: EH complete
sd 1:0:0:0: [sdb] 145226112 512-byte hardware sectors (74356 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Here are the ext3 errors I get:

EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #673528: rec_len % 4 != 0 - offset=0, inode=2497248225, rec_len=38174, name_len=30
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #673528: rec_len % 4 != 0 - offset=0, inode=2497248225, rec_len=38174, name_len=30
... goes on with the same message for about 150 messages ...
EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system zone - blocks from 14516225, length 1
... goes on for about 500 msgs ...
EXT3-fs error (device dm-0): ext3_free_blocks: Freeing blocks in system zones - Block = 14516406, count = 1
... also for ~ 500 msgs ...
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 14519661
... also for ~ 500 msgs ...

Checking df, I have 3 GB of unusable space on that disk now:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/root      69480084  18288684  47661972  28% /

How can I find out which directory was affected? It seems to correlate with a MySQL error I got (table marked as crashed), but I'm not sure about that. I want to put this disk into another computer with native SATA later, to check its SMART status (using SMART via sata_sil doesn't work, see below) and run an fsck.

Output of smartctl --all /dev/sdb:

Device: ATA WDC WD740GD-00FL Version: 31.0
Serial number: WD-WMAKE1865870
Device type: disk
Local Time is: Sun Dec 23 11:54:59 2007 CET
Device does not support SMART
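Since the log excerpts above are abbreviated by hand ("goes on for about 500 msgs"), a small script can give exact counts per error type and per affected directory inode. This is a minimal Python sketch, assuming the kernel messages have the same "EXT3-fs error (device ...): function: ..." shape as the lines quoted in this comment; the function name `summarise_ext3_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #673528: ...
EXT3_ERROR = re.compile(r"EXT3-fs error \(device (?P<dev>[^)]+)\): (?P<func>\w+):")
DIR_INODE = re.compile(r"in directory #(?P<inode>\d+)")


def summarise_ext3_errors(lines):
    """Count EXT3-fs errors per (device, function) and per directory inode."""
    by_func = Counter()
    by_dir = Counter()
    for line in lines:
        match = EXT3_ERROR.search(line)
        if match is None:
            continue  # not an ext3 error line (e.g. libata messages)
        by_func[(match.group("dev"), match.group("func"))] += 1
        directory = DIR_INODE.search(line)
        if directory is not None:
            by_dir[int(directory.group("inode"))] += 1
    return by_func, by_dir
```

It could be fed from a log file, for example `summarise_ext3_errors(open("/var/log/messages"))`. The directory inode numbers it reports are the ones that can then be looked up on the filesystem.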
My research indicates you might have a corrupted file system. I've also read that you can find the affected directory in the following manner. You can run debugfs on the mounted filesystem and give the command:

debugfs: ls <673528>

You can figure out its pathname by doing a:

debugfs: cd <673528>
debugfs: pwd

Let us know the results if you don't mind.

root # qfile debugfs
sys-fs/e2fsprogs (/sbin/debugfs)
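The same lookup can probably be done non-interactively with debugfs's `-R` option and its `ncheck` request, which maps inode numbers to pathnames. The sketch below is Python and makes several assumptions: that /dev/mapper/root (from the df output) is the device behind dm-0, that `ncheck` prints a tab-separated "Inode / Pathname" listing, and that the helper names are purely illustrative. debugfs opens the filesystem read-only unless `-w` is given.

```python
import subprocess


def debugfs_ncheck_command(device, inode):
    """Build the argument list for 'debugfs -R "ncheck <inode>" <device>'."""
    return ["debugfs", "-R", "ncheck %d" % int(inode), device]


def parse_ncheck_output(text):
    """Parse tab-separated 'inode<TAB>path' lines (assumed format).

    The header line and anything else without a numeric first column is skipped.
    """
    paths = {}
    for line in text.splitlines():
        inode, sep, path = line.partition("\t")
        if sep and inode.strip().isdigit():
            paths.setdefault(int(inode), []).append(path)
    return paths


def inode_to_paths(device, inode):
    """Run debugfs (needs read access to the device) and return the paths found."""
    result = subprocess.run(
        debugfs_ncheck_command(device, inode),
        capture_output=True, text=True, check=True,
    )
    return parse_ncheck_output(result.stdout).get(int(inode), [])
```

For the error quoted in this bug, the call would be something like `inode_to_paths("/dev/mapper/root", 673528)`, run as root.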
(In reply to comment #28)
> My research indicates you might have a corrupted file system. I've also read
> that you can find the affected directory in the following manner.

Thanks for your explanation of the debugfs command. Are you definitely sure that the filesystem can be corrupted when an fsck does not show any errors?

In the meantime, I have swapped this controller and I'm now using SATA_MV (experimental, but working without any exceptions so far). So, sorry, I can't try it out anymore ;).
(In reply to comment #29)
> Are you definitely sure that the filesystem can be corrupted when an fsck does
> not show any errors?

No, that's why I used the *might* in there. :)

OK, I'm going to close this as CANTFIX since the hardware doesn't exist anymore. Sorry we could not resolve this better, it was kinda tricky.