193448 – SATA connection crashed after copying lots of gigabytes, only reboot helps

Status:	RESOLVED CANTFIX

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	x86 Linux

Importance:	High critical
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2007-09-22 19:56 UTC by sECuRE
Modified:	2008-04-16 13:02 UTC
CC List:	0 users

Attachments
dmesg output (fs_hdd_crash, 61.37 KB, text/plain) 2007-09-22 19:57 UTC, sECuRE
kernel configuration (fs_config, 9.58 KB, text/plain) 2007-09-22 22:39 UTC, sECuRE
new kernel configuration (tried CONFIG_SCSI=n, but make did change it) (fs_config_2.6.22.7, 33.13 KB, text/plain) 2007-09-23 12:22 UTC, sECuRE
current kernel configuration (with CONFIG_IDE and CONFIG_SCSI and CONFIG_ATA) (fs_config_2.6.22.7, 9.25 KB, text/plain) 2007-09-23 13:14 UTC, sECuRE
dmesg-output of the new crash (dmesg_crash_041007, 26.33 KB, text/plain) 2007-10-03 23:32 UTC, sECuRE
current kernel configuration (2.6.23) (fs_config_2.6.23, 8.27 KB, text/plain) 2007-10-18 13:41 UTC, sECuRE
Output in /var/log/messages from the crash with 2.6.23 @ 2007/11/24 (msg_crash_24.11.07.txt, 34.53 KB, text/plain) 2007-11-24 00:45 UTC, sECuRE

Description sECuRE 2007-09-22 19:56:56 UTC

After copying about 20 to 30 GB (15 GB were files transferred in one go; the rest was a gzip-compressed backup of another machine that streamed in over 52 minutes, so the load was fairly continuous), I got messages in `dmesg` saying that the SATA link was lost and couldn't be re-established. Re-mounting didn't help; only rebooting did.

System uptime was 15 days.

The dmesg output is attached.

I've had similar problems before, see http://bugs.gentoo.org/show_bug.cgi?id=182606

Hard disk was:
Device: ATA      SAMSUNG HD400LJ  Version: ZZ10
Serial number: S0H2J1WL815176      
Device type: disk
Local Time is: Sat Sep 22 21:59:02 2007 CEST


Reproducible: Sometimes

Steps to Reproduce:
1. Sorry, no idea. Maybe copy much data or something.



`lspci`:
00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev c1)
00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev c1)
00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev c1)
00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev c1)
00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev c1)
00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a4)
00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
00:06.0 Multimedia audio controller: nVidia Corporation nForce2 AC97 Audio Controler (MCP) (rev a1)
00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev c1)
01:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
01:07.0 Ethernet controller: D-Link System Inc DGE-528T Gigabit Ethernet Adapter (rev 10)
01:08.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
01:09.0 Network controller: Cologne Chip Designs GmbH ISDN network controller [HFC-PCI] (rev 02)
02:00.0 VGA compatible controller: nVidia Corporation NV15DDR [GeForce2 Ti] (rev a4)

Comment 1 sECuRE 2007-09-22 19:57:16 UTC

Created attachment 131637
dmesg output

Comment 2 Maarten Bressers (RETIRED) gentoo-dev 2007-09-22 21:17:58 UTC

Can you please post your kernel .config?

> Reproducible: Sometimes
> 
> Steps to Reproduce:
> 1. Sorry, no idea. Maybe copy much data or something.

So does this mean you are able to reproduce this behavior or not? Can you try to reproduce it by copying another 20-30 GB to the disk, to confirm this wasn't a one-time occurrence?

Comment 3 sECuRE 2007-09-22 22:39:08 UTC

(In reply to comment #2)
> Can you please post your kernel .config?
Sure, I've attached it.

> > Reproducible: Sometimes
> > 
> > Steps to Reproduce:
> > 1. Sorry, no idea. Maybe copy much data or something.
> 
> So does this mean you are able to reproduce this behavior or not? Can you try
> to reproduce it by copying another 20-30 GB to the disk, to confirm this wasn't a
> one-time occurrence?
As written, I've had it before with slightly different symptoms. I just re-ran the backup and this time it went through. I've also backed up all the files I had copied to the disk, to avoid losing the data in case the disk itself is at fault (I hope not).

It did not happen again yet, but this behaviour in general is really not what I'd call stable :-\.

Comment 4 sECuRE 2007-09-22 22:39:34 UTC

Created attachment 131653
kernel configuration

Comment 5 Maarten Bressers (RETIRED) gentoo-dev 2007-09-23 00:06:36 UTC

I see you're using both the old IDE subsystem (CONFIG_IDE=y) and the new libata (CONFIG_ATA=y). The earlier bug you linked to also mentions that this may cause trouble.

You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you have SCSI HDDs, or because you use something else that needs SCSI support, like USB mass storage devices?

Can you try to compile a kernel without CONFIG_IDE, and if possible without CONFIG_SCSI, and then try to reproduce this bug?
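
For anyone who wants to check their own .config for the same combination, here is a minimal sketch. It assumes the usual .config layout (`CONFIG_FOO=y` for enabled options, `# CONFIG_FOO is not set` for disabled ones); the function names and the sample text are made up for illustration and are not taken from the attached configuration.

```python
import re


def parse_kernel_config(text):
    """Return a dict mapping CONFIG_* names to their values ('y', 'm', 'n', ...)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options are written as "# CONFIG_FOO is not set".
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = "n"
            continue
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def mixes_ide_and_libata(options):
    """True when both the old IDE layer and libata are built in or modular."""
    enabled = ("y", "m")
    return (options.get("CONFIG_IDE") in enabled
            and options.get("CONFIG_ATA") in enabled)


# Hypothetical excerpt, not the reporter's actual file.
sample = """\
CONFIG_IDE=y
CONFIG_ATA=y
CONFIG_SCSI=y
# CONFIG_SATA_SX4 is not set
"""

print(mixes_ide_and_libata(parse_kernel_config(sample)))
```

To check a real tree, read the file first, e.g. `parse_kernel_config(open("/usr/src/linux/.config").read())`; that path is an assumption about where the kernel sources live.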

Comment 6 sECuRE 2007-09-23 00:26:50 UTC

(In reply to comment #5)
> I see you're using both the old IDE subsystem (CONFIG_IDE=y) as well as the new
> libata (CONFIG_ATA=y). It's also mentioned on the earlier bug you linked to,
> that this may cause trouble. 
Ah, OK. I thought I'd need CONFIG_IDE for CDROM-support, is that wrong?

> You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you
> have SCSI HDDs, or because you use something else that needs SCSI support, like
> USB mass storage devices?
Yes, I'd like to use USB storage devices on that box, but I could leave it off for a while to test.

> Can you try to compile a kernel without CONFIG_IDE, and if possible without
> CONFIG_SCSI, and then try to reproduce this bug?
Will do. Thanks for your hints.

Comment 7 Maarten Bressers (RETIRED) gentoo-dev 2007-09-23 01:24:24 UTC

> Ah, OK. I thought I'd need CONFIG_IDE for CDROM-support, is that wrong?
The new libata can also handle IDE optical devices. These should be detected automatically; if not, try the "libata.atapi_enabled=1" kernel boot parameter. The first optical device will be named /dev/sr0.

> > You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you
> > have SCSI HDDs, or because you use something else that needs SCSI support, like
> > USB mass storage devices?
> Yes, I'd like to use USB storage devices on that box, but I could leave it off
> for a while to test.
Please do so.
 
> > Can you try to compile a kernel without CONFIG_IDE, and if possible without
> > CONFIG_SCSI, and then try to reproduce this bug?
> Will do. Thanks for your hints.
No problem, thank you for helping us determine the cause of this bug.

Comment 8 sECuRE 2007-09-23 12:22:05 UTC

(In reply to comment #7)
> > > You also have SCSI support (CONFIG_SCSI=y) compiled in, is that because you
> > > have SCSI HDDs, or because you use something else that needs SCSI support, like
> > > USB mass storage devices?
> > Yes, I'd like to use USB storage devices on that box, but I could leave it off
> > for a while to test.
> Please do so.
OK, I cannot. It seems like CONFIG_ATA depends on CONFIG_SCSI; I've also noticed that on another box. Without CONFIG_CHR_DEV_SG, there are no /dev/sd* devices.

Is this intended or did I misconfigure something? (New kernel config attached)

Comment 9 sECuRE 2007-09-23 12:22:49 UTC

Created attachment 131684
new kernel configuration (tried CONFIG_SCSI=n, but make did change it)

Comment 10 sECuRE 2007-09-23 12:23:53 UTC

(In reply to comment #8)
> noticed that on another box. Without CONFIG_CHR_DEV_SG, there are no
Oops, I meant CONFIG_BLK_DEV_SD.

Comment 11 Daniel Drake (RETIRED) gentoo-dev 2007-09-23 12:39:34 UTC

Indeed, you need both SCSI and SCSI disk support for libata to work, please ignore that suggestion.

Comment 12 sECuRE 2007-09-23 13:13:39 UTC

(In reply to comment #11)
> Indeed, you need both SCSI and SCSI disk support for libata to work, please
> ignore that suggestion.
OK. I've now installed 2.6.22.7 and noticed that I do need CONFIG_IDE for my IDE hard disks. They aren't detected when I try to use libata (is there an option specific to nForce 2 chipsets? I didn't find one).

Current kernel config is attached.

Comment 13 sECuRE 2007-09-23 13:14:09 UTC

Created attachment 131693
current kernel configuration (with CONFIG_IDE and CONFIG_SCSI and CONFIG_ATA)

Comment 14 Daniel Drake (RETIRED) gentoo-dev 2007-09-24 16:24:06 UTC

Forgot to mention this (twice! sorry)

Your disks will change to /dev/sda (rather than hda).
You also need the PATA_AMD driver.
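
The options named so far for a libata-only setup on this box are CONFIG_ATA, CONFIG_SCSI, CONFIG_BLK_DEV_SD and CONFIG_PATA_AMD. A rough sketch of a check for them follows; the `CONFIG_PATA_AMD` symbol name is assumed from the driver name above, and the helper names and sample text are invented for the example.

```python
# Options the thread says are needed; CONFIG_PATA_AMD is the assumed
# symbol name for the PATA_AMD driver mentioned above.
REQUIRED = ("CONFIG_ATA", "CONFIG_SCSI", "CONFIG_BLK_DEV_SD", "CONFIG_PATA_AMD")


def enabled_options(config_text):
    """Collect the names of options set to 'y' or 'm' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name.startswith("CONFIG_") and value in ("y", "m"):
            enabled.add(name)
    return enabled


def missing_for_libata(config_text, required=REQUIRED):
    """Return the required options that are not enabled, in the order listed."""
    enabled = enabled_options(config_text)
    return [name for name in required if name not in enabled]


# Hypothetical excerpt, not one of the attached configurations.
sample = (
    "CONFIG_ATA=y\n"
    "CONFIG_SCSI=y\n"
    "# CONFIG_BLK_DEV_SD is not set\n"
    "CONFIG_PATA_AMD=m\n"
)

print(missing_for_libata(sample))
```

An empty list means all four options are enabled; it says nothing about whether the drivers actually detect the disks.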

Comment 15 sECuRE 2007-10-03 23:31:12 UTC

It happened again :-(. dmesg output attached. I had just copied 65 GB from this disk onto another one. System uptime was 10 days.

Comment 16 sECuRE 2007-10-03 23:32:09 UTC

Created attachment 132509
dmesg-output of the new crash

Comment 17 sECuRE 2007-10-03 23:42:15 UTC

This could be a bug in the sata_promise driver, as this is the scsi_host which doesn't show its devices anymore, even after a rescan. Any ideas what to try before I reboot?

Comment 18 Maarten Bressers (RETIRED) gentoo-dev 2007-10-06 21:42:48 UTC

As stated before, don't use CONFIG_IDE, it's not needed. Just take note that your HDDs will be called /dev/sdX instead of /dev/hdX.

I see you use both Promise drivers (CONFIG_SATA_PROMISE and CONFIG_SATA_SX4). Do you need them both?

Please test with the latest development kernel, 2.6.23-rc8 as of this writing.

Comment 19 sECuRE 2007-10-18 13:39:07 UTC

(In reply to comment #18)
> As stated before, don't use CONFIG_IDE, it's not needed. Just take note that
> your HDDs will be called /dev/sdX instead of /dev/hdX.
OK, did that.

> I see you use both Promise drivers (CONFIG_SATA_PROMISE and CONFIG_SATA_SX4),
> do you need them both?
No, I attached the new kernel configuration.

> Please test with the latest development kernel, 2.6.23-rc8 as of this writing.
2.6.23 is released now, I've installed it.

Also, the machine crashed before I installed it (while still running 2.6.22.7), reporting "Journal commit error". Unfortunately, I don't have dmesg output or anything else, because it just froze.

Comment 20 sECuRE 2007-10-18 13:41:47 UTC

Created attachment 133753
current kernel configuration (2.6.23)

Comment 21 Maarten Bressers (RETIRED) gentoo-dev 2007-10-24 18:47:03 UTC

So do the same crashes still occur when running kernel 2.6.23? If so, please post a new dmesg output.

Comment 22 Mike Pagano gentoo-dev 2007-11-12 15:04:37 UTC

Have you had a chance to perform the test requested in comment #21 ?

Comment 23 sECuRE 2007-11-12 16:48:17 UTC

(In reply to comment #22)
> Have you had a chance to perform the test requested in comment #21 ?
No, not yet, but if my ordered hard drives arrive this week I'll surely have a chance (backing up everything then) ;-).

Comment 24 sECuRE 2007-11-24 00:45:13 UTC

So, it just happened again. I was copying a file of about 1 GB via SCP and suddenly the machine just hung. I got log output from /var/log/messages after hard-resetting the machine. I'm especially wondering about the exception messages. Could anyone explain the meaning of those to me, please?

See attachment msg_crash_24.11.07.txt

Comment 25 sECuRE 2007-11-24 00:45:52 UTC

Created attachment 136832
Output in /var/log/messages from the crash with 2.6.23 @ 2007/11/24

Comment 26 sECuRE 2007-12-16 00:32:10 UTC

I could "reproduce" it quite often today. I got my new hard disks and another controller (3ware 9xxx, kinda good) which just works fine. I then tried to back up onto this device (> 100 GB, high I/O load on the PCI bus) and my root hard disk (on the Promise controller) was gone.

I have just ordered another cheap controller, this time with the SIL-chipset (sata_sil), as I'm just damn annoyed by this bug. Thanks for your time but it looks like you'll have to find someone else to test possible fixes.

Things I have tried:
- 2.6.24-rc3 (2.6.24-rc2 was supposed to include a fix for the event handlers); it didn't change anything
- Setting the controller to 1.5 Gbps mode instead of 3.0 Gbps via a patch; it didn't change anything either

Comment 27 sECuRE 2007-12-23 10:55:30 UTC

(In reply to comment #26)
> I have just ordered another cheap controller, this time with the SIL-chipset
> (sata_sil), as I'm just damn annoyed by this bug. Thanks for your time but it
> looks like you'll have to find someone else to test possible fixes.
I have now installed the controller and it works better than the Promise. I do get a lot of ext3 errors, though. Maybe the ext3 filesystem is broken because of the earlier unclean shutdowns (= freezes), or maybe the disk is dying.

The interesting thing is this: the SATA exception occurs here as well, but the SIL controller is able to recover:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:18:43:8f:ed/00:00:00:00:00/e0 tag 0 cdb 0x0 data 12288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: configured for UDMA/100
ata1: EH complete
sd 1:0:0:0: [sdb] 145226112 512-byte hardware sectors (74356 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
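
To compare excerpts like this one against the Promise logs, the exception line can be pulled apart mechanically. The following is only a sketch: the regular expression is written against the line format shown above, it extracts the raw Emask/SAct/SErr/action values without interpreting them, and `summarize` is a made-up helper name.

```python
import re

# Matches lines such as:
#   ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
EXCEPTION_RE = re.compile(
    r"^(?P<dev>ata\d+(?:\.\d+)?): exception "
    r"Emask (?P<emask>0x[0-9a-f]+) SAct (?P<sact>0x[0-9a-f]+) "
    r"SErr (?P<serr>0x[0-9a-f]+) action (?P<action>0x[0-9a-f]+)"
    r"(?P<frozen> frozen)?"
)


def summarize(log_text):
    """Collect exception lines and note whether a timeout and 'EH complete' appear."""
    summary = {"exceptions": [], "timeout": False, "eh_complete": False}
    for raw in log_text.splitlines():
        line = raw.strip()
        match = EXCEPTION_RE.match(line)
        if match:
            summary["exceptions"].append({
                "dev": match.group("dev"),
                "emask": int(match.group("emask"), 16),
                "sact": int(match.group("sact"), 16),
                "serr": int(match.group("serr"), 16),
                "action": int(match.group("action"), 16),
                "frozen": match.group("frozen") is not None,
            })
        elif "(timeout)" in line:
            summary["timeout"] = True
        elif line.endswith("EH complete"):
            summary["eh_complete"] = True
    return summary


# Lines copied from the excerpt above.
SAMPLE = """\
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:18:43:8f:ed/00:00:00:00:00/e0 tag 0 cdb 0x0 data 12288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: configured for UDMA/100
ata1: EH complete
"""

print(summarize(SAMPLE))
```

A log where `eh_complete` stays false after an exception would match what happened on the Promise controller, where the link never came back.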

Here are the ext3 errors I get:
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #673528: rec_len % 4 != 0 - offset=0, inode=2497248225, rec_len=38174, name_len=30
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #673528: rec_len % 4 != 0 - offset=0, inode=2497248225, rec_len=38174, name_len=30
... goes on with the same message for about 150 messages ...
EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system zone - blocks from 14516225, length 1
... goes on for about 500 msgs ...
EXT3-fs error (device dm-0): ext3_free_blocks: Freeing blocks in system zones - Block = 14516406, count = 1
... also for ~ 500 msgs ...
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 14519661
... also for ~ 500 msgs ...
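
With several hundred repeats per message, it is easier to count the errors per reporting function than to read them. A small sketch, assuming the "EXT3-fs error (device X): function: ..." layout seen above; the helper names are invented for the example.

```python
import re
from collections import Counter

# Matches the prefix "EXT3-fs error (device dm-0): ext3_add_entry:".
EXT3_RE = re.compile(r"EXT3-fs error \(device (?P<dev>[^)]+)\): (?P<func>\w+):")
# Picks the inode number out of "bad entry in directory #673528".
BAD_DIR_RE = re.compile(r"bad entry in directory #(\d+)")


def count_ext3_errors(log_text):
    """Count error lines per (device, reporting function)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXT3_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("func"))] += 1
    return counts


def bad_directory_inodes(log_text):
    """Return the set of directory inode numbers named in 'bad entry' errors."""
    return {int(number) for number in BAD_DIR_RE.findall(log_text)}


# Lines copied from the messages above.
SAMPLE = """\
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #673528: rec_len % 4 != 0 - offset=0, inode=2497248225, rec_len=38174, name_len=30
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #673528: rec_len % 4 != 0 - offset=0, inode=2497248225, rec_len=38174, name_len=30
EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system zone - blocks from 14516225, length 1
EXT3-fs error (device dm-0): ext3_free_blocks: Freeing blocks in system zones - Block = 14516406, count = 1
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 14519661
"""

for (device, function), count in count_ext3_errors(SAMPLE).most_common():
    print(device, function, count)
print(bad_directory_inodes(SAMPLE))
```

The inode numbers returned by `bad_directory_inodes` are the ones to look up when trying to find the affected directory.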

Checking df, I have 3 GB of unusable space on that disk now:
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/root      69480084  18288684  47661972  28% /
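
The size of the gap follows from the df line above. Note that this is only arithmetic on the reported numbers: ext2/ext3 filesystems created with default mke2fs settings reserve 5% of their blocks for root, and df leaves that reserve out of "Available", so the gap may be the reserve rather than damage. Whether this filesystem uses the default reserve is an assumption; nothing in this report confirms it.

```python
def df_gap_kib(total, used, available):
    """Space (in 1K blocks) that is neither used nor available according to df."""
    return total - used - available


# Values from the df output above, in 1K blocks.
total, used, available = 69480084, 18288684, 47661972

gap = df_gap_kib(total, used, available)
gap_gib = gap / 1024 ** 2          # 1K blocks -> GiB
gap_percent = 100 * gap / total    # share of the whole filesystem

print(gap)                         # 3529428
print(round(gap_gib, 2), "GiB")
print(round(gap_percent, 2), "% of the filesystem")
```

The reserved block count reported by `tune2fs -l` on the device would show whether the reserve accounts for the difference.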

How can I find out which directory was affected? It seems to correlate with a MySQL error I got (table marked as crashed), but I'm not sure about that.

Later I want to put this disk into another computer with native SATA to check its SMART status (SMART doesn't work via sata_sil, see below) and run fsck.

Output of smartctl --all /dev/sdb:
Device: ATA      WDC WD740GD-00FL Version: 31.0
Serial number: WD-WMAKE1865870
Device type: disk
Local Time is: Sun Dec 23 11:54:59 2007 CET
Device does not support SMART

Comment 28 Mike Pagano gentoo-dev 2008-04-16 12:32:57 UTC

My research indicates you might have a corrupted file system. I've also read that you can find the affected directory in the following manner.

You can run debugfs on the mounted filesystem and give the command:

debugfs: ls <673528>

You can figure out its pathname by doing a:

debugfs: cd <673528>
debugfs: pwd
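
The same lookup can also be done non-interactively with debugfs's `-R` option and its `ncheck` request, which prints the path names for an inode number. A minimal sketch follows; the device path is the one from the df output earlier in this bug, the function names are made up, and actually running it needs read access to the block device (usually root).

```python
import subprocess


def debugfs_ncheck_command(device, inode):
    """Build the argument list for a one-shot 'debugfs -R "ncheck <inode>" <device>'."""
    return ["debugfs", "-R", f"ncheck {int(inode)}", device]


def inode_to_paths(device, inode):
    """Run debugfs and return its raw output; raises if debugfs exits non-zero."""
    result = subprocess.run(
        debugfs_ncheck_command(device, inode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Only prints the command; call inode_to_paths() yourself to execute it.
print(debugfs_ncheck_command("/dev/mapper/root", 673528))
```

debugfs opens the filesystem read-only unless it is given `-w`, so the lookup itself does not modify anything.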

Let us know the results if you don't mind.

debugfs is provided by sys-fs/e2fsprogs:

root # qfile debugfs
sys-fs/e2fsprogs (/sbin/debugfs)

Comment 29 sECuRE 2008-04-16 12:42:08 UTC

(In reply to comment #28)
> My research indicates you might have a corrupted file system. I've also read
> that you can find the affected directory in the following manner.
Thanks for your explanation of the debugfs-command.

Are you definitely sure that the filesystem can be corrupted when an fsck does not show any errors?

In the meantime, I have swapped this controller and I'm now using SATA_MV (experimental, but working without exceptions at all, so far). So, sorry, can't try it out anymore ;).

Comment 30 Mike Pagano gentoo-dev 2008-04-16 13:02:56 UTC

(In reply to comment #29)
> Are you definitely sure that the filesystem can be corrupted when an fsck does
> not show any errors?

No, that's why I used the *might* in there. :)

OK, I'm going to close this as CANTFIX since the hardware doesn't exist anymore.  Sorry we could not resolve this better, it was kinda tricky.