144657 – sys-apps/smartmontools - smartd can lead to failures with software raid-5

Status:	RESOLVED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High critical
Assignee:	Gentoo's Team for Core System packages

Reported:	2006-08-21 09:35 UTC by georg.lippold
Modified:	2006-08-28 00:31 UTC
CC List:	0 users

Attachments
hdparm -I /dev/sd{a,b,c,d} (hdparm-I_dev_sdx, 6.80 KB, text/plain) 2006-08-21 12:39 UTC, georg.lippold
/etc/smartd.conf (smartd.conf, 5.07 KB, text/plain) 2006-08-21 12:40 UTC, georg.lippold

Description georg.lippold 2006-08-21 09:35:23 UTC

I run a software RAID-5 on a Promise SATA 300 TX-4 Serial ATA controller, with the sata_promise driver statically linked into the kernel and four Seagate 300 GB SATA disks attached. When I use

/etc/init.d/smartd start

I always get the following errors while rebuilding the raid array:

Aug 21 18:10:46 backup kernel: ata1: PIO error
Aug 21 18:10:46 backup kernel: ata1: status=0x50 { DriveReady SeekComplete }
Aug 21 18:10:46 backup kernel: ata1: PIO error
Aug 21 18:10:46 backup kernel: ata1: status=0x50 { DriveReady SeekComplete }
Aug 21 18:10:46 backup kernel: ata2: PIO error
Aug 21 18:10:46 backup kernel: ata2: status=0x50 { DriveReady SeekComplete }
Aug 21 18:10:47 backup kernel: ata2: PIO error
Aug 21 18:10:47 backup kernel: ata2: status=0x50 { DriveReady SeekComplete }
Aug 21 18:10:47 backup kernel: ata3: PIO error
Aug 21 18:10:47 backup kernel: ata3: status=0x50 { DriveReady SeekComplete }
Aug 21 18:10:47 backup kernel: ata3: PIO error
Aug 21 18:10:47 backup kernel: ata3: status=0x50 { DriveReady SeekComplete }
Aug 21 18:10:47 backup kernel: ata4: PIO error
Aug 21 18:10:47 backup kernel: ata4: status=0x50 { DriveReady SeekComplete }
Aug 21 18:10:48 backup kernel: ata4: PIO error
Aug 21 18:10:48 backup kernel: ata4: status=0x50 { DriveReady SeekComplete }

Sometimes, even worse things happen:

Aug 21 17:46:15 backup kernel: ata1: PIO error
Aug 21 17:46:15 backup kernel: ata1: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:20 backup kernel: ata1: PIO error
Aug 21 17:46:20 backup kernel: ata1: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:21 backup kernel: ata2: PIO error
Aug 21 17:46:21 backup kernel: ata2: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:22 backup kernel: ata2: PIO error
Aug 21 17:46:22 backup kernel: ata2: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:22 backup kernel: ata3: PIO error
Aug 21 17:46:22 backup kernel: ata3: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:22 backup kernel: ata3: PIO error
Aug 21 17:46:22 backup kernel: ata3: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:22 backup kernel: ata4: PIO error
Aug 21 17:46:22 backup kernel: ata4: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:22 backup kernel: ata4: PIO error
Aug 21 17:46:22 backup kernel: ata4: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:33 backup kernel: ata1: status=0xff { Busy }
Aug 21 17:47:08 backup kernel: ata1: status=0xd0 { Busy }
Aug 21 17:47:38 backup kernel: ata1: status=0xff { Busy }
Aug 21 17:47:38 backup kernel: sd 0:0:0:0: SCSI error: return code = 0x8000002
Aug 21 17:47:38 backup kernel: sda: Current: sense key: Aborted Command
Aug 21 17:47:38 backup kernel:     Additional sense: Scsi parity error
Aug 21 17:47:38 backup kernel: end_request: I/O error, dev sda, sector 2709447
Aug 21 17:48:08 backup kernel: ata1: status=0xff { Busy }
Aug 21 17:48:08 backup kernel: sd 0:0:0:0: SCSI error: return code = 0x8000002
Aug 21 17:48:08 backup kernel: sda: Current: sense key: Aborted Command
Aug 21 17:48:08 backup kernel:     Additional sense: Scsi parity error
Aug 21 17:48:08 backup kernel: end_request: I/O error, dev sda, sector 2709455
Aug 21 17:48:38 backup kernel: ata1: status=0xff { Busy }
Aug 21 17:48:38 backup kernel: sd 0:0:0:0: SCSI error: return code = 0x8000002
Aug 21 17:48:38 backup kernel: sda: Current: sense key: Aborted Command
Aug 21 17:48:38 backup kernel:     Additional sense: Scsi parity error
Aug 21 17:48:38 backup kernel: end_request: I/O error, dev sda, sector 2709463
Aug 21 17:49:08 backup kernel: ata1: status=0xff { Busy }
Aug 21 17:49:08 backup kernel: sd 0:0:0:0: SCSI error: return code = 0x8000002
Aug 21 17:49:08 backup kernel: sda: Current: sense key: Aborted Command
Aug 21 17:49:08 backup kernel:     Additional sense: Scsi parity error
Aug 21 17:49:08 backup kernel: end_request: I/O error, dev sda, sector 2709471
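To make patterns in a log like this easier to see, here is a small sketch that tallies PIO errors per ATA port and collects the failing sectors per block device. It assumes syslog lines in exactly the format shown above; the function name `summarize` and the shortened `SAMPLE` list are illustrative only.

```python
import re
from collections import Counter

# Patterns for the two kinds of kernel messages of interest.
PIO_RE = re.compile(r"kernel: (ata\d+): PIO error")
IO_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize(lines):
    """Count PIO errors per port and list failing sectors per device."""
    pio = Counter()
    io_errors = {}
    for line in lines:
        m = PIO_RE.search(line)
        if m:
            pio[m.group(1)] += 1
            continue
        m = IO_RE.search(line)
        if m:
            io_errors.setdefault(m.group(1), []).append(int(m.group(2)))
    return pio, io_errors


# A shortened copy of the log above, used only as sample input.
SAMPLE = """\
Aug 21 17:46:15 backup kernel: ata1: PIO error
Aug 21 17:46:15 backup kernel: ata1: status=0x50 { DriveReady SeekComplete }
Aug 21 17:46:21 backup kernel: ata2: PIO error
Aug 21 17:47:38 backup kernel: end_request: I/O error, dev sda, sector 2709447
Aug 21 17:48:08 backup kernel: end_request: I/O error, dev sda, sector 2709455
""".splitlines()

if __name__ == "__main__":
    pio, io_errors = summarize(SAMPLE)
    print("PIO errors per port:", dict(pio))
    print("I/O errors per device:", io_errors)
```

Applied to the full second excerpt, this kind of tally shows two PIO errors on each of ata1 to ata4, followed by four I/O errors that all hit sda, at sectors 8 apart (2709447 to 2709471) and 30 seconds apart.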

After a reboot, however, the disks show no bad sectors, which is what one would expect to find after errors like these. Additionally, I got a corrupted (and unrecoverable) ext2 filesystem on the raid-5 device after a running time of approx. 1 week. It seems that smartmontools issues some commands that interfere with disk I/O and can lead to filesystem (and worse, RAID) corruption.

Best Regards,

Georg


Portage 2.1-r1 (default-linux/x86/2006.0, gcc-3.4.6, glibc-2.3.6-r4, 2.6.17-gentoo-r4 i686)
=================================================================
System uname: 2.6.17-gentoo-r4 i686 AMD Sempron(tm) 2400+
Gentoo Base System version 1.6.15
app-admin/eselect-compiler: [Not Present]
dev-lang/python:     2.3.5, 2.4.3-r1
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     [Not Present]
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils:  2.16.1-r3
sys-devel/gcc-config: 1.3.13-r3
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=i686 -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/X11/xkb"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-O2 -march=i686 -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks metadata-transfer sandbox sfperms strict"
GENTOO_MIRRORS="http://linux.rz.ruhr-uni-bochum.de/download/gentoo-mirror/ ftp://ftp.wh2.tu-dresden.de/pub/mirrors/gentoo http://mirrors.sec.informatik.tu-darmstadt.de/gentoo/ http://ftp-stud.fht-esslingen.de/pub/Mirrors/gentoo/"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://rsync.de.gentoo.org/gentoo-portage"
USE="x86 X acpi acpi4linux alsa apache2 apm cli crypt dlloader dri eds emboss fortran gstreamer ipv6 isdnlog mp3 nptl ogg pam pcre png pppd qt3 qt4 readline reflection session spl ssl tcpd truetype-fonts type1-fonts udev vorbis xml xorg zlib elibc_glibc input_devices_keyboard input_devices_mouse input_devices_evdev kernel_linux userland_GNU"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS, PORTDIR_OVERLAY

Comment 1 Jakub Moc (RETIRED) gentoo-dev 2006-08-21 09:40:18 UTC

Reopen with /etc/smartd.conf and 'hdparm -I /dev/sd?' output attached.

Comment 2 georg.lippold 2006-08-21 12:39:50 UTC

Created attachment 94796
hdparm -I /dev/sd{a,b,c,d}

Comment 3 georg.lippold 2006-08-21 12:40:17 UTC

Created attachment 94797
/etc/smartd.conf

Comment 4 georg.lippold 2006-08-21 12:40:47 UTC

Attachments as requested.

Comment 5 SpanKY gentoo-dev 2006-08-21 16:11:54 UTC

since you've made your own custom config file, why don't you start with the stock one and see which options are causing you trouble

Comment 6 georg.lippold 2006-08-22 01:23:09 UTC

I won't use this software any more. It took me too long to track the bug down. Besides, it is only a nice add-on for RAID storage, not a necessity, since disk failures are not critical.

Here's some additional information: It is _always_ /dev/sda that fails, regardless of which disk is attached (I swapped them). Additionally, the disks attached to the controller are not detected in the correct order. The disk on port 1 is detected as /dev/sdd, the one on port 3 as /dev/sda, and so on. The incorrect detection may be a driver-specific issue and not related to the data corruption.

I strongly doubt that it has anything to do with the custom config file because I only set custom device checking intervals.

Regards,

Georg

Comment 7 Jakub Moc (RETIRED) gentoo-dev 2006-08-22 02:51:34 UTC

Well, to be honest I think your SATA cable or your controller is faulty, and this has nothing to do with smartmontools. Not much we can do here if you are not going to test anything for us.

Comment 8 georg.lippold 2006-08-22 03:19:16 UTC

The problem is that this is the backup server at work (and unfortunately it is a small business that cannot afford both a production and a testing system). I can definitely say that these errors never happened for about 1 year. They first occurred when I started using smartmontools about two weeks ago to additionally watch the disks for errors. Now that I have disabled smartmontools, everything works as expected. I am currently restoring as many backups as possible from our external hard disks and do not want to break things again. Sorry that testing is not possible. It may be an error in the SATA controller, but I rather think it is in the driver or in the way smartmontools accesses SATA disks. The error occurs almost exclusively under heavy load, such as rebuilding the array or copying large chunks of data to it.

Regards,

Georg

Comment 9 SpanKY gentoo-dev 2006-08-28 00:31:26 UTC

those PIO errors are normal ... it means you're trying to use an option the device does not understand

if you read more of your logs or just ran smartd with -d, you'd see something like:
Device: /dev/sdb, opened
Device: /dev/sdb, not found in smartd database.
Error SMART Enable Auto-save failed: Input/output error
Device: /dev/sdb, could not enable SMART Attribute Autosave.
Error SMART Enable Automatic Offline failed: Input/output error
Device: /dev/sdb, enable SMART Automatic Offline Testing failed.

and then the kernel would spit:
ata2: PIO error
ata2: status=0x50 { DriveReady SeekComplete }
for each feature that failed
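As a rough way to check this correspondence, the sketch below pulls the failed SMART features per device out of smartd debug output, so the count can be compared with the number of PIO error pairs in the kernel log. It assumes output lines formatted exactly like the sample quoted above; the function name `failed_features` is hypothetical and not part of smartmontools.

```python
import re
from collections import defaultdict

# Line formats taken from the smartd debug output quoted above.
OPENED_RE = re.compile(r"^Device: (\S+), opened$")
FAILED_RE = re.compile(r"^Error SMART (.+) failed: ")


def failed_features(lines):
    """Map each device to the SMART features smartd failed to enable."""
    failures = defaultdict(list)
    current = None  # device named by the most recent "opened" line
    for line in lines:
        line = line.strip()
        m = OPENED_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = FAILED_RE.match(line)
        if m and current is not None:
            failures[current].append(m.group(1))
    return dict(failures)


SAMPLE = """\
Device: /dev/sdb, opened
Device: /dev/sdb, not found in smartd database.
Error SMART Enable Auto-save failed: Input/output error
Device: /dev/sdb, could not enable SMART Attribute Autosave.
Error SMART Enable Automatic Offline failed: Input/output error
Device: /dev/sdb, enable SMART Automatic Offline Testing failed.
""".splitlines()

if __name__ == "__main__":
    for device, features in failed_features(SAMPLE).items():
        print(device, "->", features)
```

If all four disks behave like /dev/sdb here (an assumption, since only one device is quoted), that gives two failed features per disk and eight PIO error pairs in total, which is consistent with the two pairs per port on ata1 through ata4 in the first log excerpt of the description. These two features correspond to the `-S on` (attribute autosave) and `-o on` (automatic offline testing) directives in smartd.conf, so dropping them would likely be the first thing to try, provided the attached configuration actually sets them.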

as for the I/O errors, i'm pretty sure that is not smartd's fault