Bug 78762 - Maxtor 7Y250M0 250GB SATA sporadic drive failures (piix/ih6/libata/ata_piix)
Summary: Maxtor 7Y250M0 250GB SATA sporadic drive failures (piix/ih6/libata/ata_piix)
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: x86 Linux
Importance: High critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: http://forums.gentoo.org/viewtopic.ph...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-01-19 19:29 UTC by Gregg Casillo
Modified: 2005-01-28 03:52 UTC
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Description Gregg Casillo 2005-01-19 19:29:14 UTC
Follow the link to the forum thread, where I have documented my problem in full. In short, I have four servers running on Intel 865PERL motherboards with Maxtor 7Y250M0 250GB SATA drives, and two of them are now failing with drive errors fairly frequently. They had been running reliably on kernels up to the gentoo-dev-sources-2.6.9 series. The problems began when I updated to the 2.6.10-r4 and 2.6.10-r6 kernels. I am using the SCSI low-level drivers under libata and Intel PIIX (see my kernel config in the thread).

These are production MPEG encoders that are in great demand at the moment, so I can't afford this kind of downtime from stable packages. I may downgrade back to the latest 2.6.9 kernel, but any guidance would be appreciated. Also please take a look at those three photos of the error(s) I was seeing. Thanks.

Reproducible: Always
Steps to Reproduce:
1. Boot into a gentoo-dev-sources-2.6.10 kernel.
2. The machine may or may not boot normally.
3. At some point, the machine dies with the errors shown in the photos I took (links to them are in the forums thread).
4. Add another piece of hardware or software to my collection of things that do not work. ;-)
Actual Results:  
See step 4.

Expected Results:  
No drive failures.
Comment 1 Daniel Drake (RETIRED) gentoo-dev 2005-01-20 08:26:57 UTC
Could you please see if the issue remains in 2.6.11-rc1?
Comment 2 Gregg Casillo 2005-01-20 13:10:33 UTC
I may try the new kernel soon. Last night, I got one of the machines back up and running simply by power cycling it. After logging in, I stopped the Samba and NFS services. I wanted to see if I could encode some MPEG video files overnight without the machine crashing.

It worked. I came into work this morning to find some MPEG files waiting for me on a machine that hadn't crashed.

FWIW, this could be a Samba (very suspicious) or NFS (much less likely) issue. I very rarely use NFS to transfer files between these Gentoo Linux based MPEG encoder machines. Very, very rarely. However, I have a Samba share on each of these so that folks on my LAN (namely me) can view and manipulate the MPEG files.

A little background info: we have a program, FlipFactory, that monitors these Samba shares for new MPEG files. When it finds them, it transcodes them to WMV, RM, and MOV streaming media files. Those output files are written to a NAS server, so FlipFactory only reads the MPEG files from the Samba share.
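FlipFactory's internals aren't described here, so the following is only a generic sketch of how a watch-folder monitor like this might decide that a new MPEG file is ready to pick up. It assumes a simple polling design in which a file counts as complete once its size is unchanged between two scans. The function name `scan_ready_files`, the `state` dictionary, and the `.mpg`/`.mpeg` extension filter are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def scan_ready_files(folder, state, extensions=(".mpg", ".mpeg")):
    """Return newly completed files in `folder` (hypothetical helper).

    `state` holds {"sizes": {path: size}, "done": set_of_paths} between calls.
    A file counts as complete once it has the same nonzero size on two
    consecutive scans; each file is reported only once.
    """
    ready = []
    sizes = {}
    with os.scandir(folder) as entries:
        for entry in entries:
            if not (entry.is_file() and entry.name.lower().endswith(extensions)):
                continue
            size = entry.stat().st_size
            sizes[entry.path] = size
            unchanged = size > 0 and state["sizes"].get(entry.path) == size
            if unchanged and entry.path not in state["done"]:
                state["done"].add(entry.path)
                ready.append(entry.path)
    state["sizes"] = sizes  # remember sizes for the next scan
    return sorted(ready)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as share:
        state = {"sizes": {}, "done": set()}
        Path(share, "clip.mpg").write_bytes(b"\x00" * 1024)
        print(scan_ready_files(share, state))  # prints [] (size seen only once)
        print(scan_ready_files(share, state))  # prints a list ending in clip.mpg
```

A real monitor would call this periodically. Note that every scan is a directory listing plus a stat() per file on the share, so a busy monitor adds steady read traffic to the encoder.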

I don't think it is a coincidence that the two machines I have suffered drive failures on are the busiest of the four, with one in particular that encodes MPEG files most nights and gets hit by humans looking at those files by day.

On the other machine, which I am having to rebuild from scratch, I am installing 2.6.10-r6, but I am going to try an ext3 filesystem instead of reiserfs. I've used reiserfs for 2-3 years now with no major incidents, but I'm grasping at anything. Maybe it's a filesystem issue, but I have my doubts.

If you're interested, here's the /etc/samba/smb.conf from the box that I can get running with a power cycle (I removed any commented-out lines for brevity):

[global]
  workgroup = webteam
  netbios name = ketenc
  server string = MPEG Encoder
  log file = /var/log/samba3/log.%m
  max log size = 50
  map to guest = bad user
  security = user
  encrypt passwords = yes
  smb passwd file = /etc/samba/private/smbpasswd
  socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192
  dns proxy = no 
[video]
  path = /mnt/video
  public = yes
  only guest = yes
  writable = yes
  printable = no
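For reference, the `socket options` line above asks smbd to call setsockopt() with these options on its client sockets. The sketch below applies the same three options to a plain TCP socket using Python's standard `socket` module and reads them back. `apply_smb_socket_options` is a hypothetical helper for illustration, not Samba code.

```python
import socket


def apply_smb_socket_options(sock, rcvbuf=8192, sndbuf=8192):
    """Mimic 'socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192'.

    Hypothetical helper: sets the options, then returns what the kernel reports.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    return {
        "TCP_NODELAY": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY),
        "SO_RCVBUF": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        "SO_SNDBUF": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
    }


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(apply_smb_socket_options(s))
```

On Linux, getsockopt() usually reports SO_RCVBUF and SO_SNDBUF as double the requested value, because the kernel reserves the extra space for bookkeeping (see socket(7)). The printed buffer sizes may therefore read 16384 rather than 8192.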

Finally, whenever I've built a kernel over the last six months or so (dozens of times), I've seen the following when running "make menuconfig":

livecd linux # make menuconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/basic/split-include
  HOSTCC  scripts/basic/docproc
  SHIPPED scripts/kconfig/zconf.tab.h
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/mconf.o
  SHIPPED scripts/kconfig/zconf.tab.c
  SHIPPED scripts/kconfig/lex.zconf.c
  HOSTCC  scripts/kconfig/zconf.tab.o
  HOSTLD  scripts/kconfig/mconf
  HOSTCC  scripts/lxdialog/checklist.o
  HOSTCC  scripts/lxdialog/inputbox.o
  HOSTCC  scripts/lxdialog/lxdialog.o
  HOSTCC  scripts/lxdialog/menubox.o
  HOSTCC  scripts/lxdialog/msgbox.o
  HOSTCC  scripts/lxdialog/textbox.o
  HOSTCC  scripts/lxdialog/util.o
  HOSTCC  scripts/lxdialog/yesno.o
  HOSTLD  scripts/lxdialog/lxdialog
scripts/kconfig/mconf arch/i386/Kconfig
#
# using defaults found in arch/i386/defconfig
#
arch/i386/defconfig:129: trying to assign nonexistent symbol PM_DISK
arch/i386/defconfig:176: trying to assign nonexistent symbol PCI_USE_VECTOR
arch/i386/defconfig:252: trying to assign nonexistent symbol BLK_DEV_CARMEL
arch/i386/defconfig:273: trying to assign nonexistent symbol IDE_TASKFILE_IO
arch/i386/defconfig:292: trying to assign nonexistent symbol BLK_DEV_ADMA
arch/i386/defconfig:...: trying to assign nonexistent symbol SCSI_MEGARAID
arch/i386/defconfig:...: trying to assign nonexistent symbol NET_FASTROUTE
arch/i386/defconfig:571: trying to assign nonexistent symbol NET_HW_FLOWCONTROL
arch/i386/defconfig:777: trying to assign nonexistent symbol QIC02_TAPE
arch/i386/defconfig:1248: trying to assign nonexistent symbol X86_STD_RESOURCES
*** End of Linux kernel configuration.
*** Execute 'make' to build the kernel or try 'make help'.
livecd linux #

I don't know if that's worth fretting over. It hasn't caused any problems that I'm aware of; I'm just throwing out anything that might be suspicious.
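These warnings come from kconfig reading `arch/i386/defconfig`, which still assigns symbols that no longer exist in this kernel's Kconfig files. kconfig generally ignores such assignments. One hedged way to double-check is to pull the symbol names out of the output and see whether any are enabled in a `.config` carried over from an older kernel. The helper names and the sample `.config` text below are hypothetical.

```python
import re

# Matches e.g. "arch/i386/defconfig:129: trying to assign nonexistent symbol PM_DISK"
WARNING_RE = re.compile(
    r"^(?P<file>\S+):(?P<line>\d+): trying to assign nonexistent symbol (?P<sym>\w+)$"
)
ENABLED_RE = re.compile(r"^CONFIG_(\w+)=(y|m)$")


def parse_nonexistent_symbols(log_text):
    """Return (file, line, symbol) tuples from kconfig warning lines."""
    found = []
    for raw in log_text.splitlines():
        m = WARNING_RE.match(raw.strip())
        if m:
            found.append((m["file"], int(m["line"]), m["sym"]))
    return found


def symbols_enabled_in_config(symbols, config_text):
    """Return the given symbols that are set to y or m in a .config text."""
    enabled = set()
    for raw in config_text.splitlines():
        m = ENABLED_RE.match(raw.strip())
        if m:
            enabled.add(m.group(1))
    return sorted(s for s in symbols if s in enabled)


if __name__ == "__main__":
    log = (
        "arch/i386/defconfig:129: trying to assign nonexistent symbol PM_DISK\n"
        "arch/i386/defconfig:777: trying to assign nonexistent symbol QIC02_TAPE\n"
    )
    old_config = "CONFIG_PM_DISK=y\n# CONFIG_QIC02_TAPE is not set\n"  # made-up sample
    warnings = parse_nonexistent_symbols(log)
    print(warnings)
    print(symbols_enabled_in_config([sym for _, _, sym in warnings], old_config))  # ['PM_DISK']
```

An empty second list means none of the dropped symbols were in use. A non-empty one only means the old config referred to options that this kernel no longer offers.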
Comment 3 Gregg Casillo 2005-01-20 14:34:01 UTC
Now I very much feel this is related to Samba. I just started the Samba service on the machine that encoded some files last night. The share mounted, and I attempted to copy some files to my laptop across the network. I got a couple of files, but then the machine froze. I went to the physical machine and attempted to log in: it was locked up tight.

I'm going to stop Samba services on each of my encoders and see if I can make it through the weekend. If I can, then I'm going to point the finger at Samba. I have had version 3.0.9-r1 installed, and I recently updated to 3.0.10 on most if not all of them. This problem has spanned installation of both versions for me.
Comment 4 Gregg Casillo 2005-01-20 16:26:40 UTC
It just occurred to me that I didn't supply the link to the forums thread:
http://forums.gentoo.org/viewtopic.php?t=282362
Comment 5 Gregg Casillo 2005-01-25 17:57:12 UTC
Alright, I restarted Samba services on my MPEG encoders after making a change to FlipFactory. I was using regular "Network Folder" monitor as opposed to "Network Folder (Samba)." In short, FF is now using Samba protocols to access my shares instead of regular Windows networking (I think). No drive failures in two days now.

This would appear to be a FlipFactory/Samba bug, not something in Gentoo Linux. I should have used the Samba related monitor instead of the vanilla network folder monitor.
Comment 6 Daniel Drake (RETIRED) gentoo-dev 2005-01-28 03:52:43 UTC
Please report this to the relevant upstream developers.