The aacraid module fails if the underlying controller is running in any mode other than Optimal. If the card is in Optimal mode, the drives work fine. However, if one of the drives is missing, the system can no longer access the good drive. The controller is detected, but it can't be accessed with mount or fdisk. Once the second drive is brought back online and is no longer reinitializing, it works again. It won't even work if both drives are present and syncing.

Reproducible: Always

Steps to Reproduce:
1. Use the 2.6 series kernel and compile in the aacraid module
2. Detach one of the fail-safe (RAID-1) drives
3. The system loads the boot menu and the kernel, but partitions can't be mounted

Actual Results:
The machine will only boot with kernel-2.4. kernel-2.6 just hangs and gives up when trying to mount the root filesystem. Note that this happens only if the controller is operating in some mode other than Optimal. Again, kernel-2.4 works.

Snippet from dmesg:

Red Hat/Adaptec aacraid driver (1.1.2-lk2 Jan 27 2005)
ACPI: PCI interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 24
AAC0: kernel 4.1.4 build 7403
AAC0: monitor 4.1.4 build 7403
AAC0: bios 4.1.0 build 7403
AAC0: serial bfccd5fafaf001
scsi0 : aacraid
  Vendor: DELL  Model: DATA 1  Rev: V1.0
  Type: Direct-Access  ANSI SCSI revision: 02

Expected Results:
The disks should be mounted and accessed normally while the one good drive is still available.
Which kernels have you tried? Please try 2.6.11_rc2 if you haven't already.
2.6.11_rc2 seems to work, but 2.6.11_rc1 does not! So it looks like whatever fixed it is very new. Here is the information from the rc2 kernel:

Red Hat/Adaptec aacraid driver (1.1.2-lk2 Jan 28 2005)
ACPI: PCI interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 24
AAC0: kernel 4.1.4 build 7403
AAC0: monitor 4.1.4 build 7403
AAC0: bios 4.1.0 build 7403
AAC0: serial bfccd5fafaf001
scsi0 : aacraid
  Vendor: DELL  Model: DATA 1  Rev: V1.0
  Type: Direct-Access  ANSI SCSI revision: 02

=======================================================================

The rc1 kernel had this information:

Red Hat/Adaptec aacraid driver (1.1.2-lk2 Jan 28 2005)
ACPI: PCI interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 24
AAC0: kernel 4.1.4 build 7403
AAC0: monitor 4.1.4 build 7403
AAC0: bios 4.1.0 build 7403
AAC0: serial bfccd5fafaf001
scsi0 : aacraid
  Vendor: DELL  Model: DATA 1  Rev: V1.0
  Type: Direct-Access  ANSI SCSI revision: 02

===========================================================================

I don't see any difference above. The changelog between rc1 and rc2 had this to say regarding the aacraid driver:

<coughlan@redhat.com>
[PATCH] aacraid: remove aac_handle_aif

When aac_command_thread detects an adapter event (AifCmdDriverNotify or AifCmdEventNotify) it calls aac_handle_aif. This routine sets a flag, calls fib_adapter_complete, and returns. The bad news is that after the return, aac_command_thread continues to process the command and calls fib_adapter_complete again. Under some circumstances this causes the driver to take the device offline. In my case, it happens with a Dell CERC SATA with a RAID 5 in the "building" state:

aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter appears dead
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 976537592

Mark Salyzyn says the intent is for aac_handle_aif to perform some plug-n-play actions based on the adapter event, and return, leaving the command completion to the caller. The attached patch solves the problem by removing aac_handle_aif entirely, since it is wrong, and there is currently no code in the driver to actually do anything with these events.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>

==========================================================================

I hope someone can assess this and figure out if the module is now stable.
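For anyone following along, the double completion described in that changelog entry can be modelled in a few lines of plain C. This is only an illustrative user-space sketch, not the driver source: struct fib_sketch, complete_fib, handle_aif_buggy and command_thread_step are hypothetical stand-ins for the real FIB structure, fib_adapter_complete, aac_handle_aif and the loop body of aac_command_thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a driver command block (FIB); not the real struct. */
struct fib_sketch {
    int completions; /* how many times this FIB was handed back to the adapter */
};

/* Stand-in for fib_adapter_complete(): here it only counts the hand-back. */
static void complete_fib(struct fib_sketch *fib)
{
    fib->completions++;
}

/* Shape of the removed handler as the changelog describes it:
 * it completes the FIB itself and then returns to the caller. */
static void handle_aif_buggy(struct fib_sketch *fib)
{
    complete_fib(fib);
}

/* One pass of the command loop. The caller always completes the FIB,
 * so calling the handler for an adapter event completes it twice.
 * Returns the number of completions seen for this FIB. */
static int command_thread_step(int is_adapter_event, int call_handler)
{
    struct fib_sketch fib = { 0 };

    if (is_adapter_event && call_handler)
        handle_aif_buggy(&fib); /* first completion (pre-patch behaviour) */

    complete_fib(&fib);         /* completion by the caller */
    return fib.completions;
}
```

With the handler in place, command_thread_step(1, 1) reports two completions for a single adapter event; with the handler call dropped, as the patch does, command_thread_step(1, 0) reports one. Ordinary commands are completed once either way.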
Please apply this patch against 2.6.10 and let us know if it fixes it: http://linux.bkbits.net:8080/linux-2.6/gnupatch@41d9b79fvwpNcY-B1BabPhtde7JhuA
The adaptec-1.1-5 aacraid driver plus a couple of fixes has been tested to work with gentoo-dev-sources-2.6.10-r5. I had 2 servers whose aacraid drivers kept crashing with the stock driver. The servers have now been running stable for a week with no problems after switching to 1.1-5+fixes. Since this is bugzilla, I made a patch from the 2.6.10-gentoo-r5 kernel to what I have currently. If you want me to send a tar.bz2 of the aacraid directory, I can do that too.
Created attachment 49889: aacraid-1.1.2lk-to-aacraid-1.1.5w26patches
I'm not keen on applying such a big patch. I'd rather just apply the fix alone, which is probably the patch I mentioned in comment #3. If someone could confirm whether that alone solves the problem it would be appreciated.
I have applied the patch from comment 3 and I can now use the module normally. I applied the patch to the 2.6.10 versions of vanilla-sources and gentoo-sources, and both work. I have also done some load testing in single-drive and dual-drive operation and have not seen any adverse effects. So far the server is running stable. Will this be added to gentoo-sources any time soon?
Thanks for testing. Yes, it will go in the next release.
Fixed in gentoo-dev-sources-2.6.10-r7