Kernel oops when loading SCSI drivers and polling.

Reproducible: Always

Steps to Reproduce:
1. Boot with enough different host adapters, with devices on them.

Expected Results:
Should load the SCSI devices.

I have these:
Adaptec AHA-2940U/UW / AHA-39xx / AIC-7895 (rev 03)
04:01.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 Ultra3 SCSI Adapter (rev 01)
04:02.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
and an AHCI SATA controller, in an ML150 (G3) server.

I think the number of cards must be relevant or everyone would be screaming blue murder by now.

Version 2.6.22-gentoo-r8 (SMP) works fine.
Version 2.6.22-gentoo-r9 (SMP) works if I switch off the external devices on the Adaptec controller at boot time. I can switch them back on and rescan the bus OK.
Vanilla 2.6.23.8 compiled as SMP fails.
Vanilla 2.6.23.8 compiled as UP works fine!

The reason I'm not too concerned is that vanilla 2.6.24-rc3 compiled as SMP works fine. So something got broken and then quietly fixed.

This bug report filed in the public interest.
--john
OK, now that we have a reference to this bug to help anyone who might hit it, we can close it.
Thanks for reporting, but we prefer to leave these issues open until they are fixed in released kernels. It may also be possible to locate and backport the fix. I'm a bit low on time right now, but hopefully I or someone else will look at this soon.
Thanks for taking this seriously. It's occurred to me that it may be hard for others to test since my hardware is not mainstream. So I've studied up on git-bisect to see if I can identify the problem patch. This is all new to me, so you might see some dumb questions.

Regards,
John
OK, after a mind-deadening series of reboots...

Bisect of Linus's tree between v2.6.22 and v2.6.23. The bisect log is:

git-bisect start
# good: [7dcca30a32aadb0520417521b0c44f42d09fe05c] Linux 2.6.22
git-bisect good 7dcca30a32aadb0520417521b0c44f42d09fe05c
# bad: [bbf25010f1a6b761914430f5fca081ec8c7accd1] Linux 2.6.23
git-bisect bad bbf25010f1a6b761914430f5fca081ec8c7accd1
# good: [0a87cf128f3d3bc6aa7b1040e73109c974ed875a] NFSv4: handle lack of clientaddr in option string
git-bisect good 0a87cf128f3d3bc6aa7b1040e73109c974ed875a
# good: [878701db07db3f0b59f14f0c525b681e4ca81551] Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6
git-bisect good 878701db07db3f0b59f14f0c525b681e4ca81551
# good: [f2154eef2a926435cdf79156cd361092d6cba91e] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6
git-bisect good f2154eef2a926435cdf79156cd361092d6cba91e
# good: [2cc7345ff71b27b5ac99e49ad7de39360042f601] [SPARC64]: Fix booting on V100 systems.
git-bisect good 2cc7345ff71b27b5ac99e49ad7de39360042f601
# good: [6110e02b97377a2903853faf3ecaff0e742fbe93] Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
git-bisect good 6110e02b97377a2903853faf3ecaff0e742fbe93
# good: [2bcff60f7ce88c09a2bc1302ff14510737bfcb7b] mv643xx_eth: Check ETH_INT_CAUSE_STATE bit
git-bisect good 2bcff60f7ce88c09a2bc1302ff14510737bfcb7b
# bad: [c7659e2c139d0be4647bef89188a932e0254d709] Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
git-bisect bad c7659e2c139d0be4647bef89188a932e0254d709
# good: [f662fe5a0b144efadbfc00e8040e603ec318746e] dm9601: Fix receive MTU
git-bisect good f662fe5a0b144efadbfc00e8040e603ec318746e
# good: [2910ca6f8ae69648623b3c05b79be87dd7bda73d] Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev
git-bisect good 2910ca6f8ae69648623b3c05b79be87dd7bda73d
# good: [c58c2140f08de4ad0b0dbd48f6e78168dc321042] Blackfin arch: gpio pinmux and resource allocation API required by BF537 on chip ethernet mac driver
git-bisect good c58c2140f08de4ad0b0dbd48f6e78168dc321042
# good: [fef74705ea310acd716c2722bfeb0f796cf23640] [MIPS] Type proof reimplementation of cmpxchg.
git-bisect good fef74705ea310acd716c2722bfeb0f796cf23640
# good: [66b1f1a982bf4dbad9fa0de25b8d95c4936f05c4] Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/cooloney/blackfin-2.6
git-bisect good 66b1f1a982bf4dbad9fa0de25b8d95c4936f05c4
# good: [9ea0f043fec38fadb0101fbf29563a5635f42e93] [MIPS] Terminally fix local_{dec,sub}_if_positive
git-bisect good 9ea0f043fec38fadb0101fbf29563a5635f42e93

I wish I knew what it meant...

Regards,
John
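For anyone else reading a bisect log like the one above, the useful facts are the last commit marked bad and the commits marked good, since the suspect change lies between them. The following is only a minimal sketch, assuming the log has been saved as plain text with one entry per line; the function name summarize_bisect_log is made up for illustration and is not part of git.

```python
import re

# Matches the command lines of a bisect log, e.g.
# "git-bisect good 7dcc..." or "git bisect bad bbf2...".
MARK = re.compile(r"^git[- ]bisect (good|bad) ([0-9a-f]{7,40})$")


def summarize_bisect_log(text):
    """Return (good_commits, last_bad_commit) from a bisect log.

    Hypothetical helper for illustration; not a git command.
    """
    good, last_bad = [], None
    for line in text.splitlines():
        m = MARK.match(line.strip())
        if not m:
            continue  # skips "# good: [...]" comments and "git-bisect start"
        verdict, sha = m.groups()
        if verdict == "good":
            good.append(sha)
        else:
            last_bad = sha  # each later "bad" mark narrows the range
    return good, last_bad


# Two entries copied from the log above, for illustration only.
sample = """git-bisect start
# bad: [c7659e2c139d0be4647bef89188a932e0254d709] Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
git-bisect bad c7659e2c139d0be4647bef89188a932e0254d709
# good: [9ea0f043fec38fadb0101fbf29563a5635f42e93] [MIPS] Terminally fix local_{dec,sub}_if_positive
git-bisect good 9ea0f043fec38fadb0101fbf29563a5635f42e93
"""

good, last_bad = summarize_bisect_log(sample)
print("last commit marked bad:", last_bad)
print("most recent commit marked good:", good[-1])
```

Run against the full log, this would report c7659e2c139d0be4647bef89188a932e0254d709 as the last commit marked bad. Whether that commit actually introduced the problem is a separate question that the log alone cannot answer.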
I've just tried 2.6.24-rc4 and the bug is back. This has given me reason to reconsider what is happening.

The event that brought the problem to light was the installation of a secondhand Storagetek L80 tape library. This has two DLT8000 drives on a HV-Differential bus, which needed a special card, an Adaptec 3944AUWD. The kernel I was running at that time was 2.6.22-gentoo-r8. It worked fine. Then when -r9 came out and this error manifested, the assumption was that -r9 was broken.

I no longer think this to be the case. I think they are _ALL_ broken, possibly going way back toward the start of the 2.6 series. I think that the bug may or may not manifest depending on the internal layout of data in the kernel, a true heisenbug. All that the git bisect did was to change the internal layout, not add or remove a bad patch.

This explains why I could take the 2.6.23.8 kernel, compile it for SMP and have it fail, then compile it for UP and have it work. Initially I thought that meant a locking or race issue. Now I think it was just another case of altering the internal kernel layout.

Also, adding the "nosmp" kernel option really doesn't work. I just get kernel oopses. I bet nosmp hasn't worked for ages, but nobody cares, even me.

I would like advice on what to do next.

Regards,
John
Hi, it looks like it's a puzzler. Can we get it referred upstream?
OK, so, to clarify, the problem started appearing after you installed the Storagetek L80/Adaptec 3944AUWD. If you boot up with the devices ON, the kernel crashes, but if you boot up with them off and turn them on later, it does not crash. Is that accurate? Please could you post the oops message or whatever error is happening. If it hangs the system, you could capture the message with a digital camera or serial console.
Hi Daniel,

I've re-entered the bug here: http://bugzilla.kernel.org/show_bug.cgi?id=9775 in an attempt to get wider attention. There is a screenshot there.

I'm trying to document what happens with all combinations of hardware, and it's turning out to be a pain, not least because my window of opportunity to work on the system has shrunk. Hopefully I'll have more information in a week. I'll stick with bugzilla.kernel.org since it does seem to relate to the AHA7XXX driver.

Regards,
john
Thanks, will watch the upstream bug.