Bug 200708 - >=gentoo-sources-2.6.22-r9 - HOST_MSG_LOOP invalid SCB ff
Summary: >=gentoo-sources-2.6.22-r9 - HOST_MSG_LOOP invalid SCB ff
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: High major
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: http://bugzilla.kernel.org/show_bug.c...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-29 00:44 UTC by John Huttley
Modified: 2011-06-28 09:54 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Description John Huttley 2007-11-29 00:44:42 UTC
Kernel oops when loading SCSI drivers and polling.


Reproducible: Always

Steps to Reproduce:
1. Boot with enough different host adapters, with devices on them


Expected Results:  
Should load the scsi devices

I have these
Adaptec AHA-2940U/UW / AHA-39xx / AIC-7895 (rev 03)
04:01.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 Ultra3 SCSI Adapter (rev 01)
04:02.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)

and an AHCI SATA controller, in an ML150 (G3) server.
I think the number of cards must be relevant or everyone would be screaming blue murder by now.

Version 2.6.22-gentoo-r8 (SMP) works fine.
Version 2.6.22-gentoo-r9 (SMP) works if I switch off the external devices on the Adaptec controller at boot time. I can switch them back on and rescan the bus OK.
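
For readers who have not rescanned a SCSI bus by hand: the report does not say which method was used, so the following is only a sketch of one common approach on Linux, which is writing the wildcard string "- - -" (channel, target, LUN) to each host's sysfs `scan` file. The function name `rescan_scsi_hosts` and its `sysfs_root` parameter are illustrative, not taken from the report.

```python
import glob
import os


def rescan_scsi_hosts(sysfs_root="/sys/class/scsi_host"):
    """Ask every SCSI host found under sysfs_root to rescan its bus.

    Writing "- - -" (wildcards for channel, target and LUN) to a host's
    `scan` file requests a probe of all addresses on that host.
    Returns the list of scan files that were written.
    The sysfs_root parameter exists so the function can be tried on a
    scratch directory; this is an assumption of the sketch.
    """
    written = []
    pattern = os.path.join(sysfs_root, "host*", "scan")
    for scan_file in sorted(glob.glob(pattern)):
        with open(scan_file, "w") as f:
            f.write("- - -\n")
        written.append(scan_file)
    return written
```

On a real system this needs root, and calling `rescan_scsi_hosts()` with no arguments touches every host adapter, so it is worth trying against a temporary directory first.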

vanilla 2.6.23.8 compiled as SMP fails.
vanilla 2.6.23.8 compiled as UP works fine!

The reason I'm not too concerned is that vanilla 2.6.24-rc3 compiled as SMP works fine.

So something got broken and then quietly fixed

This bug report filed in the public interest.
--john
Comment 1 John Huttley 2007-12-02 01:42:15 UTC
OK, now that we have a reference to this bug to help anyone who might hit it, we can close it.
Comment 2 Daniel Drake (RETIRED) gentoo-dev 2007-12-02 12:02:30 UTC
Thanks for reporting, but we prefer to leave these issues open until they are fixed in released kernels.

It may also be possible to locate and backport the fix. I'm a bit low on time right now, but hopefully I or someone else will look at this soon.
Comment 3 John Huttley 2007-12-04 01:53:06 UTC
Thanks for taking this seriously. 
It's occurred to me that it may be hard for others to test since my hardware is not mainstream.

Thus I've studied up on git-bisect to see if I can identify the problem patch.
This is all new to me, so you might see some dumb questions.

Regards,
John
Comment 4 John Huttley 2007-12-05 05:37:25 UTC
OK, after a mind-deadening series of reboots...
I bisected Linus's tree between v2.6.22 and v2.6.23.
The bisect log is:

git-bisect start
# good: [7dcca30a32aadb0520417521b0c44f42d09fe05c] Linux 2.6.22
git-bisect good 7dcca30a32aadb0520417521b0c44f42d09fe05c
# bad: [bbf25010f1a6b761914430f5fca081ec8c7accd1] Linux 2.6.23
git-bisect bad bbf25010f1a6b761914430f5fca081ec8c7accd1
# good: [0a87cf128f3d3bc6aa7b1040e73109c974ed875a] NFSv4: handle lack of clientaddr in option string
git-bisect good 0a87cf128f3d3bc6aa7b1040e73109c974ed875a
# good: [878701db07db3f0b59f14f0c525b681e4ca81551] Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6
git-bisect good 878701db07db3f0b59f14f0c525b681e4ca81551
# good: [f2154eef2a926435cdf79156cd361092d6cba91e] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6
git-bisect good f2154eef2a926435cdf79156cd361092d6cba91e
# good: [2cc7345ff71b27b5ac99e49ad7de39360042f601] [SPARC64]: Fix booting on V100 systems.
git-bisect good 2cc7345ff71b27b5ac99e49ad7de39360042f601
# good: [6110e02b97377a2903853faf3ecaff0e742fbe93] Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
git-bisect good 6110e02b97377a2903853faf3ecaff0e742fbe93
# good: [2bcff60f7ce88c09a2bc1302ff14510737bfcb7b] mv643xx_eth: Check ETH_INT_CAUSE_STATE bit
git-bisect good 2bcff60f7ce88c09a2bc1302ff14510737bfcb7b
# bad: [c7659e2c139d0be4647bef89188a932e0254d709] Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
git-bisect bad c7659e2c139d0be4647bef89188a932e0254d709
# good: [f662fe5a0b144efadbfc00e8040e603ec318746e] dm9601: Fix receive MTU
git-bisect good f662fe5a0b144efadbfc00e8040e603ec318746e
# good: [2910ca6f8ae69648623b3c05b79be87dd7bda73d] Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev
git-bisect good 2910ca6f8ae69648623b3c05b79be87dd7bda73d
# good: [c58c2140f08de4ad0b0dbd48f6e78168dc321042] Blackfin arch: gpio pinmux and resource allocation API required by BF537 on chip ethernet mac driver
git-bisect good c58c2140f08de4ad0b0dbd48f6e78168dc321042
# good: [fef74705ea310acd716c2722bfeb0f796cf23640] [MIPS] Type proof reimplementation of cmpxchg.
git-bisect good fef74705ea310acd716c2722bfeb0f796cf23640
# good: [66b1f1a982bf4dbad9fa0de25b8d95c4936f05c4] Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/cooloney/blackfin-2.6
git-bisect good 66b1f1a982bf4dbad9fa0de25b8d95c4936f05c4
# good: [9ea0f043fec38fadb0101fbf29563a5635f42e93] [MIPS] Terminally fix local_{dec,sub}_if_positive
git-bisect good 9ea0f043fec38fadb0101fbf29563a5635f42e93


I wish I knew what it meant...
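
One way to read a log like this mechanically: each "# good:" or "# bad:" comment records a verdict, and the most recent "bad" verdict is the tightest upper bound on where the failure first shows up. The sketch below pulls those verdicts out of the log text. The names `VERDICT` and `summarize_bisect_log` are illustrative, and the sample is only an excerpt of the log above.

```python
import re

# Matches the comment lines found in a bisect log, for example
# "# good: [7dcca30a32aa...] Linux 2.6.22".
VERDICT = re.compile(r"^# (good|bad): \[([0-9a-f]{40})\] (.*)$")


def summarize_bisect_log(text):
    """Return (verdicts, newest_bad) for the text of a bisect log.

    verdicts   : list of (verdict, sha, subject) tuples in log order
    newest_bad : (sha, subject) of the last commit marked bad, or None
    """
    verdicts = []
    for line in text.splitlines():
        match = VERDICT.match(line.strip())
        if match:
            verdicts.append(match.groups())

    newest_bad = None
    for verdict, sha, subject in verdicts:
        if verdict == "bad":
            newest_bad = (sha, subject)
    return verdicts, newest_bad


# Excerpt of the log above, used only as sample input.
SAMPLE = """\
# good: [7dcca30a32aadb0520417521b0c44f42d09fe05c] Linux 2.6.22
git-bisect good 7dcca30a32aadb0520417521b0c44f42d09fe05c
# bad: [bbf25010f1a6b761914430f5fca081ec8c7accd1] Linux 2.6.23
git-bisect bad bbf25010f1a6b761914430f5fca081ec8c7accd1
# bad: [c7659e2c139d0be4647bef89188a932e0254d709] Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
git-bisect bad c7659e2c139d0be4647bef89188a932e0254d709
# good: [f662fe5a0b144efadbfc00e8040e603ec318746e] dm9601: Fix receive MTU
git-bisect good f662fe5a0b144efadbfc00e8040e603ec318746e
"""

verdicts, newest_bad = summarize_bisect_log(SAMPLE)
```

In the full log the last commit marked bad is the linux-mips merge, and every commit tested after it was marked good, so the remaining suspects are that merge and whatever it brought in that has not yet been tested.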

Regards
John


Comment 5 John Huttley 2007-12-08 00:39:40 UTC
I've just tried 2.6.24-rc4 and the bug is back.
This has given me reason to reconsider what is happening.

The event that brought the problem to light was the installation of a secondhand Storagetek L80
tape library. This has two DLT8000 drives on an HV-Differential bus.
This needed a special card, an Adaptec 3944AUWD.
The kernel I was running at that time was 2.6.22-gentoo-r8.
It worked fine. Then when -r9 came out and this error manifested, the assumption
was that -r9 was broken.

I no longer think this to be the case.

I think they are _ALL_ broken, possibly going way back toward the start of the 2.6 series.
I think that the bug may or may not manifest depending on the internal layout of data in the kernel
--A true heisenbug--

All that the git bisect did was to change the internal layout, not add/remove a bad patch.

This explains why I could take the 2.6.23.8 kernel, compile it for SMP and have it fail,
then compile it for UP and have it work. Initially I thought that meant a locking or race issue.
Now I think it was just another case of altering the internal kernel layout.
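
The reasoning above can be illustrated with a toy model. Bisection assumes that once a commit is bad, every later commit is bad too. The sketch below is a simplification (a linear history of numbered commits, not the real kernel DAG), and both failure rules are invented for illustration. It shows that a failure which comes and goes makes bisection settle on a commit that is not the origin of the bug.

```python
def bisect(is_bad, good, bad):
    """Binary search over a linear history of numbered commits.

    Assumes commit `good` passes and commit `bad` fails, and returns the
    commit the search ends up blaming. The answer is only meaningful if
    every commit after the first bad one is also bad.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad


def regression(commit):
    # Hypothetical real regression: introduced at commit 37, never fixed.
    return commit >= 37


def layout_dependent(commit):
    # Hypothetical heisenbug: the defect is always present but only shows
    # up for some builds. An arbitrary rule stands in for "unlucky layout".
    return commit % 9 == 1


blamed_for_regression = bisect(regression, 0, 100)
blamed_for_heisenbug = bisect(layout_dependent, 0, 100)
```

With the monotonic rule the search lands on the commit that introduced the failure. With the layout-dependent rule it still returns a single commit, even though much earlier commits fail as well, which is the situation described above.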

And adding the "nosmp" kernel option really doesn't work. I just get kernel oopses. I bet nosmp hasn't worked
for ages, but nobody cares, even me.

I would like advice on what to do next.

Regards,
John

Comment 6 John Huttley 2007-12-17 05:24:48 UTC
Hi,
It looks like it's a puzzler. Can we get it referred upstream?
Comment 7 Daniel Drake (RETIRED) gentoo-dev 2008-01-25 14:13:48 UTC
OK, so, to clarify, the problem started appearing after you installed the Storagetek L80/Adaptec 3944AUWD. If you boot up with the devices ON, the kernel crashes, but if you boot up with them off and turn them on later, it does not crash. Is that accurate?

Please could you post the oops message or whatever error is happening. If it hangs the system, you could capture the message with a digital camera or serial console.
Comment 8 John Huttley 2008-01-31 00:37:04 UTC
Hi Daniel,
I've reentered the bug here

http://bugzilla.kernel.org/show_bug.cgi?id=9775

in an attempt to get wider attention.

There is a screenshot there.

I'm trying to document what happens with all combinations of hardware, and it's turning out to be a pain, not least because my window of opportunity to work on the system has shrunk.

Hopefully I'll have more information in a week.
I'll keep with bugzilla.kernel.org since it does seem to relate to the AHA7XXX driver.
Regards,
john
Comment 9 Daniel Drake (RETIRED) gentoo-dev 2008-02-13 16:43:23 UTC
Thanks, will watch the upstream bug.