Being looked into upstream, but would appreciate a temporary fix in Gentoo-sources :)

From: Andrew Morton <akpm@osdl.org>
To: Hendrik Visage <hvjunk@gmail.com>
Cc: linux-net@vger.kernel.org, linux-kernel@vger.kernel.org, ionut@badula.org, Jeff Garzik <jgarzik@pobox.com>
Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
Message-Id: <20050930104046.4685e975.akpm@osdl.org>
In-Reply-To: <d93f04c70509300901s3836b8afw4792d16c589b4fc4@mail.gmail.com>
References: <d93f04c70509292036x269df799y7b51c5be9c3356d6@mail.gmail.com>
 <20050929211649.69eaddee.akpm@osdl.org>
 <d93f04c70509300901s3836b8afw4792d16c589b4fc4@mail.gmail.com>
X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; i386-redhat-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, hits=0 required=5 tests=
X-Spam-Checker-Version: SpamAssassin 2.63-osdl_revision__1.45__
X-MIMEDefang-Filter: osdl$Revision: 1.118 $
X-Scanned-By: MIMEDefang 2.36

Hendrik Visage <hvjunk@gmail.com> wrote:
>
> On 9/30/05, Andrew Morton <akpm@osdl.org> wrote:
>
> > The starfire changes in 2.6.12->2.6.13 look fairly innocuous. Need that
> > trace, please.
>
> See attached :)
>

It helps, thanks.
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at net/core/dev.c:1099
> invalid operand: 0000 [1] PREEMPT
> CPU 0
> Modules linked in: nvidia nfsd exportfs lockd sunrpc rfcomm l2cap hci_usb bluetooth starfire mii snd_ac97_bus soundcore snd_page_alloc forcedeth i2c_nforce2 dm_mirror dm_mod sbp2 ohci1394 ieee1394 ohci_hcd uhci_hcd usb_storage usbhid ehci_hcd usbcore
> Pid: 11252, comm: nfsd Tainted: P 2.6.14-rc2 #3
> RIP: 0010:[<ffffffff802cc7ed>] <ffffffff802cc7ed>{skb_checksum_help+157}
> RSP: 0000:ffff81003a0bd998 EFLAGS: 00010246
> RAX: ffff81003ff01624 RBX: ffff81003ca7f180 RCX: 00000000b7e42194
> RDX: 00000000b7e42194 RSI: ffff81003ff01624 RDI: ffff81003b026080
> RBP: ffff81003a0bd9b8 R08: 0000000000000000 R09: 0000000000000004
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff81003ca7f180 R15: ffff81003d462218
> FS: 00002aaaaade6ae0(0000) GS:ffffffff804fe800(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaac2000 CR3: 000000003d5a2000 CR4: 00000000000006e0
> Process nfsd (pid: 11252, threadinfo ffff81003a0bc000, task ffff81003e0ed0c0)
> Stack: ffffffff804cd720 ffff81003d462000 ffff81003d4623e0 ffff81003ca7f180
> ffff81003a0bda08 ffffffff88104944 ffff81003d462218 000000013a2a8600
> ffff81003d462000 ffff81003d462000
> Call Trace:<ffffffff88104944>{:starfire:start_tx+164} <ffffffff802db0fc>{qdisc_restart+268}
> <ffffffff802ccad0>{dev_queue_xmit+288} <ffffffff802d29b0>{neigh_resolve_output+672}
> <ffffffff802ebb27>{ip_finish_output+455} <ffffffff802ec5ff>{ip_fragment+863}
> <ffffffff802eb960>{ip_finish_output+0} <ffffffff802eca6c>{ip_output+108}

yep, there's something wrong with the skb which starfire fed into skb_checksum_help().

	offset = skb->tail - skb->h.raw;
	if (offset <= 0)
		BUG();

And that's a post-2.6.12 driver change. You can probably work around it by deleting the #define ZEROCOPY line.
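For anyone following along, here is a minimal userspace sketch of the check Andrew quotes. It is not kernel code: `struct toy_skb`, `toy_checksum_help_would_bug()` and `toy_would_bug()` are hypothetical names made up for illustration, and the two pointers only stand in for `skb->h.raw` and `skb->tail`. It shows when the `offset <= 0` condition would fire.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two sk_buff fields used by the quoted check. */
struct toy_skb {
    unsigned char *transport_hdr; /* plays the role of skb->h.raw */
    unsigned char *tail;          /* plays the role of skb->tail  */
};

/* Returns 1 when the quoted condition (offset <= 0) would reach BUG(), else 0. */
int toy_checksum_help_would_bug(const struct toy_skb *skb)
{
    ptrdiff_t offset = skb->tail - skb->transport_hdr;
    return offset <= 0;
}

/* Convenience wrapper: build a toy skb from byte offsets into one buffer. */
int toy_would_bug(size_t hdr_off, size_t tail_off)
{
    static unsigned char buf[128]; /* both offsets must stay within 0..128 */
    struct toy_skb skb = { buf + hdr_off, buf + tail_off };
    return toy_checksum_help_would_bug(&skb);
}
```

With a transport header at byte 20 and a tail at byte 60 the check passes; if the header pointer sits at or beyond the tail, as an unset or stale header pointer can, the condition is true and the real kernel would hit BUG().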
Reproducible: Always

Steps to Reproduce:
1. Compile starfire into kernel
2. use NFS to output through the starfire interface on a post-2.6.12 x86_64 kernel

Actual Results:
Kernel panics :)

Expected Results:
Kernel works ;^P

It's actually worse with pre-empt turned off :(
Created attachment 69632
Patch from Herbert Xu fixing the panics from starfire module

To: Hendrik Visage <hvjunk@gmail.com>
Cc: Andrew Morton <akpm@osdl.org>, linux-net@vger.kernel.org, linux-kernel@vger.kernel.org, ionut@badula.org, Jeff Garzik <jgarzik@pobox.com>, netdev@vger.kernel.org
Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
Message-ID: <20050930223915.GA17562@gondor.apana.org.au>
References: <d93f04c70509292036x269df799y7b51c5be9c3356d6@mail.gmail.com>
 <20050929211649.69eaddee.akpm@osdl.org>
 <d93f04c70509300901s3836b8afw4792d16c589b4fc4@mail.gmail.com>
 <20050930104046.4685e975.akpm@osdl.org>
 <d93f04c70509301310y4bde1189wbcaef40124af6766@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="fdj2RfSjLxBAspz7"
Content-Disposition: inline
In-Reply-To: <d93f04c70509301310y4bde1189wbcaef40124af6766@mail.gmail.com>
User-Agent: Mutt/1.5.9i
From: Herbert Xu <herbert@gondor.apana.org.au>

--fdj2RfSjLxBAspz7
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Sep 30, 2005 at 08:10:59PM +0000, Hendrik Visage wrote:
>
> Anycase, here is a non-PREEMPT traceback. What makes this one
> interesting, is that
> in the preempt case, I had to push the NFS output to get the panic, but the
> non-preempt case attached, sorta just happened, ie. when the clients
> just checked on the server's status :(

You must never call skb_checksum_help unless the packet is meant to be
checksummed by the hardware. So starfire is the guilty party here.

This patch makes it do the check and also check for errors from
skb_checksum_help.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
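The rule Herbert describes can be sketched roughly as below. This is a toy userspace model and not the attached patch: every `toy_`/`TOY_` name is hypothetical, `TOY_CHECKSUM_HW` merely mirrors the "meant to be checksummed by the hardware" marking, and the helper returns an error where the kernel function of that era would hit BUG().

```c
#include <stddef.h>

/* Hypothetical checksum states, loosely modelled on skb->ip_summed. */
enum toy_ip_summed { TOY_CHECKSUM_NONE = 0, TOY_CHECKSUM_HW = 1 };

struct toy_pkt {
    enum toy_ip_summed ip_summed;
    int csum_len; /* bytes between transport header and tail */
};

/* Toy stand-in for skb_checksum_help(): returns -1 on a malformed packet. */
int toy_checksum_help(struct toy_pkt *pkt)
{
    if (pkt->csum_len <= 0)
        return -1;
    /* In this toy model the checksum is now considered done in software. */
    pkt->ip_summed = TOY_CHECKSUM_NONE;
    return 0;
}

/* Toy transmit path: returns 0 if the packet may be queued, -1 if dropped. */
int toy_start_tx(struct toy_pkt *pkt)
{
    /* Only packets marked for hardware checksumming go through the helper. */
    if (pkt->ip_summed == TOY_CHECKSUM_HW) {
        if (toy_checksum_help(pkt) != 0)
            return -1; /* propagate the error instead of ignoring it */
    }
    return 0;
}

/* Convenience wrapper so the behaviour can be probed with plain integers. */
int toy_tx_result(int ip_summed, int csum_len)
{
    struct toy_pkt pkt = { (enum toy_ip_summed)ip_summed, csum_len };
    return toy_start_tx(&pkt);
}
```

In this model a packet that was never marked for hardware checksumming bypasses the helper entirely, which corresponds to the case that was tripping the BUG() in the traces above.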
Here is the patch which was merged into Linus' tree: http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=67974231d4354fe26aaa39a3153b5c0945b94858;hp=32fa2bfcf882f8901ca206e33b0d8975cc8e89a2 Any chance you could test it instead of the one you posted and confirm it solves the problem?
I've been testing those versions successfully, but would like the fix in 2.6.13, please, while we wait for 2.6.14.
Yep - the only reason I asked is that we try to stick to backporting patches from Linus' tree only.
Fixed in gentoo-sources-2.6.13-r3 (genpatches-2.6.13-6)