Bug 273749 - sys-kernel/gentoo-sources-2.6.29-r5: sky2 hw csum failure causing crash
Status: RESOLVED TEST-REQUEST
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
 
Reported: 2009-06-11 17:05 UTC by Martin von Gagern
Modified: 2009-08-12 16:42 UTC

Attachments
kernel log (csum1.log, 119.33 KB, text/plain)
2009-06-11 17:05 UTC, Martin von Gagern
lspci -vvv of eth0 (csum1.lspci, 2.62 KB, text/plain)
2009-06-11 17:06 UTC, Martin von Gagern
emerge --info (emerge --info, 5.19 KB, text/plain)
2009-06-17 19:13 UTC, Martin von Gagern
kernel log - take 2 (csum2.log, 3.80 KB, text/plain)
2009-07-06 16:34 UTC, Martin von Gagern

Description Martin von Gagern 2009-06-11 17:05:11 UTC
I just had to reset my machine: after I had been away for a while, the monitor would not wake up when I pressed a key. The kernel log shows this as the first sign of trouble:

sky2 eth0: rx error, status 0x69f069f0 length 0
eth0: hw csum failure.
Pid: 4146, comm: hadcm3transum_5 Tainted: P           2.6.29-gentoo-r5 #1
Call Trace:
 [<c02f9f3d>] __skb_checksum_complete_head+0x58/0x5e
 [<f8554568>] tcp_error+0xac/0x24c [nf_conntrack]
 [<fa417591>] ipt_do_table+0x1e8/0x50b [ip_tables]
 [<c037215b>] _read_lock_bh+0x8/0x22
 [<f85514d3>] __nf_conntrack_find+0xd9/0xdf [nf_conntrack]
 [<f85544bc>] tcp_error+0x0/0x24c [nf_conntrack]
 [<f855162d>] nf_conntrack_in+0xd3/0x4f2 [nf_conntrack]
 [<c03315a9>] tcp_rcv_established+0x2f0/0x5b5
 [<c0337a50>] tcp_v4_do_rcv+0x94/0x191
 [<c0317202>] genl_register_mc_group+0xc7/0x115
 [<c03172bb>] nf_iterate+0x6b/0x7e
 [<c031d1e0>] ip_rcv_finish+0x0/0x311
 [<c0317467>] nf_hook_slow+0xaa/0xed
 [<c031d1e0>] ip_rcv_finish+0x0/0x311
 [<c031d6f0>] ip_rcv+0x1ff/0x248
 [<c031d1e0>] ip_rcv_finish+0x0/0x311
 [<c030058c>] netif_receive_skb+0x29a/0x538
 [<f8106218>] sky2_poll+0x407/0xd15 [sky2]
 [<c02fe5fb>] net_rx_action+0xef/0x1a8
 [<c01291a0>] __do_softirq+0x89/0x120
 [<f82aba8d>] sym53c8xx_intr+0x3f/0x64 [sym53c8xx]
 [<c0113ccd>] ack_apic_level+0x73/0x26b
 [<c012926e>] do_softirq+0x37/0x3b
 [<c012945c>] irq_exit+0x42/0x44
 [<c01052a8>] do_IRQ+0x48/0x90
 [<c012945c>] irq_exit+0x42/0x44
 [<c011214e>] smp_apic_timer_interrupt+0x5c/0x87
 [<c0103827>] common_interrupt+0x27/0x2c

After that, the hw csum failures are repeated many times; see the full log, which I'll attach. This looks like a problem that other people have reported elsewhere against older kernel releases such as 2.6.22, where only the network became unusable, and 2.6.23, where lockups like mine were observed:
http://thread.gmane.org/gmane.linux.network/69593
https://bugs.launchpad.net/linux/+bug/138611
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320808
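
To get a rough idea of how often the failures repeat, the attached log can be tallied with a small script. The following is only a sketch: the two patterns are modelled on the message shapes quoted above ("sky2 eth0: rx error, status ... length ..." and "eth0: hw csum failure."), and the function name `summarize` is made up for illustration. Other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Assumption: both patterns mirror the two log lines quoted in this report.
RX_ERR_RE = re.compile(r"sky2 (\w+): rx error, status (0x[0-9a-fA-F]+) length (\d+)")
CSUM_RE = re.compile(r"(\w+): hw csum failure")


def summarize(lines):
    """Count hw csum failures per interface and rx errors per (interface, status)."""
    csum_failures = Counter()
    rx_errors = Counter()
    for line in lines:
        match = RX_ERR_RE.search(line)
        if match:
            # Key on interface name and status word so distinct statuses stay apart.
            rx_errors[(match.group(1), match.group(2))] += 1
            continue
        match = CSUM_RE.search(line)
        if match:
            csum_failures[match.group(1)] += 1
    return csum_failures, rx_errors
```

Calling `summarize(open("csum1.log", errors="replace"))` on the attached log would return the two counters; lines carrying a syslog timestamp prefix should still match, because the patterns are searched anywhere in the line.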

In all these cases the reporters seem to have been able to reproduce the issue regularly, whereas I've been using this kernel version for a while and hit the problem for the first time today. So I can't simply test different versions, and I don't expect much help here if this remains the only incident. I'd like to report it anyway, so that the issue becomes more visible if others experience the same thing.

Some more information: this is an ASUS P5GDC-V Deluxe mainboard with the following eth0 NIC, according to lspci -vv:

Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
        Ethernet Controller (rev 15)
Subsystem: ASUSTeK Computer Inc. Marvell 88E8053 Gigabit Ethernet controller
        PCIe (Asus)
Flags: bus master, fast devsel, latency 0, IRQ 17
Memory at cbefc000 (64-bit, non-prefetchable) [size=16K]
I/O ports at c800 [size=256]
Expansion ROM at cbec0000 [disabled] [size=128K]
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [5c] MSI: Mask- 64bit+ Count=1/2 Enable-
Capabilities: [e0] Express Legacy Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Kernel driver in use: sky2
Kernel modules: sky2

I'll attach a file with the lspci -vvv output just in case. The running commands hadcm3transum_5 and astropulse_5.03 are BOINC applications; Firefox and X were running in my KDE session. The kernel is tainted by the nvidia driver modules.

# uname -a
Linux server 2.6.29-gentoo-r5 #1 SMP PREEMPT Thu Jun 4 09:37:06 CEST 2009 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux
Comment 1 Martin von Gagern 2009-06-11 17:05:42 UTC
Created attachment 194264
kernel log
Comment 2 Martin von Gagern 2009-06-11 17:06:09 UTC
Created attachment 194265
lspci -vvv of eth0
Comment 3 Lars Wendler (Polynomial-C) (RETIRED) gentoo-dev 2009-06-17 18:09:35 UTC
Please post your "emerge --info".
Comment 4 Martin von Gagern 2009-06-17 19:13:31 UTC
Created attachment 195017
emerge --info

(In reply to comment #3)
> Please post your "emerge --info".

Sorry, I forgot that.
Comment 5 Mike Pagano gentoo-dev 2009-07-06 15:53:33 UTC
Has this error recurred? Have you tested with gentoo-sources-2.6.30-r2?
Comment 6 Martin von Gagern 2009-07-06 16:34:00 UTC
Created attachment 196919
kernel log - take 2

(In reply to comment #5)
> Has this error recurred?

Yes, I was about to write a follow-up. It happened once more, on 2.6.30-r1. I'm attaching another kernel log.

It was slightly different this time around. For one thing, there is only a single hw csum failure, triggered by UDP this time. It seems to lead more or less directly to the line mentioning a paging request and declaring it a BUG.

I was in front of my machine this time, so I'll try to recapitulate what happened. I had been away for a while, and was then using gqview in fullscreen mode on one desktop while postfix/procmail/spamassassin was busy working through the mail backlog in the background. I had a "watch 'postqueue -p | tail -n1'" command running in a shell on another desktop, showing me the number of messages left in the queue. At some point I noticed that this command produced no more output. I canceled watch and ran postqueue manually. It hung and didn't respond to Ctrl+C either. I ran pstree in a different shell and found a branch leading from some postfix command through two instances of procmail to spamassassin, each process the only child of the one before it. I killed the child procmail instance, using SIGTERM iirc. postqueue still didn't respond.

At some point during all that, I don't recall exactly when, I noticed the system becoming unresponsive in other areas. The gqview fullscreen window no longer responded or repainted, leaving me with one completely black desktop. Desktop switching still worked, however. As two seemingly unrelated applications (gqview and postqueue) now showed problems, I decided to reboot my system. Some unclosable Firefox windows may have been involved as well, but I'm not sure; I might be mixing things up. In any case, I closed as many windows as I could. I think the rxvt-unicode shell windows didn't close properly either. I told my KDE session to shut down the computer. It terminated the task bar, then hung. I couldn't switch to a text console, and there was seemingly no response to the magic SysRq key. Reset was my only remaining option.

> Have you tested with gentoo-sources-2.6.30-r2?

No, I just installed that version. Will be testing from tomorrow onwards...
Comment 7 Stratos Psomadakis (RETIRED) gentoo-dev 2009-07-29 10:38:49 UTC
This patch seems related, although it refers to PowerPC:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b9389796fa4c87fbdff33816e317cdae5f36dd0b

Either patch a 2.6.30 kernel manually or use a 2.6.31-rc2+ kernel, and see whether the patch fixes this bug.

It will probably be backported to 2.6.27-stable:
http://lkml.org/lkml/2009/7/28/475
Comment 8 Stratos Psomadakis (RETIRED) gentoo-dev 2009-08-12 16:42:30 UTC
Feel free to reopen the bug after you've tested the patch or a newer kernel.