Kernel module skge segfaults when I run iperf or transfer files over Samba.
Reproducible: Always
Steps to Reproduce:
1. Run iperf -c, or transfer files from said box to any other box via Samba or FTP.
Actual Results: Segfault
Expected Results: iperf prints benchmark info.
dmesg, emerge --info, .config and lspci -vv to follow.
Created attachment 116190 [details] emerge --info
Created attachment 116192 [details] dmesg
Created attachment 116193 [details] .config
Created attachment 116194 [details] lspci -vv
Please define what you mean by "segfault", i.e. show the full error message. Is the dmesg you posted from before or after one of these segfaults?
(In reply to comment #5) Daniel, I have limited experience with capturing segfaults to a log. Is there some quick and easy way to do this that I am unaware of? I checked all the log files in /var and none of them (critical kernel, etc.) had anything remotely resembling what I saw on pty0. The box in question is headless, and I had to drag a monitor back there to find out that it was skge. Sorry about the trouble.
I don't think what you're seeing is a segfault, so you should at least try to describe it, or type out the top few lines. Segfault errors look like: Segmentation Fault and that's it... see http://en.wikipedia.org/wiki/Segmentation_fault for an example.
(In reply to comment #7) OK, it prints something like "oops (#1) Linked modules: skge" and goes on to print a call trace/stack dump. I can't get a prompt back or switch to a different terminal, and I have to do a hard reboot. That is perhaps why nothing gets logged; I assume it sits in a cache somewhere that gets toasted before it's written. Anyway, the first thing I did when this happened was boot a livecd and run memtest. It went through 50-something passes with no errors over a period of 24 hours or so.
OK. You're seeing a kernel oops, not a segfault. The oops message lists all the modules loaded into the system, but that doesn't imply that one of those modules is at fault. So the crash you are seeing is probably totally unrelated to skge. To diagnose this further, I need you either to capture the crash message using a serial console, or to take a photo of the oops message with a digital camera.
eth0: hw csum failure.
 [<c0103b7a>] dump_trace+0x19a/0x1c0
 [<c0103bb8>] show_trace_log_lvl+0x18/0x30
 [<c01042cf>] show_trace+0xf/0x20
 [<c0104325>] dump_stack+0x15/0x20
 [<c0267dd9>] __skb_checksum_complete+0x59/0x60
 [<c029a217>] tcp_v4_rcv+0x467/0x7d0
 [<c027f542>] ip_local_deliver+0xd2/0x220
 [<c027f21c>] ip_rcv+0x27c/0x4d0
 [<c026ac93>] netif_receive_skb+0x163/0x1f0
 [<f890998b>] skge_poll+0x3ab/0x570 [skge]
 [<c026c56c>] net_rx_action+0x5c/0xf0
 [<c011a1d2>] __do_softirq+0x42/0x90
 [<c011a247>] do_softirq+0x27/0x30
 [<c01053da>] do_IRQ+0x6a/0xd0
 [<c0103596>] common_interrupt+0x1a/0x20
 [<c01019a1>] default_idle+0x31/0x60
 [<c0101a0d>] cpu_idle+0x3d/0x60
 [<c03b4705>] start_kernel+0x275/0x300
 =======================
skge 0000:01:04.0: PCI error cmd=0x7 status=0x22b0
skge unable to clear error (so ignoring them)
NETDEV WATCHDOG: eth0: transmit timed out
NETDEV WATCHDOG: eth0: transmit timed out
NETDEV WATCHDOG: eth0: transmit timed out
NETDEV WATCHDOG: eth0: transmit timed out
BUG: unable to handle kernel paging request at virtual address 0a0801f5
 printing eip:
c017c281
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: skge
CPU: 0
EIP: 0060:[<c017c281>] Not tainted VLI
EFLAGS: 00010202 (2.6.19-gentoo-r5 #6)
EIP is at dnotify_parent+0x11/0x50
eax: f5cafd50 ebx: f5caff40 ecx: 0a080101 edx: 00000002
esi: 00000002 edi: f605c520 ebp: 00000058 esp: f7155f78
ds: 007b es: 007b ss: 0068
Process metalog (pid: 2399, ti=f7154000 task=c1bbb030 task.ti=f7154000)
Stack: f5cafd50 c0150db6 f7155fa4 25b38fbf c0197520 c1ba8960 fffffff7 b7dfc000
       f7154000 c01513b1 f7155fa4 000010fa 00000000 00000000 0000000b 00000058
       c0102ba5 0000000b b7dfc000 00000058 00000058 b7dfc000 bfc2dad8 00000004
Call Trace:
 [<c0150db6>] vfs_write+0xd6/0x160
 [<c01513b1>] sys_write+0x41/0x70
 [<c0102ba5>] sysenter_past_esp+0x56/0x79
 [<b7f45410>] 0xb7f45410
 =======================
Code: a0 66 37 c0 e8 51 0d fd ff eb a5 8b 04 24 83 c4 04 5b 5e 5f 5d e9 50 ff ff ff 8b 0d 9c 66 37 c0 53 85 c9 74 0e 8b 58 14 8b 4b 08 <85> 91 f4 00 00 00 75 07 5b c3 90 8d 74 26 00 8b 03 85 c0 74 11
EIP: [<c017c281>] dnotify_parent+0x11/0x50 SS:ESP 0068:f7155f78
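For anyone reading the "PCI error cmd=0x7 status=0x22b0" line in the log above, here is a minimal sketch that decodes the status value. It assumes the number is the raw 16-bit PCI configuration-space status register as printed by the driver. The helper name decode_pci_status is made up for illustration, and the bit names follow the standard PCI status register layout, not anything taken from this report.

```python
# Hypothetical helper: decode a PCI status register word into readable flags.
# Assumption: "status=0x22b0" in the log is the raw PCI status register.
PCI_STATUS_FLAGS = {
    0x0010: "capabilities list",
    0x0020: "66 MHz capable",
    0x0080: "fast back-to-back capable",
    0x0100: "master data parity error",
    0x0800: "signaled target abort",
    0x1000: "received target abort",
    0x2000: "received master abort",
    0x4000: "signaled system error",
    0x8000: "detected parity error",
}
DEVSEL_MASK = 0x0600  # bits 9-10 encode DEVSEL timing
DEVSEL_NAMES = {0: "fast", 1: "medium", 2: "slow"}


def decode_pci_status(status):
    """Return (list of set flag names, DEVSEL timing name)."""
    flags = [name for bit, name in sorted(PCI_STATUS_FLAGS.items()) if status & bit]
    devsel = DEVSEL_NAMES.get((status & DEVSEL_MASK) >> 9, "reserved")
    return flags, devsel


if __name__ == "__main__":
    flags, devsel = decode_pci_status(0x22B0)
    print("flags:", ", ".join(flags))
    print("DEVSEL timing:", devsel)
```

Under that assumption, the only error bit set in 0x22b0 is "received master abort"; the remaining bits are capability and timing fields. That would be consistent with a bus-level problem rather than a purely software one, but it does not prove it.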
Disabling CONFIG_DNOTIFY in your kernel config will probably work around this for now. I'll comment further when I have time to investigate. If you have time, it would be really useful if you could test the latest gentoo-sources (currently 2.6.20-r6) and the latest development kernel (currently 2.6.21-rc7).
Same behavior without dnotify and inotify on 2.6.19-r5. Will try alternate kernels and report back.
2.6.20-r6: same thing. Note that the "EIP is at" lines are not consistent; most of the time they say skge, sometimes they say cache_alloc_refill. Anyway, I think at this point it's somewhat clear that this is related to some type of hardware issue, which is unknown to me. The box now segfaults (a real segfault) on shutdown.sh. I find it very unlikely that I am somehow special and the kernels just tank for me. I appreciate your commitment to Gentoo, and I don't want to waste your valuable time.
Be sure to test the latest 2.6.21-rc kernel first. There is a fix included which addresses skge handling of TX timeouts.