174519 – network-related(?) dnotify oops

Status:	RESOLVED INVALID

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	x86 Linux

Importance:	High critical
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2007-04-13 20:45 UTC by Phillip R. Miller
Modified:	2007-04-18 03:29 UTC
CC List:	0 users

Attachments
emerge --info (emergeinfo.txt, 2.57 KB, text/plain) 2007-04-13 20:46 UTC, Phillip R. Miller
dmesg (dmesg.txt, 10.13 KB, text/plain) 2007-04-13 20:47 UTC, Phillip R. Miller
.config (kernelconfig.txt, 24.26 KB, text/plain) 2007-04-13 20:47 UTC, Phillip R. Miller
lspci -vv (lspci-vv.txt, 9.67 KB, text/plain) 2007-04-13 20:48 UTC, Phillip R. Miller

Description Phillip R. Miller 2007-04-13 20:45:36 UTC

The skge kernel module segfaults when I run iperf or access Samba.

Reproducible: Always

Steps to Reproduce:
1. Run iperf -c, or transfer files from this box to any other box via Samba or FTP.

Actual Results:  
Segfault

Expected Results:  
iperf to print benchmark info

dmesg, emerge --info, .config and lspci -vv to follow

Comment 1 Phillip R. Miller 2007-04-13 20:46:58 UTC

Created attachment 116190
emerge --info

Comment 2 Phillip R. Miller 2007-04-13 20:47:18 UTC

Created attachment 116192
dmesg

Comment 3 Phillip R. Miller 2007-04-13 20:47:42 UTC

Created attachment 116193
.config

Comment 4 Phillip R. Miller 2007-04-13 20:48:00 UTC

Created attachment 116194
lspci -vv

Comment 5 Daniel Drake (RETIRED) gentoo-dev 2007-04-14 13:00:41 UTC

Please define what you mean by "segfault", i.e. show the full error message.

Is the dmesg you posted from before or after one of these segfaults?

Comment 6 Phillip R. Miller 2007-04-14 18:42:30 UTC

(In reply to comment #5)

Daniel,

I have limited experience with capturing segfaults to a log. Is there some quick and easy way to do this that I am unaware of? I checked all the log files in /var, and none of them (critical kernel, etc.) had anything remotely resembling what I saw on pty0. The box in question is headless, so I had to drag a monitor back there to find out that it was skge. Sorry about the trouble.

Comment 7 Daniel Drake (RETIRED) gentoo-dev 2007-04-15 00:25:52 UTC

I don't think what you're seeing is a segfault, so I think you should at least try to describe it, or maybe type out the top few lines.

Segfault errors look like:

  Segmentation Fault

and that's it... see http://en.wikipedia.org/wiki/Segmentation_fault for an example

Comment 8 Phillip R. Miller 2007-04-16 01:23:23 UTC

(In reply to comment #7)

OK, it prints something like:

oops (#1)
Linked modules: skge

It goes on to print a call trace/stack dump.

I can't get a prompt back or switch to a different terminal, and I have to do a hard reboot. That is perhaps why nothing gets logged: I assume the message sits in a cache somewhere that gets toasted before it is written.

Anyway, the first thing I did when this happened was boot a livecd and run memtest. It went through 50-something passes with no errors over a period of 24 hours or so.

Comment 9 Daniel Drake (RETIRED) gentoo-dev 2007-04-16 11:03:29 UTC

OK. You're seeing a kernel oops, not a segfault. The oops message lists all the modules loaded into the system, but it doesn't imply that one of those modules is at fault. So, the crash you are seeing is probably totally unrelated to skge.

To diagnose this further, I need you to either capture the crash message using a serial console, or use a digital camera to take a photo of the oops message.

Comment 10 Phillip R. Miller 2007-04-16 17:52:53 UTC

eth0: hw csum failure.
 [<c0103b7a>] dump_trace+0x19a/0x1c0
 [<c0103bb8>] show_trace_log_lvl+0x18/0x30
 [<c01042cf>] show_trace+0xf/0x20
 [<c0104325>] dump_stack+0x15/0x20
 [<c0267dd9>] __skb_checksum_complete+0x59/0x60
 [<c029a217>] tcp_v4_rcv+0x467/0x7d0
 [<c027f542>] ip_local_deliver+0xd2/0x220
 [<c027f21c>] ip_rcv+0x27c/0x4d0
 [<c026ac93>] netif_receive_skb+0x163/0x1f0
 [<f890998b>] skge_poll+0x3ab/0x570 [skge]
 [<c026c56c>] net_rx_action+0x5c/0xf0
 [<c011a1d2>] __do_softirq+0x42/0x90
 [<c011a247>] do_softirq+0x27/0x30
 [<c01053da>] do_IRQ+0x6a/0xd0
 [<c0103596>] common_interrupt+0x1a/0x20
 [<c01019a1>] default_idle+0x31/0x60
 [<c0101a0d>] cpu_idle+0x3d/0x60
 [<c03b4705>] start_kernel+0x275/0x300
 =======================
skge 0000:01:04.0: PCI error cmd=0x7 status=0x22b0
skge unable to clear error (so ignoring them)
NETDEV WATCHDOG: eth0: transmit timed out
NETDEV WATCHDOG: eth0: transmit timed out
NETDEV WATCHDOG: eth0: transmit timed out
NETDEV WATCHDOG: eth0: transmit timed out
BUG: unable to handle kernel paging request at virtual address 0a0801f5
 printing eip:
c017c281
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: skge
CPU:    0
EIP:    0060:[<c017c281>]    Not tainted VLI
EFLAGS: 00010202   (2.6.19-gentoo-r5 #6)
EIP is at dnotify_parent+0x11/0x50
eax: f5cafd50   ebx: f5caff40   ecx: 0a080101   edx: 00000002
esi: 00000002   edi: f605c520   ebp: 00000058   esp: f7155f78
ds: 007b   es: 007b   ss: 0068
Process metalog (pid: 2399, ti=f7154000 task=c1bbb030 task.ti=f7154000)
Stack: f5cafd50 c0150db6 f7155fa4 25b38fbf c0197520 c1ba8960 fffffff7 b7dfc000 
       f7154000 c01513b1 f7155fa4 000010fa 00000000 00000000 0000000b 00000058 
       c0102ba5 0000000b b7dfc000 00000058 00000058 b7dfc000 bfc2dad8 00000004 
Call Trace:
 [<c0150db6>] vfs_write+0xd6/0x160
 [<c01513b1>] sys_write+0x41/0x70
 [<c0102ba5>] sysenter_past_esp+0x56/0x79
 [<b7f45410>] 0xb7f45410
 =======================
Code: a0 66 37 c0 e8 51 0d fd ff eb a5 8b 04 24 83 c4 04 5b 5e 5f 5d e9 50 ff ff ff 8b 0d 9c 66 37 c0 53 85 c9 74 0e 8b 58 14 8b 4b 08 <85> 91 f4 00 00 00 75 07 5b c3 90 8d 74 26 00 8b 03 85 c0 74 11 
EIP: [<c017c281>] dnotify_parent+0x11/0x50 SS:ESP 0068:f7155f78
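
As an illustration of the point in comment 9 (the module list alone does not say where the crash happened), here is a minimal Python sketch that pulls the faulting symbol and the loaded-module list out of an oops text. The function name `summarize_oops` and the regular expressions are assumptions made for this sketch, not part of any kernel tooling, and real oops formats vary between kernel versions. The sample lines are copied from the oops above.

```python
import re

# Sample lines copied from the oops in this report.
SAMPLE_OOPS = """\
Oops: 0000 [#1]
Modules linked in: skge
EIP is at dnotify_parent+0x11/0x50
Call Trace:
 [<c0150db6>] vfs_write+0xd6/0x160
 [<c01513b1>] sys_write+0x41/0x70
 [<c0102ba5>] sysenter_past_esp+0x56/0x79
"""

# Hypothetical patterns for this sketch only.
SYMBOL = r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
EIP_RE = re.compile(r"^EIP is at " + SYMBOL, re.M)
MODULES_RE = re.compile(r"^Modules linked in:(.*)$", re.M)
FRAME_RE = re.compile(r"^[ \t]*\[<[0-9a-f]+>\] " + SYMBOL, re.M)


def summarize_oops(text):
    """Return the faulting symbol, its module tag (if any), loaded modules, and frames."""
    eip = EIP_RE.search(text)
    modules = MODULES_RE.search(text)
    return {
        "fault_symbol": eip.group(1) if eip else None,
        # A trailing [name] tag marks a symbol that lives in a loadable module.
        "fault_module": eip.group(2) if eip else None,
        "loaded_modules": modules.group(1).split() if modules else [],
        # Each frame is (symbol, module); module is "" for untagged symbols.
        "frames": FRAME_RE.findall(text),
    }


summary = summarize_oops(SAMPLE_OOPS)
print(summary["fault_symbol"], summary["fault_module"], summary["loaded_modules"])
```

For this oops the faulting symbol, dnotify_parent, carries no `[skge]` tag even though skge is the only module listed. By contrast, the skge_poll frame in the earlier "hw csum failure" trace is tagged `[skge]`. This only shows where the fault was detected; it does not by itself identify the root cause.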

Comment 11 Daniel Drake (RETIRED) gentoo-dev 2007-04-17 11:27:53 UTC

Disabling CONFIG_DNOTIFY in your kernel config will probably work around this for now. I'll comment further when I have time to investigate.
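
To confirm that an option such as CONFIG_DNOTIFY really is off in the configuration a kernel was built from, something like the following Python sketch can be used. The helper name `kconfig_state` and the sample fragment are hypothetical; the fragment only imitates the usual .config style and is not taken from the attached kernelconfig.txt.

```python
import re


def kconfig_state(config_text, option):
    """Return the option's value ('y', 'm', or a literal), or None if unset or absent."""
    # Disabled options appear as "# CONFIG_FOO is not set", so they never match here.
    match = re.search(r"^%s=(.*)$" % re.escape(option), config_text, re.M)
    return match.group(1) if match else None


# Hypothetical fragment in the usual .config style.
SAMPLE_CONFIG = """\
CONFIG_INOTIFY=y
# CONFIG_DNOTIFY is not set
CONFIG_SKGE=m
"""

for name in ("CONFIG_DNOTIFY", "CONFIG_INOTIFY", "CONFIG_SKGE"):
    print(name, kconfig_state(SAMPLE_CONFIG, name))
```

To check a real tree, pass the contents of the kernel source's `.config` file in place of the sample text.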

If you have time, it would be really useful if you could test the latest gentoo-sources (currently 2.6.20-r6) and the latest development kernel (currently 2.6.21-rc7).

Comment 12 Phillip R. Miller 2007-04-17 23:45:48 UTC

Same behavior without dnotify and inotify on 2.6.19-r5. I will try alternate kernels and report back.

Comment 13 Phillip R. Miller 2007-04-18 00:36:21 UTC

2.6.20-r6: same thing.

Note that the "EIP is at" lines are not consistent: most of the time they say skge, sometimes they say cache_alloc_refill.

Anyway, I think at this point it's fairly clear that this is related to some type of hardware issue, which is unknown to me. The box now segfaults (a real segfault) on shutdown.sh. I find it very unlikely that I am somehow special and the kernels just tank for me.

I appreciate your commitment to gentoo, and I don't want to waste your valuable time.

Comment 14 Daniel Drake (RETIRED) gentoo-dev 2007-04-18 03:29:46 UTC

Be sure to test the latest 2.6.21-rc kernel first. There is a fix included which addresses skge handling of TX timeouts.