Summary: | =sys-kernel/gentoo-sources-{3.7.5,3.8.3} - skb_over_panic by "java" process, invalid opcode 0000 in RIP skb_put called by tcp_sendmsg when sysctl net.ipv4.tcp_mtu_probing!=0; different call stacks, same head and bottom. | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | El Goretto <el.goretto> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED NEEDINFO | ||
Severity: | normal | ||
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | AMD64 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | vanilla 3.7.5 kernel config file; Kernel error messages collection (newest at the end of file) | ||
Description
El Goretto
2013-02-18 12:56:11 UTC
Created attachment 339242: vanilla 3.7.5 kernel config file
Created attachment 339244: Kernel error messages collection (newest at the end of file)
Can you try a more recent kernel like 3.8.2, a long term support kernel like 3.4.34 and / or a development kernel like 3.9_rc1? This would point out whether the issue is still present in current kernels. I assume this to be a broken network driver that makes skb crazy; though, also interesting is that it is writing through VFS (virtual file system) so I assume it is trying to write something over the network...

(In reply to comment #3)
> Can you try a more recent kernel like 3.8.2, a long term support kernel like
> 3.4.34 and / or a development kernel like 3.9_rc1? This would point out
> whether the issue is still present in current kernels. I assume this to be a
> broken network driver that makes skb crazy; though, also interesting is that
> it is writing through VFS (virtual file system) so I assume it is trying to
> write something over the network...

Ok, I tried a 3.9rc2 git-sources kernel but it didn't compile, so... I tried a 3.4.36 LTS kernel: same bug. I'll soon try a 3.8.3 hardened-sources kernel (as it's a vanilla-sources bug, it won't matter), but I'm confident it will still trigger it.
The good point is that the process that was killed this time is Tor, another heavy networking application (I2P was first impacted):

[19808.653912] skb_over_panic: text:ffffffff814de8fa len:1568 put:1054 head:ffff880031242000 data:ffff8800312420e8 tail:0x708 end:0x6c0 dev:<NULL>
[19808.656089] ------------[ cut here ]------------
[19808.657085] kernel BUG at net/core/skbuff.c:127!
[19808.657088] invalid opcode: 0000 [#1] SMP
[19808.657091] CPU 0
[19808.657096] Pid: 2659, comm: tor Not tainted 3.4.36 #1 HP ProLiant MicroServer
[19808.657100] RIP: 0010:[<ffffffff81498a76>] [<ffffffff81498a76>] skb_put+0x86/0x90
[19808.657109] RSP: 0018:ffff88003f8efc78 EFLAGS: 00010292
[19808.657112] RAX: 0000000000000099 RBX: 000000000000041e RCX: 00000000000000c6
[19808.657115] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff8185cf14
[19808.657117] RBP: 000000000000102a R08: 0000000000000000 R09: 0000000000000000
[19808.657120] R10: 0000000000000000 R11: 0000000000002000 R12: ffff88003f893180
[19808.657123] R13: ffff88003111f6c0 R14: 0000000000008160 R15: 0000000000000000
[19808.657127] FS: 00007ff7dc027700(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
[19808.657130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19808.657132] CR2: 00007ff5a5b0e900 CR3: 000000004925d000 CR4: 00000000000007f0
[19808.657135] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19808.657137] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[19808.657140] Process tor (pid: 2659, threadinfo ffff88003f8ee000, task ffff88004933ce00)
[19808.657142] Stack:
[19808.657144] 0000000000000708 00000000000006c0 ffffffff817310ce ffff88003111f6c0
[19808.657148] 0000000000000001 ffffffff814de8fa 0000000000000000 ffffea000000024a
[19808.657151] 000000000000024a 0000000000000002 ffff88003f8efd02 00007ff7e396a396
[19808.657154] Call Trace:
[19808.657161] [<ffffffff814de8fa>] ? tcp_sendmsg+0x3ba/0xda0
[19808.657166] [<ffffffff81490223>] ? sock_aio_write+0x103/0x130
[19808.657172] [<ffffffff810b77f0>] ? do_sync_write+0xc0/0x100
[19808.657177] [<ffffffff81050fc5>] ? set_next_entity+0x35/0x70
[19808.657180] [<ffffffff810b7995>] ? vfs_write+0x165/0x180
[19808.657184] [<ffffffff810b7a87>] ? sys_write+0x47/0x90
[19808.657189] [<ffffffff81595b62>] ? system_call_fastpath+0x16/0x1b
[19808.657191] Code: 4c 24 10 8b 8f bc 00 00 00 48 89 4c 24 08 8b bf b8 00 00 00 89 f1 48 8b 74 24 28 48 89 3c 24 48 c7 c7 d0 49 73 81 e8 04 58 0f 00 <0f> 0b 0f 1f 84 00 00 00 00 00 41 55 b9 ff ff ff ff 41 54 41 89
[19808.657215] RIP [<ffffffff81498a76>] skb_put+0x86/0x90
[19808.657218] RSP <ffff88003f8efc78>
[19808.665073] ---[ end trace 2770b033a903dfd1 ]---

I forgot to mention the network card related info:

02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5723 Gigabit Ethernet PCIe (rev 10)
Subsystem: Hewlett-Packard Company NC107i Integrated PCI Express Gigabit Server Adapter
Flags: bus master, fast devsel, latency 0, IRQ 42
Memory at fe9f0000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [48] Power Management version 3
Capabilities: [40] Vital Product Data
Capabilities: [60] Vendor Specific Information: Len=6c <?>
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [cc] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [13c] Virtual Channel
Capabilities: [160] Device Serial Number 00-9c-02-ff-fe-9f-8a-ac
Capabilities: [16c] Power Budgeting <?>
Kernel driver in use: tg3

Numeric version:

02:00.0 0200: 14e4:165b (rev 10)
Subsystem: 103c:705d

I'll remove the "3.7.5" mention in the bug summary as soon as I confirm it on a 3.8 kernel too.

Same issue with a 3.8.3 kernel.

[28386.789181] skbuff: skb_over_panic: text:ffffffff8159a3f9 len:1568 put:768 head:ffff8800723e0000 data:ffff8800723e01a0 tail:0x7c0 end:0x6c0 dev:<NULL>
[28386.791595] invalid opcode: 0000 [#1] SMP
[28386.792826] CPU 1
[28386.792827] Pid: 2040, comm: java Not tainted 3.8.3-hardened #1 HP ProLiant MicroServer
[...]
[28386.792891] Call Trace:
[28386.792901] [<ffffffff8159a3f9>] ? tcp_sendmsg+0x759/0x1700
[28386.792909] [<ffffffff81065a05>] ? set_next_entity+0x35/0x80
[28386.792914] [<ffffffff8103c9ed>] ? current_fs_time+0xd/0x50
[28386.792921] [<ffffffff8105abd8>] ? __remove_hrtimer+0x58/0xc0
[28386.792925] [<ffffffff81538749>] ? sock_aio_write+0x109/0x140
[28386.792933] [<ffffffff81115d4b>] ? ep_scan_ready_list.isra.18+0x17b/0x180
[28386.792938] [<ffffffff810d356a>] ? do_sync_write+0x9a/0xe0
[28386.792943] [<ffffffff810d3759>] ? vfs_write+0x1a9/0x220
[28386.792949] [<ffffffff8165947e>] ? sysret_check+0x1c/0x58
[28386.792953] [<ffffffff810d38c0>] ? sys_write+0x50/0xa0
[28386.792958] [<ffffffff81659458>] ? system_call_fastpath+0x18/0x1d
[...]
[28386.801016] ---[ end trace db66e185ec8bacfe ]---

Have you tried to update java or use a different implementation?

I don't see anything wrong with the skb_put function nor can much be found about it upstream, since it is so common I assume the problem not to be in that function. Therefore I assume that java is calling the function in a wrong way, likely with incorrect information.

(In reply to comment #7)
> Have you tried to update java or use a different implementation?
>
> I don't see anything wrong with the skb_put function nor can much be found
> about it upstream, since it is so common I assume the problem not to be in
> that function. Therefore I assume that java is calling the function in a
> wrong way, likely with incorrect information.

As I said (in a verrrrry concise way in first message :)), I verified that the JVM wasn't the cause (sun/oracle or icedtea and different versions (1.6/1.7)).
Plus, Tor (no java at all) triggers the same issue.

Hmm, I really don't have a clue on this one especially since the call traces differ so there is some randomization in how it comes to that point, sounds like something you would need to look into with a kernel debugger but that comes more tricky with a random bug like this; could you try the most recent upstream stable kernel =sys-kernel/gentoo-sources-3.10.6 and upstream development kernel =sys-kernel/git-sources-3.11_rc5 to see if it has since been fixed?

If not, then please file this upstream at https://bugzilla.kernel.org and leave us a link behind to the upstream bug; thank you very much in advance.

(In reply to Tom Wijsman (TomWij) from comment #9)
> Hmm, I really don't have a clue on this one especially since the call traces
> differ so there is some randomization in how it comes to that point, sounds
> like something you would need to look into with a kernel debugger but that
> comes more tricky with a random bug like this; could you try the most recent
> upstream stable kernel =sys-kernel/gentoo-sources-3.10.6 and upstream
> development kernel =sys-kernel/git-sources-3.11_rc5 to see if it has since
> been fixed?
>
> If not, then please file this upstream at https://bugzilla.kernel.org and
> leave us a link behind to the upstream bug; thank you very much in advance.

As this doesn't occur too often and it is kind of hard to pinpoint, and as there is no further information, I am closing this for now until it becomes more apparent. When it does happen again, consider filing a bug upstream and letting us know about that. Good luck and thank you in advance.

(In reply to Tom Wijsman (TomWij) from comment #10)
> As this doesn't occur too often and it is kind of hard to pinpoint, and as
> there is no further information, I am closing this for now until it becomes
> more apparent. When it does happen again, consider filing a bug upstream and
> letting us know about that. Good luck and thank you in advance.

I do not intend to reopen this bug, only to write down some of the latest thoughts I had while reading this: http://staff.psc.edu/mathis/MTU/
Especially about the "Opportunistic jumbo MTU discovery", which I think in fact describes the mechanism behind net.ipv4.tcp_mtu_probing.

I can't prove it because the machine was decommissioned, but here is what I think:
- this broadcom chip (tg3 driver) doesn't support MTU beyond 1500. I should have seen that! I only saw that when getting the machine offline...
- net.ipv4.tcp_mtu_probing=1 makes the problem hard to reproduce reliably. Of course! I should have read the documentation carefully: "1 - Disabled by default, enabled when an ICMP black hole detected". Back then, I should have set it to 2 ("2 - Always enabled, use initial MSS of tcp_base_mss").

So, my guess is that when an ICMP black hole was detected (only once in a while, making it so hard to reproduce reliably), "TCP Packetization-Layer Path MTU Discovery" was triggered, but maybe the mechanism didn't handle the case where the NIC doesn't support the MTU it is supposed to test.

Well, that's not much, but it's only to write it down somewhere.
Anyway, thank you very much for your help Tom, back then.
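
For anyone reading the traces in this report later: skb_put() calls skb_over_panic() when the new tail passes the end of the buffer, so the tail: and end: fields show how far the write would have overrun. Below is a minimal Python sketch that decodes the two panic lines quoted in this report. The helper name `decode_panic` and the regular expression are illustrative assumptions of mine, not kernel tooling, and the sketch assumes tail: and end: are offsets from head:, which is how they appear to be printed on this 64-bit kernel.

```python
import re

# The two skb_over_panic lines quoted in this report (3.4.36 and 3.8.3).
PANIC_LINES = [
    "skb_over_panic: text:ffffffff814de8fa len:1568 put:1054 "
    "head:ffff880031242000 data:ffff8800312420e8 tail:0x708 end:0x6c0 dev:<NULL>",
    "skbuff: skb_over_panic: text:ffffffff8159a3f9 len:1568 put:768 "
    "head:ffff8800723e0000 data:ffff8800723e01a0 tail:0x7c0 end:0x6c0 dev:<NULL>",
]

# Hypothetical pattern for this message layout; other kernels may print it differently.
FIELDS = re.compile(
    r"len:(\d+) put:(\d+) head:([0-9a-f]+) data:([0-9a-f]+) "
    r"tail:0x([0-9a-f]+) end:0x([0-9a-f]+)"
)

def decode_panic(line):
    """Return the numeric fields of one skb_over_panic line plus derived values."""
    m = FIELDS.search(line)
    if m is None:
        raise ValueError("not a skb_over_panic line")
    length, put = int(m.group(1)), int(m.group(2))
    head, data = int(m.group(3), 16), int(m.group(4), 16)
    tail, end = int(m.group(5), 16), int(m.group(6), 16)
    return {
        "len": length,            # skb->len after the attempted put
        "put": put,               # bytes skb_put() was asked to append
        "headroom": data - head,  # offset of the data pointer from head
        "tail": tail,
        "end": end,
        "over": tail - end,       # bytes past the end of the linear buffer
    }

for line in PANIC_LINES:
    print(decode_panic(line))
```

In both traces end is 0x6c0 (1728) and len is 1568; the attempted writes would have run 72 and 256 bytes past the end of the buffer respectively. In both cases tail minus headroom equals len, which is consistent with reading the fields as offsets.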
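
To check which tcp_mtu_probing mode a machine is actually running with, the value can be read from /proc. This is a small sketch, assuming a Linux system that exposes /proc/sys/net/ipv4/tcp_mtu_probing; the function names are made up for illustration, and the descriptions for 0, 1 and 2 are the ones from the kernel's ip-sysctl documentation.

```python
from pathlib import Path

# Mode descriptions as worded in the kernel's ip-sysctl documentation.
MODES = {
    0: "Disabled",
    1: "Disabled by default, enabled when an ICMP black hole detected",
    2: "Always enabled, use initial MSS of tcp_base_mss",
}

SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_mtu_probing")

def describe_mtu_probing(value):
    """Map a tcp_mtu_probing value to its documented meaning."""
    return MODES.get(value, "unknown value")

def current_mtu_probing(path=SYSCTL_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

value = current_mtu_probing()
if value is None:
    print("tcp_mtu_probing is not readable on this system")
else:
    print(f"net.ipv4.tcp_mtu_probing = {value}: {describe_mtu_probing(value)}")
```

With mode 1 the probing only starts after a black hole is detected, which would fit a crash that shows up only once in a while; mode 2 would be the one to use when trying to reproduce it on purpose.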