I get crashes like this one from time to time, sometimes every few hours, sometimes once a month. Crash log copied manually from VGA, please excuse the missing top and possible typos:

…
[<ffffffff8141fc8e>] ? ip_reply_glue_bits+0x5a/0x5a
[<ffffffff8148be92>] udp_v6_push_pending_frames+0x29a/0x312
[<ffffffff8148ce56>] udpv6_sendmsg+0x696/0x84d
[<ffffffff810b2450>] ? check_preempt_curr+0x2b/0x66
[<ffffffff810b249d>] ? ttwu_do_wakeup+0x12/0x7f
[<ffffffff81448dd0>] inet_sendmsg+0x58/0x91
[<ffffffff810d5dba>] ? wake_futex+0x57/0x6c
[<ffffffff813d9d19>] sock_sendmsg+0x69/0x7a
[<ffffffff814c4dc5>] ? __schedule+0x56f/0x784
[<ffffffff81157ba1>] ? __fget_light+0x46/0x5b
[<ffffffff81157bc4>] ? __fdget+0xe/0x10
[<ffffffff813da268>] ? sockfd_lookup_light+0x12/0x5b
[<ffffffff813db9e9>] SyS_sendto+0x109/0x13a
[<ffffffff874c7e41>] ? sysret_check+0x19/0x5b
[<ffffffff814c7e1e>] system_call_fastpath+0x16/0x1b
Code: 00 00 58 89 44 24 10 8b 87 bc 00 00 00 48 89 44 24 08 48 8b 87 d0 00 00 00 48 c7 c7 41 89 64 81 48 89 04 24 31 c0 e8 8f bb ff ff <0f> 0b 55 47 78 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 28 4c
RIP [<ffffffff814c3e53>] skb_panic+0x5e/0x60
RSP <ffff8800503b98e8>
---[ end trace 1525a4a3242c7adb ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

I've been getting these crashes since kernel 3.12.x, first on a Dell PowerEdge R210 with a tg3-supported NIC (sorry, the machine no longer exists, so I don't know the exact NIC model) and recently again on a Dell PowerEdge 860:

06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

I don't know the exact point when it started, nor do I currently have a known-good kernel. The problem exists at least in 3.13.x and 3.14.x (up to 3.14.4).
Common to both servers is the use of iptables. On the R210 machine, one workaround in the past was to not use (or even build) netfilter at all. Googling suggests that there was a kernel bug in the past, CVE-2013-4470, which involves at least some of these function calls. Right now, I'm building a hardened-sources-3.14.6 kernel with UDP_CORK amputated in the code.
Try using a serial/USB console to capture the rest of the output. Since you cannot reproduce it now, try to do so when it happens. In the meantime, there doesn't seem to be anything interesting to look at, so please reopen this bug report when the time comes. I suggest replacing some if not all of the hardware, since it apparently happens infrequently (both in the sense of not very often and at varying intervals). As for the "ipv6 code", the crash might just surface there because that code path is the most frequently used one; you can't rule out a more basic memory corruption.
Created attachment 379204 [details] Crash example 1
Created attachment 379206 [details] Crash example 2
I currently cannot use a serial or USB console unless my hoster supports the case:

* DRAC ssh broken (Received disconnect from 0.0.0.0: 11: Logged out.)
* VGA not scrollable
* no physical access

I've opened a support case at my hoster to get a working serial console and maybe also look into hardware issues.
Try the boot_delay=... kernel parameter (fill in a value in place of the dots):

    boot_delay=  Milliseconds to delay each printk during boot.
                 Values larger than 10 seconds (10000) are changed to
                 no delay (0). Format: integer

https://www.kernel.org/doc/Documentation/kernel-parameters.txt

This should allow you to increase the print delay such that you can try to:
1) make a video of the whole thing (low delay),
2) take a picture of the top part (medium delay), or
3) write it out manually on paper or another device (high delay).
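For reference, boot_delay goes on the kernel command line; a sketch for a GRUB 2 setup (the 1000 ms value and file paths are illustrative, and the config file location varies by distribution):

```shell
# /etc/default/grub -- delay each boot-time printk by 1000 ms
# (per kernel-parameters.txt, values above 10000 are treated as no delay)
GRUB_CMDLINE_LINUX="boot_delay=1000"

# then regenerate the GRUB config, e.g.:
grub-mkconfig -o /boot/grub/grub.cfg
```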
Created attachment 379390 [details] netconsole crash log I managed to configure netconsole accordingly and got a complete crashdump!
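For anyone else trying to capture a panic this way, netconsole can be loaded as a module; the addresses, ports, interface, and MAC below are placeholders (syntax per Documentation/networking/netconsole.txt):

```shell
# Sender: stream the kernel log over UDP from local eth0 to 10.0.0.2:6666.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-mac]
modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55

# Receiver (on 10.0.0.2): capture everything, including the panic.
nc -u -l 6666 | tee crash.log   # or 'nc -u -l -p 6666' on traditional netcat
```

The target MAC is needed because the sending kernel cannot ARP once it has panicked.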
Since my last post, I have disabled my IPv6 uplink tunnel and have had no crashes since. The crashing server operates as an IPv6 tunnel endpoint, forwarding smaller subnets through an OpenVPN tap device. To complicate things, the network behind OpenVPN uses IPsec transport mode (which adds to the packet size). It'd really be nice if we could narrow this down, and I'm willing to provide configs or any other information that could help. I'm just not sure what would be interesting.
Created attachment 380490 [details] Current kernel config
(In reply to Jeroen Roovers from comment #1)
> Try using a serial/USB console to capture the rest of the output. Since you
> cannot reproduce it now, try to do so when it happens. In the meantime,
> there doesn't seem to be anything interesting to look at, so please reopen
> this bug report when the time comes.

Or use netconsole and log to another machine. See <linux-src-tree>/Documentation/networking/netconsole.txt

> Right now, I'm building a hardened-sources-3.14.6 kernel with UDP_CORK
> amputated in the code.

Can you try comparing vanilla-sources to hardened-sources with nearly identical configurations? The panic doesn't suggest anything to me except that it is related to sending an IPv6 UDP datagram.
Removing UDP_CORK did not change the behaviour; it's still crashing the same way as before. Regarding comparison with a vanilla kernel: I'd rather not try that, since it would defeat the whole idea of running a hardened system in a potentially hostile environment, and I don't have a test lab or fallback machine to switch to. My newest attempt is a cron job that restarts the tunnel once per hour. I hope this gains me enough time until a replacement server at my new job becomes available. *Then* I will have a machine with physical access where I can dig deeper into the problem. The network console is working fine so far. I'd even extend that to using kexec/kdump, but I seem to have trouble setting up a crash kernel correctly. Can someone point me to documentation that works on a hardened kernel?
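For the kdump part, the usual (non-hardened) recipe looks roughly like this; paths, the reservation size, and the append string are placeholders, and a hardened kernel additionally needs CONFIG_KEXEC and CONFIG_CRASH_DUMP enabled (grsecurity may restrict kexec at runtime):

```shell
# 1) Reserve memory for the crash kernel on the *running* kernel's
#    command line, e.g.:  crashkernel=256M

# 2) After boot, load the panic (capture) kernel with kexec:
kexec -p /boot/vmlinuz-crash \
      --initrd=/boot/initrd-crash \
      --append="root=/dev/sda1 single irqpoll maxcpus=1"

# 3) On panic, the capture kernel boots; the old kernel's memory
#    image is then available and can be saved for analysis:
cp /proc/vmcore /var/crash/vmcore
```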
I applied the script from https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/scripts/decode_stacktrace.sh?id=dbd1abb209715544bf37ffa0a3798108e140e3ec on my kernel, but it doesn't include much more information than before :(

[43245.251941] skb_push (??:?)
[43245.251970] ip6_finish_output2 (ip6_output.c:?)
[43245.252000] ? ipv6_select_ident (??:?)
[43245.252029] ip6_fragment (??:?)
[43245.252058] ? nf_iterate (??:?)
[43245.252086] ? ip6_forward_finish (ip6_output.c:?)
[43245.252117] ip6_finish_output (ip6_output.c:?)
[43245.252146] ip6_output (??:?)
[43245.252175] xfrm_output_resume (??:?)
[43245.252204] xfrm_output2 (xfrm_output.c:?)
[43245.252232] xfrm_output (??:?)
[43245.252260] xfrm6_output_finish (??:?)
[43245.252289] __xfrm6_output (xfrm6_output.c:?)
[43245.252317] xfrm6_output (??:?)
[43245.252345] ip6_local_out (??:?)
[43245.252373] ip6_push_pending_frames (??:?)
[43245.252403] ? ip6_append_data (??:?)
[43245.252433] ? ip_reply_glue_bits (??:?)
[43245.252463] udp_v6_push_pending_frames (udp.c:?)
[43245.252493] udpv6_sendmsg (??:?)
[43245.252523] ? up (??:?)
[43245.252550] ? console_unlock (??:?)
[43245.252582] inet_sendmsg (??:?)
[43245.252611] ? native_smp_send_reschedule (smp.c:?)
[43245.252642] sock_sendmsg (??:?)
[43245.252671] ? default_wake_function (??:?)
[43245.252701] ? __fget_light (file.c:?)
[43245.252729] ? __fdget (??:?)
[43245.252757] ? sockfd_lookup_light (socket.c:?)
[43245.252786] SyS_sendto (??:?)
[43245.252816] ? fsnotify (??:?)
[43245.252844] ? __wake_up_locked_key (??:?)
[43245.252875] system_call_fastpath (arch/x86/kernel/entry_64.o:?)
[43245.252903] ? __fget_light (file.c:?)

Besides enabling frame pointers, are there more kernel options I can enable to fill in the ?? parts?
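The ??:? entries usually mean the binaries lack debug information: decode_stacktrace.sh resolves file/line via addr2line, which needs a kernel built with CONFIG_DEBUG_INFO (frame pointers alone don't add file/line data). A typical invocation against a saved log, with example paths:

```shell
# usage: decode_stacktrace.sh <vmlinux> <base path> [<modules path>]
# vmlinux must match the crashed kernel and carry CONFIG_DEBUG_INFO
./scripts/decode_stacktrace.sh ./vmlinux "$PWD" < crash.log > decoded.log
```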
(In reply to satmd from comment #11)
...
> Besides enabling frame pointers, are there more kernel options I can enable
> to fill in the ?? parts?

Do you get the same on the equivalent vanilla kernels or not? I'll cc hardened upstream to see if they have a clue.
this doesn't look like a problem triggered by a grsec feature (e.g., size overflow is disabled as are other plugin based features). it's either a vanilla kernel bug or perhaps a bad backport from upstream. either way, testing vanilla at least until it reproduces the problem (or works fine for more than what would already make the grsec kernel fail) would be helpful. also try 3.15.x to have another data point
(In reply to PaX Team from comment #13) > this doesn't look like a problem triggered by a grsec feature (e.g., size > overflow is disabled as are other plugin based features). it's either a > vanilla kernel bug or perhaps a bad backport from upstream. either way, > testing vanilla at least until it reproduces the problem (or works fine for > more than what would already make the grsec kernel fail) would be helpful. > also try 3.15.x to have another data point @satmd. Did you try any of the above?
Sorry for my late reply.

I worked around the problem only with a cron job that resets my IPv6 tunnel once per night.

I've upgraded to hardened-sources 3.16.2 recently and did not have the problem any more (yet).

I'd rather be happy with the problem disappearing now than risk more downtime for the services.

I'd be fine with closing the bug, since we can't easily reproduce the problem, and maybe reopening it if it ever happens again.

The problem has appeared sporadically since 2.6.38, ranging from every 20 minutes to once in a few weeks.

I've ordered a new server, and if you insist on looking into the problem some more, I can experiment further with the old server once I have transferred my important stuff over to the new one.
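For the record, a workaround like the one described could be a crontab entry along these lines (the init script name is hypothetical; substitute whatever manages the tunnel on your system):

```shell
# /etc/crontab -- restart the IPv6 tunnel nightly at 04:00
0 4 * * *  root  /etc/init.d/ip6tunnel restart
```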
(In reply to satmd from comment #15)
> Sorry for my late reply.
>
> I worked around the problem only with a cron job that resets my ipv6 tunnel
> once per night.
>
> I've upgraded to hardened-kernel 3.16.2 recently and did not have the
> problem any more (yet).
>
> I'd rather be happy with the problem disappearing now than risking more
> downtime for the services.
>
> I'd be fine if we closed the bug since we can't easily reproduce the problem
> and maybe reopen if it ever happens again.
>
> The problem appeared since 2.6.38 but always sporadically ranging from every
> 20 minutes to once in a few weeks.
>
> I've ordered a new server and if you insist on looking into the problem some
> more, I can experiment more with the old server when I transferred my
> important stuff over to the new one.

We didn't really nail this down, but given that it's probably not a hardened-sources problem and that you can't reproduce it, I'm closing this. Feel free to reopen if you get more information for us.