513482 – sys-kernel/hardened-sources-3.14.4 - Kernel panic - not syncing: Fatal exception in interrupt

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Hardened
Hardware:	All Linux

Importance:	Normal normal
Assignee:	The Gentoo Linux Hardened Kernel Team (OBSOLETE)

Reported:	2014-06-16 15:18 UTC by satmd
Modified:	2014-11-29 13:25 UTC
CC List:	4 users

Attachments
Crash example 1 (1.jpg,124.98 KB, image/jpeg) 2014-06-18 13:15 UTC, satmd
Crash example 2 (2.jpg,126.13 KB, image/jpeg) 2014-06-18 13:15 UTC, satmd
netconsole crash log (crashdump.txt,10.33 KB, text/plain) 2014-06-21 22:50 UTC, satmd
Current kernel config (config,71.64 KB, text/x-mpsub) 2014-07-09 12:25 UTC, satmd

Description satmd 2014-06-16 15:18:50 UTC

I get crashes like this one from time to time, sometimes every few hours, sometimes once a month.

Crash log copied manually from VGA; please excuse the missing top and possible typos:
…
[<ffffffff8141fc8e>] ? ip_reply_glue_bits+0x5a/0x5a
[<ffffffff8148be92>] udp_v6_push_pending_frames+0x29a/0x312
[<ffffffff8148ce56>] udpv6_sendmsg+0x696/0x84d
[<ffffffff810b2450>] ? check_preempt_curr+0x2b/0x66
[<ffffffff810b249d>] ? ttwu_do_wakeup+0x12/0x7f
[<ffffffff81448dd0>] inet_sendmsg+0x58/0x91
[<ffffffff810d5dba>] ? wake_futex+0x57/0x6c
[<ffffffff813d9d19>] sock_sendmsg+0x69/0x7a
[<ffffffff814c4dc5>] ? __schedule+0x56f/0x784
[<ffffffff81157ba1>] ? __fget_light+0x46/0x5b
[<ffffffff81157bc4>] ? __fdget+0xe/0x10
[<ffffffff813da268>] ? sockfd_lookup_light+0x12/0x5b
[<ffffffff813db9e9>] SyS_sendto+0x109/0x13a
[<ffffffff814c7e41>] ? sysret_check+0x19/0x5b
[<ffffffff814c7e1e>] system_call_fastpath+0x16/0x1b
Code: 00 00 58 89 44 24 10 8b 87 bc 00 00 00 48 89 44 24 08 48 8b 87 d0 00 00 00 48 c7 c7 41 89 64 81 48 89 04 24 31 c0 e8 8f bb ff ff <0f> 0b 55 47 78 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 28 4c
RIP [<ffffffff814c3e53>] skb_panic+0x5e/0x60
 RSP <ffff8800503b98e8>
---[ end trace 1525a4a3242c7adb ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
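
For readers who want to work with a trace like the one above programmatically, here is a minimal Python sketch that splits each call-trace line into address, symbol, offset and function size. The regular expression and the helper names (`parse_frame`, `reliable_symbols`) are illustrative assumptions, not part of any kernel tooling. A leading `?` marks a frame whose address was found on the stack but not confirmed by the unwinder, so the sketch treats those frames as unreliable.

```python
import re

# Matches frames such as "[<ffffffff8148ce56>] udpv6_sendmsg+0x696/0x84d".
# An optional "?" marks a frame the kernel could not verify.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frame(line):
    """Return a dict describing one call-trace line, or None if it does not match."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "addr": int(m.group("addr"), 16),
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "reliable": m.group("unreliable") is None,
    }


def reliable_symbols(lines):
    """List the symbols of the reliable frames, in the order they appear."""
    frames = (parse_frame(line) for line in lines)
    return [f["symbol"] for f in frames if f and f["reliable"]]


sample = [
    "[<ffffffff8148be92>] udp_v6_push_pending_frames+0x29a/0x312",
    "[<ffffffff8148ce56>] udpv6_sendmsg+0x696/0x84d",
    "[<ffffffff810b2450>] ? check_preempt_curr+0x2b/0x66",
    "[<ffffffff81448dd0>] inet_sendmsg+0x58/0x91",
]
print(reliable_symbols(sample))
```

Filtering out the `?` frames leaves the UDPv6 send path, which is the part of the trace the later comments focus on.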

I've been getting these crashes since kernel 3.12.x on a Dell PowerEdge R210 with a tg3-supported NIC (sorry, that machine no longer exists, so I don't know the exact NIC model) and recently again on a Dell PowerEdge 860:
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

I don't know the exact point when it started, nor do I currently have a safely working kernel. Problem exists at least in 3.13.x and 3.14.x (up to 3.14.4).

Common for both servers is the usage of iptables. On the R210 machine, one workaround in the past has been to not use (even build?) netfilter at all.

Googling suggests that there was a kernel bug CVE-2013-4470 in the past which involves at least some of the function calls.

Right now, I'm building a hardened-sources-3.14.6 kernel with UDP_CORK amputated in the code.

Comment 1 Jeroen Roovers (RETIRED) gentoo-dev

2014-06-16 20:51:26 UTC

Try using a serial/USB console to capture the rest of the output. Since you cannot reproduce it now, try to do so when it happens. In the meantime, there doesn't seem to be anything interesting to look at, so please reopen this bug report when the time comes.

I suggest replacing some if not all hardware since it apparently happens infrequently (both in the sense of not very often and with varying intervals).

As for the "ipv6 code" it might just be happening there because that code is most frequently used. You can't rule out a more basic memory corruption there.

Comment 2 satmd 2014-06-18 13:15:44 UTC

Created attachment 379204 [details]
Crash example 1

Comment 3 satmd 2014-06-18 13:15:54 UTC

Created attachment 379206 [details]
Crash example 2

Comment 4 satmd 2014-06-18 13:20:42 UTC

I currently cannot use a serial or USB console unless my hosting provider supports it.

* DRAC ssh broken (Received disconnect from 0.0.0.0: 11: Logged out.)
* VGA not scrollable
* no physical access

I've opened a support case with my hosting provider to get a working serial console and maybe also look into hardware issues.

Comment 5 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2014-06-18 17:05:16 UTC

Try with the boot_delay=... kernel param (fill in a value in place of the dots)

	boot_delay=	Milliseconds to delay each printk during boot.
			Values larger than 10 seconds (10000) are changed to
			no delay (0).
			Format: integer

        https://www.kernel.org/doc/Documentation/kernel-parameters.txt

This should allow you to increase the print delay such that you can try to 1) make a video of the whole thing (low delay), 2) take a picture of the top part (medium delay), or 3) write it out manually on paper or another device (high delay).
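
As a rough aid for picking a value, the following Python sketch applies the rule quoted from the documentation above (values above 10000 ms are changed to no delay) and estimates the total added delay. The function names and the figure of 60 printk lines are assumptions chosen purely for illustration.

```python
def effective_boot_delay_ms(requested_ms):
    """Apply the documented rule: values above 10000 ms become 0 (no delay)."""
    return 0 if requested_ms > 10000 else requested_ms


def total_delay_seconds(requested_ms, printk_lines):
    """Estimate the total extra time spent delaying printk_lines messages."""
    return effective_boot_delay_ms(requested_ms) * printk_lines / 1000.0


# Hypothetical example: 60 lines at 500 ms each.
print(total_delay_seconds(500, 60))    # 30.0
# A value above the documented limit results in no delay at all.
print(total_delay_seconds(20000, 60))  # 0.0
```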

Comment 6 satmd 2014-06-21 22:50:07 UTC

Created attachment 379390 [details]
netconsole crash log

I managed to configure netconsole accordingly and got a complete crashdump!

Comment 7 satmd 2014-06-26 20:51:56 UTC

Around the time of my last post, I disabled my IPv6 uplink tunnel and have had no crashes since then.

The server that is crashing operates as an IPv6 tunnel endpoint, forwarding smaller subnets through an OpenVPN tap device.

To complicate things, the network behind OpenVPN uses IPsec transport mode (which adds to the packet size).
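
To make the packet-size remark concrete, here is a small Python sketch of the arithmetic. The 40-byte IPv6 header, the 8-byte UDP header and the 1280-byte IPv6 minimum MTU are fixed by the protocols. The `extra_overhead` parameter stands in for whatever IPsec adds, which depends on the cipher and mode and is not stated in this report, so the value of 30 bytes below is only a placeholder assumption.

```python
IPV6_HEADER = 40     # fixed IPv6 header size in bytes
UDP_HEADER = 8       # UDP header size in bytes
IPV6_MIN_MTU = 1280  # minimum link MTU that IPv6 requires


def packet_size(udp_payload, extra_overhead=0):
    """Total on-wire IPv6 packet size for a UDP payload plus extra overhead."""
    return IPV6_HEADER + UDP_HEADER + udp_payload + extra_overhead


def needs_fragmentation(udp_payload, mtu=IPV6_MIN_MTU, extra_overhead=0):
    """True if the resulting packet would not fit into a single MTU-sized frame."""
    return packet_size(udp_payload, extra_overhead) > mtu


# A 1220-byte payload fits on its own (1268 bytes)...
print(needs_fragmentation(1220))                     # False
# ...but not once a hypothetical 30 bytes of IPsec overhead are added (1298 bytes).
print(needs_fragmentation(1220, extra_overhead=30))  # True
```

This only shows why packets that previously fit may start to be fragmented; it does not by itself explain the panic.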

It'd really be nice if we can narrow this thing down and I'm willing to add configs, stuff or other information that could help.

I'm just not sure what'd be interesting.

Comment 8 satmd 2014-07-09 12:25:05 UTC

Created attachment 380490 [details]
Current kernel config

Comment 9 Anthony Basile gentoo-dev

2014-07-09 13:43:20 UTC

(In reply to Jeroen Roovers from comment #1)
> Try using a serial/USB console to capture the rest of the output. Since you
> cannot reproduce it now, try to do so when it happens. In the mean time,
> there doesn't seem to be anything interesting to look at, so please reopen
> this bug report when the time comes.

Or use netconsole and log to another machine.  See <linux-src-tree>/Documentation/networking/netconsole.txt
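
netconsole sends kernel messages as plain UDP datagrams, by default to port 6666 on the target host, so any UDP listener on the logging machine can capture them. The following Python sketch is one minimal way to do that; the function names and the log file name are assumptions made for this example, and tools such as netcat or a syslog daemon are the more usual choice.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Open a UDP socket on the port netconsole sends to (6666 by default)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_message(sock, bufsize=65535):
    """Block until one datagram arrives; return (sender address, decoded text)."""
    data, (sender, _port) = sock.recvfrom(bufsize)
    return sender, data.decode("utf-8", errors="replace")


def log_forever(sock, path):
    """Append every received message to the file at path, flushing each time."""
    with open(path, "a", encoding="utf-8") as log:
        while True:
            _sender, text = receive_message(sock)
            log.write(text)
            log.flush()  # keep the file current in case the sender dies mid-panic


# Usage on the logging machine (the file name is a placeholder):
#   log_forever(open_listener(), "netconsole.log")
```

Flushing after every message matters here because the interesting lines are the last ones sent before the panicking machine goes silent.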


> Right now, I'm building a hardened-sources-3.14.6 kernel with UDP_CORK
> amputated in the code.

Can you try comparing vanilla-sources to hardened-sources with nearly identical configurations?  The panic doesn't suggest anything to me except that it is related to sending an IPv6 UDP datagram.

Comment 10 satmd 2014-07-19 12:48:22 UTC

Removing UDP CORK did not change the behaviour, it's still crashing the same way as before.

Regarding comparing to a vanilla kernel, I'd rather not try that since that'd defeat the whole idea of running a hardened system in a potentially hostile environment and I don't have a test lab or fallback machine to switch to.

My newest attempt is a cronjob to restart the tunnel once per hour. I hope this gains me enough time until a replacement server at my new job becomes available. *Then* I will have a machine with physical access where I can dig deeper into the problem.

The network console is working fine so far. I'd even extend that to using kexec/kdump, but I seem to have trouble setting up a crash kernel correctly. Can someone point me to documentation that works on a hardened kernel?

Comment 11 satmd 2014-07-24 18:00:18 UTC

I applied the script from
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/scripts/decode_stacktrace.sh?id=dbd1abb209715544bf37ffa0a3798108e140e3ec

on my kernel, but it doesn't include much more information than before :(

[43245.251941] skb_push (??:?)
[43245.251970] ip6_finish_output2 (ip6_output.c:?)
[43245.252000] ? ipv6_select_ident (??:?)
[43245.252029] ip6_fragment (??:?)
[43245.252058] ? nf_iterate (??:?)
[43245.252086] ? ip6_forward_finish (ip6_output.c:?)
[43245.252117] ip6_finish_output (ip6_output.c:?)
[43245.252146] ip6_output (??:?)
[43245.252175] xfrm_output_resume (??:?)
[43245.252204] xfrm_output2 (xfrm_output.c:?)
[43245.252232] xfrm_output (??:?)
[43245.252260] xfrm6_output_finish (??:?)
[43245.252289] __xfrm6_output (xfrm6_output.c:?)
[43245.252317] xfrm6_output (??:?)
[43245.252345] ip6_local_out (??:?)
[43245.252373] ip6_push_pending_frames (??:?)
[43245.252403] ? ip6_append_data (??:?)
[43245.252433] ? ip_reply_glue_bits (??:?)
[43245.252463] udp_v6_push_pending_frames (udp.c:?)
[43245.252493] udpv6_sendmsg (??:?)
[43245.252523] ? up (??:?)
[43245.252550] ? console_unlock (??:?)
[43245.252582] inet_sendmsg (??:?)
[43245.252611] ? native_smp_send_reschedule (smp.c:?)
[43245.252642] sock_sendmsg (??:?)
[43245.252671] ? default_wake_function (??:?)
[43245.252701] ? __fget_light (file.c:?)
[43245.252729] ? __fdget (??:?)
[43245.252757] ? sockfd_lookup_light (socket.c:?)
[43245.252786] SyS_sendto (??:?)
[43245.252816] ? fsnotify (??:?)
[43245.252844] ? __wake_up_locked_key (??:?)
[43245.252875] system_call_fastpath (arch/x86/kernel/entry_64.o:?)
[43245.252903] ? __fget_light (file.c:?)

Besides enabling frame pointers, are there more kernel options I can enable to fill in the ?? parts?
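
For context on the top frame of the decoded trace: in the kernel, skb_push() prepends header bytes by moving the buffer's data pointer backwards and panics (via skb_panic) if that would move it before the start of the buffer, i.e. if there is not enough headroom. The Python toy model below illustrates only that check. The class, the exception and the sizes are hypothetical, and it makes no claim about which header or code path ran out of headroom in this bug.

```python
class SkbUnderPanic(Exception):
    """Stands in for the kernel's skb_panic(); raised instead of halting."""


class ToySkb:
    """Toy model of a socket buffer that tracks offsets only, not real data."""

    def __init__(self, headroom, payload_len):
        self.head = 0          # start of the allocated buffer
        self.data = headroom   # start of the current packet data
        self.len = payload_len

    def headroom(self):
        """Bytes available in front of the packet data."""
        return self.data - self.head

    def push(self, n):
        """Prepend n header bytes, as skb_push() does.

        The kernel moves the pointer first and then checks it; checking
        first, as done here, leads to the same outcome.
        """
        if n > self.headroom():
            raise SkbUnderPanic(f"need {n} bytes, only {self.headroom()} of headroom")
        self.data -= n
        self.len += n
        return self.data


# Hypothetical numbers: 8 bytes of headroom, then a 14-byte header is pushed.
skb = ToySkb(headroom=8, payload_len=100)
try:
    skb.push(14)
except SkbUnderPanic as exc:
    print("panic:", exc)
```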

Comment 12 Anthony Basile gentoo-dev

2014-07-24 19:03:03 UTC

(In reply to satmd from comment #11)
...
> Besides enabling frame pointers, is there more kernel options I can enable
> to fill the ?? parts?

Do you get the same on the equivalent vanilla kernels or no?  I'll cc hardened upstream to see if they have a clue.

Comment 13 PaX Team 2014-07-24 21:32:48 UTC

this doesn't look like a problem triggered by a grsec feature (e.g., size overflow is disabled as are other plugin based features). it's either a vanilla kernel bug or perhaps a bad backport from upstream. either way, testing vanilla at least until it reproduces the problem (or works fine for more than what would already make the grsec kernel fail) would be helpful. also try 3.15.x to have another data point

Comment 14 Anthony Basile gentoo-dev

2014-09-14 00:42:27 UTC

(In reply to PaX Team from comment #13)
> this doesn't look like a problem triggered by a grsec feature (e.g., size
> overflow is disabled as are other plugin based features). it's either a
> vanilla kernel bug or perhaps a bad backport from upstream. either way,
> testing vanilla at least until it reproduces the problem (or works fine for
> more than what would already make the grsec kernel fail) would be helpful.
> also try 3.15.x to have another data point

@satmd.  Did you try any of the above?

Comment 15 satmd 2014-09-15 13:37:32 UTC

Sorry for my late reply.

I worked around the problem only with a cron job that resets my ipv6 tunnel once per night.

I've upgraded to hardened-kernel 3.16.2 recently and did not have the problem any more (yet).

I'd rather be happy with the problem disappearing now than risking more downtime for the services.

I'd be fine if we closed the bug since we can't easily reproduce the problem and maybe reopen if it ever happens again.

The problem has appeared since 2.6.38, but always sporadically, ranging from every 20 minutes to once in a few weeks.

I've ordered a new server, and if you insist on looking into the problem some more, I can experiment more with the old server once I have transferred my important stuff over to the new one.

Comment 16 Anthony Basile gentoo-dev

2014-11-29 13:25:58 UTC

(In reply to satmd from comment #15)
> Sorry for my late reply.
> 
> I worked around the problem only with a cron job that resets my ipv6 tunnel
> once per night.
> 
> I've upgraded to hardened-kernel 3.16.2 recently and did not have the
> problem any more (yet).
> 
> I'd rather be happy with the problem disappearing now than risking more
> downtime for the services.
> 
> I'd be fine if we closed the bug since we can't easily reproduce the problem
> and maybe reopen if it ever happens again.
> 
> The problem appeared since 2.6.38 but always sporadically ranging from every
> 20 minutes to once in a few weeks.
> 
> I've ordered a new server and if you insist on looking into the problem some
> more, I can experiment more with the old server when I transferred my
> important stuff over to the new one.

We didn't really nail this, but given that it's probably not a hardened-sources problem and that you're not reproducing it, I'm closing this.  Feel free to reopen if you get more info for us.