483686 – =sys-kernel/gentoo-sources-3.11.0 - New kernel aborts immediately when using kexec.

Status:	RESOLVED OBSOLETE

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	AMD64 Linux

Importance:	Normal normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2013-09-05 09:39 UTC by Liam Dennehy
Modified:	2019-07-14 11:10 UTC
CC List:	1 user

Attachments
dmesg output (dmesg, 52.74 KB, text/plain) 2013-09-05 10:27 UTC, Liam Dennehy

Description Liam Dennehy 2013-09-05 09:39:23 UTC

On a system successfully kexeced from 3.10.10-gentoo up to 3.11.0-gentoo, any subsequent kexec call causes the replacement kernel to abort immediately. The kernel loads (kexec -l) successfully, but when execution of the new kernel begins, about half a second of log output appears and the system reboots (to BIOS).
The same target kernel, initrd and command line succeed when launched from 3.10.10-gentoo.

Reproducible: Always

Steps to Reproduce:
/proc/cmdline: root=/dev/mapper/vg1-gentoo--root ro rootflags=subvol=@ rd.auto=1 nomodeset

1. Generate kernel image and initrd for 3.11.0-gentoo and 3.10.10-gentoo
2. kexec -l <target> --initrd=<target> --reuse-cmdline
3. shutdown -r now

(The reboot.sh script runs kexec -e, so the system kexecs instead of rebooting.)
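The steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the function names and the example paths are hypothetical, and the sketch prints the commands unless DRY_RUN=0 is set, because kexec -e jumps into the loaded kernel immediately without going through the normal shutdown sequence.

```sh
#!/bin/sh
# Sketch only: function names and paths are hypothetical, not from the report.

run() {
    # Print the command in dry-run mode (the default), execute it otherwise.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

kexec_load() {
    # $1 = kernel image, $2 = initrd; reuse the running kernel's command line.
    run kexec -l "$1" --initrd="$2" --reuse-cmdline
}

kexec_jump() {
    # Execute the previously loaded kernel. In the report this is done by
    # reboot.sh at the end of "shutdown -r now", not by hand.
    run kexec -e
}

# Example (hypothetical paths):
#   kexec_load /boot/vmlinuz-3.11.0-gentoo /boot/initramfs-3.11.0-gentoo.img
#   shutdown -r now
```

Keeping the dry-run default makes it easy to confirm exactly what will be loaded before committing to the jump.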
Actual Results:  
When kexec -e is launched, timestamped kernel output starts appearing on-screen, e.g.:
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
....

After approximately one screen (25 lines) the system reboots, and I have yet to find a way to pause it to see whether a pertinent message appears before this happens.

Expected Results:  
When performing the steps above on a system booted into 3.10.10-gentoo, both the 3.10 and 3.11 kernels kexec successfully. This should remain the case on a system booted into 3.11.0-gentoo.

The behaviour appears to be the same if kexec -l is launched without either an initrd or a cmdline option supplied (i.e. kexec -l <kernel> _only_). I don't know whether this is pertinent, but I suspect boot options and/or the initrd are not being passed to the running kernel, especially since both versions boot from GRUB and via kexec from 3.10.10-gentoo.

~ # equery l kexec-tools
[IP-] [  ] sys-apps/kexec-tools-2.0.4-r1:0

Comment 1 Liam Dennehy 2013-09-05 09:41:55 UTC

Proofreading skills need work, my apologies.
For clarity, the bug occurs when the system is booted into 3.11.0-gentoo by any means, not just by kexec into 3.11.0-gentoo.

Comment 2 Yixun Lan archtester 2013-09-05 09:55:43 UTC

I also use the 3.11.0-gentoo kernel here, but have no problem.

You'd better provide your kernel config and kernel log (dmesg output) so people can help.

Have you compared the kernel configs of 3.10.10 and 3.11.0? Any differences?
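One possible way to do that comparison is sketched below. The function name is hypothetical and the config paths are supplied by the caller; the sketch only looks at option lines (set options and "# CONFIG_FOO is not set" lines) and ignores everything else.

```sh
#!/bin/sh
# Sketch: list options that differ between two kernel .config files.
# config_diff is a hypothetical helper name, not an existing tool.

config_diff() {
    # $1 = old config, $2 = new config
    old_sorted=$(mktemp) || return 1
    new_sorted=$(mktemp) || return 1
    # Keep set options and "# CONFIG_FOO is not set" lines only.
    grep -E '^(# )?CONFIG_' "$1" | LC_ALL=C sort > "$old_sorted"
    grep -E '^(# )?CONFIG_' "$2" | LC_ALL=C sort > "$new_sorted"
    # "-" marks lines only in the old config, "+" lines only in the new one.
    LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | sed 's/^/-/'
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+/'
    rm -f "$old_sorted" "$new_sorted"
}

# Example (hypothetical paths):
#   config_diff /usr/src/linux-3.10.10-gentoo/.config /usr/src/linux-3.11.0-gentoo/.config
```

Since the 3.11.0 config was produced with make oldconfig, the output should mostly show options that are new in 3.11.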

Comment 3 Liam Dennehy 2013-09-05 10:27:08 UTC

Yippee, a timing bug!
Concerned that boot_delay was having no effect, I discovered CONFIG_BOOT_PRINTK_DELAY was not set. Enabling that and booting with boot_delay=1000 succeeded. boot_delay=1-100 fails. Something is evidently starting too fast, but at least a wall of text now stays on the screen at boot_delay=100, and I will compare it with a regular dmesg, attached.
Apologies, this now appears to be specific to my system, but then again there is definitely something reproducibly wrong with a kernel component that worked in kexec from 3.10.10 on the same system.
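The pitfall above (boot_delay= is silently ignored unless CONFIG_BOOT_PRINTK_DELAY is enabled) can be checked before kexec. A minimal sketch, with hypothetical function names and a config path supplied by the caller:

```sh
#!/bin/sh
# Sketch: verify that boot_delay= can take effect, then build a command line.
# Both function names are hypothetical helpers.

printk_delay_enabled() {
    # $1 = path to the target kernel's .config
    grep -q '^CONFIG_BOOT_PRINTK_DELAY=y$' "$1"
}

delay_cmdline() {
    # $1 = base command line, $2 = delay per printk in milliseconds
    printf '%s boot_delay=%s\n' "$1" "$2"
}

# Example (hypothetical paths), replacing --reuse-cmdline from the steps above:
#   if printk_delay_enabled /usr/src/linux/.config; then
#       kexec -l <kernel> --initrd=<initrd> \
#           --command-line="$(delay_cmdline "$(cat /proc/cmdline)" 1000)"
#   fi
```

Here the full command line is passed with --command-line so the delay value is explicit for each test run.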

Comment 4 Liam Dennehy 2013-09-05 10:27:34 UTC

Created attachment 357908
dmesg output

Comment 5 Liam Dennehy 2013-09-05 11:44:55 UTC

Hello all
I refer to the previously attached dmesg output. The final successful message in the fatal kexec appears to correspond with:

[    0.157046] smpboot: Booting Node   0, Processors  #1 #2 #3 OK 

I cannot see whether the "OK" characters are present; each digit is printed after a separate printk delay, but #3 definitely appears. The next line (chronologically; the dmesg output provided is not correctly ordered) is:

[    0.168235] CPU1: Thermal monitoring handled by SMI

Tweaking boot_delay shows that entering this section within ~10 seconds of kexec -e causes an immediate reboot. The culprit appears to be coretemp, whether loaded as a module or compiled in. Running rmmod coretemp before reboot and adding a ten-second delay lets kexec function normally.

coretemp is enabled and loaded on the equivalent 3.10.10 kernel (from whose config the 3.11.0 kernel was built using make oldconfig). Even after /etc/init.d/lm_sensors zap, run to be sure that coretemp is not unloaded before reboot, a system booted into 3.10.10 still enters this section without a problem.
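The rmmod workaround can be sketched as follows. The function names are hypothetical, and the sketch only prints the commands it would run. The ten-second delay is deliberately not modelled, since the comment does not say exactly where it was added.

```sh
#!/bin/sh
# Sketch: plan the coretemp workaround before kexec -e.
# module_loaded and kexec_plan are hypothetical helper names.

module_loaded() {
    # $1 = module name, $2 = modules list (defaults to /proc/modules)
    grep -q "^$1 " "${2:-/proc/modules}"
}

kexec_plan() {
    # $1 = optional modules list, useful for testing against a saved copy.
    # Prints the commands instead of running them.
    if module_loaded coretemp "$1"; then
        echo "rmmod coretemp"
    fi
    echo "kexec -e"
}

# Example: review the plan on the running system.
#   kexec_plan
```

Note that this only helps when coretemp is built as a module; a built-in coretemp cannot be removed with rmmod.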

Comment 6 Liam Dennehy 2013-09-05 11:51:05 UTC

Previous post a bit long-winded, summary:

Kernel version 3.11.0 with coretemp loaded as a module or built in: the new kernel fails after kexec on entry to the section "CPU1: Thermal monitoring handled by SMI" if that section is reached within ten seconds of kexec -e.

Kernel version 3.11.0 with coretemp unloaded: the new kernel loads after kexec without problem.

Kernel version 3.10.10 with coretemp loaded as a module: the new kernel loads after kexec without problem.

Comment 7 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-09-13 17:27:09 UTC

(In reply to Liam Dennehy from comment #6)
> Previous post a bit long-winded, summary:
> 
> kernel version 3.11.0 with coretemp component loaded as module or built-in:
> new kernel fails after kexec on entry to section "CPU1: Thermal monitoring
> handled by SMI" if section attempted within ten seconds from kexec -e
> 
> kernel version 3.11.0 with coretemp unloaded: new kernel loads after kexec
> without problem.
> 
> kernel version 3.10.10 with coretemp component loaded as module: new kernel
> loads 
> after kexec without problem.

Between 3.10 and 3.11 the following relevant commits can be found:

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=807f730105e986b3b2da711cfd94f22b92532f79

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d23e2ae1aae52bb80bd90525179375817db99809

The changes in these commits appear quite small, so I'd suggest reversing them if possible to see which commit affects you. Make sure you try them one at a time so you know which one it is.

You can reverse a patch with `patch -p1 -R < FILE`, where FILE is your patch.

The raw patches can be found by clicking on the "patch" link in the URLs above.
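A rough sketch of testing the reverts one at a time follows. The function names and example paths are hypothetical; the sketch does a --dry-run first so that a patch that does not reverse cleanly leaves the tree untouched.

```sh
#!/bin/sh
# Sketch: reverse a single patch in a kernel tree, and re-apply it afterwards.
# try_revert and reapply are hypothetical helper names.

try_revert() {
    # $1 = kernel source directory, $2 = ABSOLUTE path to the patch file
    (
        cd "$1" || exit 1
        # Check that the reverse applies cleanly before touching any file.
        patch -p1 -R --dry-run < "$2" > /dev/null || exit 1
        patch -p1 -R < "$2" > /dev/null
    )
}

reapply() {
    # Restore the patch so the next candidate is tested in isolation.
    ( cd "$1" && patch -p1 < "$2" > /dev/null )
}

# Example (hypothetical paths): revert, rebuild, test kexec, then restore.
#   try_revert /usr/src/linux /tmp/first.patch && make && ...
#   reapply /usr/src/linux /tmp/first.patch
```

Re-applying each patch before trying the next keeps the two candidates independent, which is the point of testing them one by one.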

Comment 8 Liam Dennehy 2013-09-19 09:56:22 UTC

Thanks for the pointers Tom, this has been a good learning experience.

Unfortunately my original successful case (coretemp compiled as module, unloaded) is now giving me failures, leading me to believe it's more than just that module. The only successful kexec is now with hwmon disabled entirely, making the two patches mentioned meaningless. This is now taking a lot of time to work through, especially as results are more intermittent than I previously indicated.

I may have to simply accept that 3.11.0 doesn't kexec to anything else when hwmon/coretemp are enabled, even if not loaded.

Comment 9 Liam Dennehy 2013-09-19 10:05:28 UTC

> ... The only successful kexec is now with hwmon disabled
> entirely, making the two patches mentioned meaningless.

# CONFIG_HWMON is not set

Scratch that, even with hwmon disabled the new kernel fails, even when kexecing to itself. The cause is somewhere else entirely, and I'm losing the inclination to track it down.

Comment 10 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-10-14 17:40:35 UTC

(In reply to Liam Dennehy from comment #9)
> This is entirely somewhere else, and I'm losing the
> inclination to track it down.

If so, feel free to file this upstream at https://bugzilla.kernel.org/ and provide us with a URL to the upstream bug report so that we can follow along; I'm running out of ideas too, maybe the upstream maintainers have better ideas. :)

Comment 11 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-11-16 16:34:20 UTC

Did you track it down or file it upstream? Thank you in advance.

Comment 12 Liam Dennehy 2013-11-17 13:23:41 UTC

(In reply to Tom Wijsman (TomWij) from comment #11)
> Did you track it down or file it upstream? Thank you in advance.

This is much more complex to reproduce than I expected: there appears to be a timing component, and my original guess that it is related to the coretemp module no longer appears true. Given that, filing upstream doesn't feel right until I can give the devs some proper evidence to go on.

I will try to reproduce on 3.12.x and see where it goes, but I just don't have the time and resources to properly track this down.

Comment 13 Mike Pagano gentoo-dev 2014-03-07 19:50:53 UTC

> 
> I will try to reproduce on 3.12.x and see where it goes, but I just don't
> have the time and resources to properly track this down.

Ok, that's fair, of course. Comment here if you have the time and would like to pursue this further.