330157 – sys-kernel/hardened-sources-2.6.34-r1 through hardened-sources-2.6.36 reboots when turning on CPU core

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High normal
Assignee:	The Gentoo Linux Hardened Kernel Team (OBSOLETE)

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2010-07-28 06:57 UTC by Jaak Ristioja
Modified:	2010-12-20 23:48 UTC
CC List:	5 users

See Also:
Package list:
Runtime testing required:	---

Attachments
output of emerge --info =hardened-sources-2.6.34-r1 (emerge-info-hardened-sources-2.6.34-r1.txt, 4.45 KB, text/plain) 2010-07-31 10:49 UTC, Hugo Mildenberger
copy of /proc/config.gz for linux 2.6.34-hardened-r1 #7 SMP x86_64 Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz GenuineIntel GNU/Linux (config.gz, 17.03 KB, application/octet-stream) 2010-07-31 10:52 UTC, Hugo Mildenberger
Output of lshw on Dell Vostro 1510 (lshw.txt, 18.97 KB, text/plain) 2010-07-31 10:53 UTC, Hugo Mildenberger

Description Jaak Ristioja 2010-07-28 06:57:28 UTC

I turned one core of a dual-core CPU off using "echo 0 > /sys/devices/system/cpu/cpu1/online" and later turned it back on again with "echo 1 > /sys/devices/system/cpu/cpu1/online". This resulted in a black screen followed by POST.
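For reference, the two sysfs writes described above can be wrapped in a small helper. This is only a sketch: the function names and the `SYSFS_CPU_ROOT` override are hypothetical and exist so the logic can be exercised against a scratch directory instead of a live system. On an affected kernel, bringing the core back online is the step that triggers the reboot.

```shell
#!/bin/sh
# Sketch: toggle a CPU's hotplug state through sysfs.
# SYSFS_CPU_ROOT is a hypothetical override (not a kernel feature) that
# allows dry runs against a scratch directory.
SYSFS_CPU_ROOT="${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}"

set_cpu_online() {
    # $1 = CPU number, $2 = 0 (take offline) or 1 (bring online)
    cpu="$1"
    state="$2"
    case "$state" in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac
    ctl="$SYSFS_CPU_ROOT/cpu$cpu/online"
    if [ ! -e "$ctl" ]; then
        echo "no hotplug control file: $ctl" >&2
        return 1
    fi
    echo "$state" > "$ctl"
}

get_cpu_online() {
    # $1 = CPU number; prints 0 or 1
    cat "$SYSFS_CPU_ROOT/cpu$1/online"
}
```

Run as root, `set_cpu_online 1 0` followed by `set_cpu_online 1 1` repeats the sequence from the description.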

Comment 1 Hugo Mildenberger 2010-07-28 13:02:38 UTC

You could set up netconsole support and then repeat the experiment in order to get some context information about what actually happened. Here is a HOWTO: /usr/src/linux/Documentation/networking/netconsole.txt
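As a rough illustration of the netconsole setup, the module takes a single parameter of the form `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`. The helper below only builds that string. Its name is hypothetical, the IP addresses and device in the usage note are placeholders, and the ports are the documented defaults (6665 source, 6666 target).

```shell
#!/bin/sh
# Sketch: build the netconsole module parameter.
# netconsole_param is a hypothetical helper name.
netconsole_param() {
    # $1 = local IP, $2 = local network device, $3 = target IP,
    # $4 = target MAC address (optional; empty lets the kernel use broadcast)
    src_ip="$1"
    dev="$2"
    tgt_ip="$3"
    tgt_mac="${4:-}"
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' \
        "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}
```

On the machine under test this would be used as `modprobe netconsole "$(netconsole_param 10.0.0.1 eth0 10.0.0.2)"`, with a UDP listener on port 6666 (for example a netcat variant) running on the receiving host.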

Comment 2 Chí-Thanh Christopher Nguyễn gentoo-dev 2010-07-28 16:53:32 UTC

Also check if you have reboot on panic enabled.
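One way to check this is to read the `kernel.panic` sysctl, where 0 means the kernel does not reboot after a panic. The sketch below assumes the usual `/proc/sys/kernel/panic` location; the function name and the `PANIC_SYSCTL` override are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the kernel reboots automatically after a panic.
# PANIC_SYSCTL is a hypothetical override used only for testing.
PANIC_SYSCTL="${PANIC_SYSCTL:-/proc/sys/kernel/panic}"

panic_reboot_status() {
    timeout=$(cat "$PANIC_SYSCTL") || return 1
    if [ "$timeout" -eq 0 ]; then
        echo "reboot on panic disabled"
    else
        echo "reboot on panic enabled (kernel.panic=$timeout)"
    fi
}
```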

Comment 3 Hugo Mildenberger 2010-07-28 20:42:59 UTC

Can reproduce this using 2.6.34-grsec on a Core2 Duo 64 bit system. The reboot comes so fast that nothing reaches the logs after these two lines:

[ 3784.210039] CPU 1 is now offline
[ 3784.210045] SMP alternatives: switching to UP code

Comment 4 Anthony Basile gentoo-dev 2010-07-28 20:55:39 UTC

@reporter. hardened-sources-2.6.34 had known issues and was masked for a while. It has been replaced by hardened-sources-2.6.34-r1. If it's not too much trouble, can you test with -r1? If the problem repeats, please post your emerge --info and the kernel config.

Comment 5 Hugo Mildenberger 2010-07-30 12:57:32 UTC

2.6.34-hardened-r1 crashes in exactly the same way, while 2.6.34-gentoo-r2 works correctly. Unfortunately, netconsole did not show anything after boot, and instructing syslog-ng to log to a remote host also shows nothing after reactivating cpu1. A crash kernel loaded via kexec does not gain control, even though the crash kernel actually comes up when manually triggered via "echo c > /proc/sysrq-trigger". Hence:

- the POST is not a consequence of a kernel panic (else the crash kernel would gain control)

- the crash is _probably_ related to hardened/grsec only (2.6.34-gentoo-r1 not tested though)
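The reasoning above depends on a crash kernel actually being loaded at the time of the test. A minimal check, assuming the kernel exposes `/sys/kernel/kexec_crash_loaded` (the function name and the `KEXEC_CRASH_FLAG` override are hypothetical):

```shell
#!/bin/sh
# Sketch: succeed only if a kexec crash kernel is currently loaded.
# KEXEC_CRASH_FLAG is a hypothetical override used only for testing.
KEXEC_CRASH_FLAG="${KEXEC_CRASH_FLAG:-/sys/kernel/kexec_crash_loaded}"

crash_kernel_loaded() {
    [ "$(cat "$KEXEC_CRASH_FLAG" 2>/dev/null)" = "1" ]
}

# Example (deliberately crashes the machine, so it is left commented out):
#   crash_kernel_loaded && echo c > /proc/sysrq-trigger
```

If the check passes and the manual trigger boots the crash kernel, but the CPU hotplug reboot does not, that supports the conclusion that the reset is not going through the normal panic path.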

Comment 6 Anthony Basile gentoo-dev 2010-07-30 18:53:05 UTC

Okay, this has to be looked at more carefully. Can I have your emerge --info, the kernel config file and hardware info?

Comment 7 Hugo Mildenberger 2010-07-31 10:49:46 UTC

Created attachment 240781 [details]
output of emerge --info =hardened-sources-2.6.34-r1

(In reply to comment #6)

Comment 8 Hugo Mildenberger 2010-07-31 10:52:19 UTC

Created attachment 240785 [details]
copy of /proc/config.gz for  linux 2.6.34-hardened-r1 #7 SMP x86_64 Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz GenuineIntel GNU/Linux

Comment 9 Hugo Mildenberger 2010-07-31 10:53:58 UTC

Created attachment 240787 [details]
Output of lshw on Dell Vostro 1510

Comment 10 Hugo Mildenberger 2010-08-02 14:02:06 UTC

Using kgdb via kgdboe, I managed to get a backtrace from near the point where the system finally dies.

(gdb) bt f
#0  native_apic_mem_write (reg=784, v=16777216) at /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h:102
No locals.
#1  0xffffffff8101a8e0 in apic_write (low=1552, id=<value optimized out>) at /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h:383
No locals.
#2  native_apic_icr_write (low=1552, id=<value optimized out>) at arch/x86/kernel/apic/apic.c:271
No locals.
#3  0xffffffff814730ff in apic_icr_write (apicid=1, cpu=1) at /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h:393
No locals.
#4  wakeup_secondary_cpu_via_init (apicid=1, cpu=1) at arch/x86/kernel/smpboot.c:636
        send_status = 1552
        accept_status = 18446744073709551615
        maxlvt = 5
        j = 1
#5  do_boot_cpu (apicid=1, cpu=1) at arch/x86/kernel/smpboot.c:806
        boot_error = <value optimized out>
        start_ip = 1552
        timeout = <value optimized out>
        c_idle = {work = {data = {counter = 0}, entry = {next = 0xffff88011abfbd30, prev = 0xffff88011abfbd30}, 
            func = 0xffffffff814734b0 <do_fork_idle>}, idle = 0xffff88013fa723a0, done = {done = 0, wait = {lock = {{rlock = {raw_lock = {
                      slock = 0}}}}, task_list = {next = 0xffff88011abfbd60, prev = 0xffff88011abfbd60}}}, cpu = 1}
#6  0xffffffff814733d0 in native_cpu_up (cpu=1) at arch/x86/kernel/smpboot.c:919
        apicid = 1
        flags = <value optimized out>
        err = <value optimized out>
        __func__ = "native_cpu_up"
#7  0xffffffff81474f54 in __cpu_up (cpu=1) at /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/smp.h:96
No locals.
#8  _cpu_up (cpu=1) at kernel/cpu.c:317
        ret = -2124205856
        nr_calls = 38
        hcpu = 0x1
#9  cpu_up (cpu=1) at kernel/cpu.c:356
        err = <value optimized out>
#10 0xffffffff81467978 in store_online (dev=0xffff880001d04428, attr=0x1000000, buf=<value optimized out>, count=2) at drivers/base/cpu.c:50
        cpu = 0x0
        ret = -2124205856
#11 0xffffffff812a444b in sysdev_store (kobj=<value optimized out>, attr=0x1000000, buffer=0x1ff726 <Address 0x1ff726 out of bounds>, count=0)
    at drivers/base/sys.c:52
No locals.
#12 0xffffffff81120fa5 in flush_write_buffer (file=<value optimized out>, buf=<value optimized out>, count=<value optimized out>, 
    ppos=<value optimized out>) at fs/sysfs/file.c:209
        attr_sd = 0xffff88013fa5dca8
        kobj = 0xffff880001d04438
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) step
103     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
102     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
105     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
108     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
native_apic_mem_write (reg=768, v=1552) at /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h:102
102     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
103     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
102     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
105     in /usr/src/linux-2.6.34-hardened-r1/arch/x86/include/asm/apic.h
(gdb) 
Ignoring packet error, continuing...
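The decimal arguments gdb printed for `native_apic_mem_write` can be decoded by hand. The sketch below assumes the standard xAPIC layout (offset 0x300 is the low dword of the Interrupt Command Register, 0x310 the high dword); the function name is hypothetical and the interpretation is a reading of the trace, not something stated by the commenters.

```shell
#!/bin/sh
# Sketch: decode the (reg, v) pairs from the backtrace above.
# decode_apic_write is a hypothetical helper name.
decode_apic_write() {
    # $1 = register offset, $2 = value, both decimal as printed by gdb
    reg="$1"
    val="$2"
    case "$reg" in
        768)  # 0x300: ICR low dword (delivery mode in bits 8-10, vector in bits 0-7)
            printf 'ICR low: delivery_mode=%d vector=0x%02x\n' \
                $(( (val >> 8) & 7 )) $(( val & 255 )) ;;
        784)  # 0x310: ICR high dword (destination APIC ID in bits 24-31)
            printf 'ICR high: dest_apic_id=%d\n' $(( (val >> 24) & 255 )) ;;
        *)
            printf 'reg=0x%x val=0x%x\n' "$reg" "$val" ;;
    esac
}
```

`decode_apic_write 784 16777216` reports destination APIC ID 1, and `decode_apic_write 768 1552` reports delivery mode 6 (Start-up IPI) with vector 0x10. Under that reading, the last write the debugger observed is the start-up IPI sent to CPU 1, which is consistent with the system dying once the secondary core begins executing.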

Comment 11 Hugo Mildenberger 2010-08-05 14:55:14 UTC

Non-hardened 2.6.34-gentoo-r1, apart from missing grsec features identically configured, is unaffected:

$ uname -a
Linux localhost 2.6.34-gentoo-r1 #1 SMP Thu Aug 5 16:33:26 CEST 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz GenuineIntel GNU/Linux


[  209.070052] CPU 1 is now offline
[  209.070058] SMP alternatives: switching to UP code
[  255.720638] SMP alternatives: switching to SMP code
[  255.731461] Booting Node 0 Processor 1 APIC 0x1
[  255.720301] CPU1: Thermal monitoring handled by SMI

Comment 12 Anthony Basile gentoo-dev 2010-08-05 20:19:02 UTC

(In reply to comment #11)
> Non-hardened 2.6.34-gentoo-r1, apart from missing grsec features identically
> configured, is unaffected:
> 

I have confirmed your findings above. Upstream suspects that this is due to some new PaX code in SMP which was introduced in 2.6.34.

Comment 13 Anthony Basile gentoo-dev 2010-08-23 04:02:10 UTC

I'm looking at this again and found that the problem persists in hardened-sources-2.6.34-r2.

The bug is clearly introduced by PaX.  CONFIG_PAX_KERNEXEC=y triggers the problem.  But what I haven't been able to figure out is the connection between kernel page protection and your backtrace showing where the problem is hit.
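To see whether a running kernel was built with the option named above, the compressed config can be searched directly. This sketch assumes the kernel exposes `/proc/config.gz` (as in the attachment to this bug); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: test whether a kernel config option is built in (=y).
# kernel_config_has is a hypothetical helper name.
kernel_config_has() {
    # $1 = option name, $2 = gzip-compressed config (default /proc/config.gz)
    opt="$1"
    cfg="${2:-/proc/config.gz}"
    gzip -dc "$cfg" | grep -q "^$opt=y\$"
}
```

For example, `kernel_config_has CONFIG_PAX_KERNEXEC && echo "KERNEXEC enabled"`.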

Comment 14 Anthony Basile gentoo-dev 2010-09-29 22:22:29 UTC

I tested the latest hardened-sources-2.6.35 and this problem persists.  I'm going to pass this one by upstream again.

Comment 15 PaX Team 2010-09-29 22:44:54 UTC

can someone tell me if i386 is affected as well (seems to work here) or only amd64?

Comment 16 Anthony Basile gentoo-dev 2010-09-29 22:54:31 UTC

(In reply to comment #15)
> can someone tell me if i386 is affected as well (seems to work here) or only
> amd64?
> 

I tested on i386 and amd64 running on identical hardware.  This problem *only* affects the amd64 system.

Comment 17 Hugo Mildenberger 2010-10-18 09:55:30 UTC

Bug is still present using sys-kernel/hardened-sources-2.6.35-r2 on amd64. 

> The bug is clearly introduced by PaX.  CONFIG_PAX_KERNEXEC=y 
> triggers the problem.  But what I haven't been able to figure 
> out is the connection between kernel page protection and 
> your backtrace showing where the problem is hit.

As I understand it, the problem is hit after an APIC I/O operation completed, presumably one which is finally restarting CPU1. Hence, without having an ICE/debugger available one probably can't come much closer.

Comment 18 PaX Team 2010-10-18 19:56:09 UTC

(In reply to comment #17)
> As I understand it, the problem is hit after an APIC I/O operation completed,
> presumably one which is finally restarting CPU1. Hence, without having an
> ICE/debugger available one probably can't come much closer.

actually if you try this under qemu, you'll see that the problem is with the initial page tables that somehow get broken by the time the CPU is woken up (even though the very same page tables worked fine during init and nothing should have modified them since). i'm still trying to understand the root cause, i'll let you know when i figured it out.
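A possible way to set up such a qemu run is sketched below. Everything specific here is an assumption rather than the setup used in this comment: the function name, the `DRY_RUN` switch, the memory size, the file names and the choice of logging options. `-no-reboot` makes qemu exit instead of resetting the guest, and `-d int,cpu_reset` writes interrupts, exceptions and CPU resets to the log file given with `-D`.

```shell
#!/bin/sh
# Sketch: boot a test kernel under qemu with two CPUs and fault logging.
# run_qemu_smp and DRY_RUN are hypothetical; the paths are placeholders.
run_qemu_smp() {
    # $1 = kernel image (bzImage), $2 = initramfs image
    kernel="$1"
    initrd="$2"
    set -- qemu-system-x86_64 -smp 2 -m 512 \
        -kernel "$kernel" -initrd "$initrd" \
        -append "console=ttyS0" -nographic \
        -no-reboot -d int,cpu_reset -D qemu.log
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"      # only print the command line
    else
        "$@"           # actually start the guest
    fi
}
```

Inside the guest, taking cpu1 offline and online again via sysfs should then leave the faulting state in `qemu.log` instead of silently resetting.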

Comment 19 PaX Team 2010-11-12 00:30:16 UTC

(In reply to comment #18)
> i'm still trying to understand the root cause, i'll let you know when i figured it out.

the latest patches (both PaX and grsec) should fix this problem.

Comment 20 Anthony Basile gentoo-dev 2010-11-14 17:53:33 UTC

(In reply to comment #19)
> (In reply to comment #18)
> > i'm still trying to understand the root cause, i'll let you know when i figured it out.
> 
> the latest patches (both PaX and grsec) should fix this problem.
> 

Confirmed.  The fix will be in

    hardened-sources-2.6.32-r26
    hardened-sources-2.6.36-r1

which will hit the tree this afternoon.  When one of these (or above) is stabilized, I'll close this bug.

Thanks pipacs!

Comment 21 Anthony Basile gentoo-dev 2010-12-20 23:48:03 UTC

Just stabilized hardened-sources-2.6.32-r31.ebuild and hardened-sources-2.6.36-r6.ebuild which include the fix.  Closing.