Bug 298987

Summary:	GENERATE_KEYMAP produces oops in kernel 2.6.3[12] when attempting to start hald
Product:	Gentoo Linux	Reporter:	ta2002 <throw_away_2002>
Component:	[OLD] Core system	Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel>
Status:	RESOLVED UPSTREAM
Severity:	normal
Priority:	High
Version:	10.1
Hardware:	All
OS:	Linux
URL:	http://forums.gentoo.org/viewtopic-p-6113574.html
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:
	.config file used
	compilation output without the patch
	compilation output with the patch
	kernel messages without the patch
	kernel message with the patch (oops occurs when hald attempts to start)
	Shipped keymap
	Generated keymap with NO changes to defkeymap.map

Description ta2002 2009-12-30 15:20:02 UTC

I have literally wasted hundreds of hours on this.

This kernel produces an oops immediately when attempting to start hald. The latest stable vanilla-sources (2.6.31.6) does not.

I will answer all questions and spend whatever additional time it takes to get rid of this bug. Most of the details are in the given forum thread (obviously, only posts made within the last week or so are relevant).

Comment 1 Rafał Mużyło 2009-12-30 16:11:28 UTC

Two guesses, both related to your CFLAGS,
either you've got your processor wrong
or that particular combination triggers a compiler bug.

Comment 2 ta2002 2009-12-30 21:46:50 UTC

(In reply to comment #1)
> Two guesses, both related to your CFLAGS,
> either you've got your processor wrong
> or that particular combination triggers a compiler bug.

I doubt that (although I certainly don't want to say "that's impossible").

From the forum thread (where I have posted "emerge --info", cpuinfo, and various .configs):

CFLAGS="-march=pentium3 -Os -pipe -fomit-frame-pointer -mfpmath=sse"

$ dog /proc/cpuinfo 
processor       : 0 
vendor_id       : GenuineIntel 
cpu family      : 6 
model           : 11 
model name      : Intel(R) Pentium(R) III Mobile CPU      1000MHz 
stepping        : 1 
cpu MHz         : 1000.000 
cache size      : 512 KB 
fdiv_bug        : no 
hlt_bug         : no 
f00f_bug        : no 
coma_bug        : no 
fpu             : yes 
fpu_exception   : yes 
cpuid level     : 2 
wp              : yes 
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse 
bogomips        : 1998.25 
clflush size    : 32 
power management:

That hardly seems like such a rare combination that it wouldn't have been found before.

Any suggestions on how to proceed? I assume I am going to have to isolate which patch is doing this.

Comment 3 ta2002 2009-12-31 03:32:57 UTC

I have tracked down the cause of this, though I don't really have any idea how to fix it.

This comes from a two-character patch I have been applying since 2.6.4 on over a dozen machines. I have never seen anything like this happen before on any of the machines.

The patch:

--- linux-2.6.31/drivers/char/Makefile  2009-09-09 22:13:59.000000000 +0000
+++ linux-2.6.31/drivers/char/Makefile  2009-12-01 00:00:00.000000000 +0000
@@ -125,7 +125,7 @@
 # Uncomment if you're changing the keymap and have an appropriate
 # loadkeys version for the map. By default, we'll use the shipped
 # versions.
-# GENERATE_KEYMAP := 1
+GENERATE_KEYMAP := 1

 ifdef GENERATE_KEYMAP

Without the patch, 2.6.31.6 runs without errors, and a kernel compiled with the patch produces the output shown in the files I will attach. I will also attach the .config I used, and the compilation output.

Please advise on how to continue. It is important for me to get this fixed.

Comment 4 ta2002 2009-12-31 03:33:35 UTC

Created attachment 214685 [details]
.config file used

Comment 5 ta2002 2009-12-31 03:34:09 UTC

Created attachment 214687 [details]
compilation output without the patch

Comment 6 ta2002 2009-12-31 03:34:39 UTC

Created attachment 214688 [details]
compilation output with the patch

Comment 7 ta2002 2009-12-31 03:35:08 UTC

Created attachment 214690 [details]
kernel messages without the patch

Comment 8 ta2002 2009-12-31 03:35:58 UTC

Created attachment 214691 [details]
kernel message with the patch (oops occurs when hald attempts to start)

Comment 9 Wormo (RETIRED) gentoo-dev 2010-01-01 19:32:00 UTC

Since you have tracked it down to a consequence of the custom keymap, how about attaching your generated defkeymap.c and also a diff against the shipped version?

Comment 10 ta2002 2010-01-06 13:42:58 UTC

(In reply to comment #9)
> Since you have tracked it down to a consequence of the custom keymap, how about
> attaching your generated defkeymap.c and also a diff against the shipped
> version?

I am guessing that the problem comes from the fact that loadkeys is generating a defkeymap.c in a different format (and a much larger file) than the shipped one. To test this, I have generated a keymap from the default with no changes to the keyboard layout (I will attach these files). This still produced the oops. Editing the "shipped" version directly and using that (without generating a keymap) produces a kernel that does not cause the problem.

Comment 11 ta2002 2010-01-06 13:43:56 UTC

Created attachment 215384 [details]
Shipped keymap

Comment 12 ta2002 2010-01-06 13:44:58 UTC

Created attachment 215386 [details]
Generated keymap with NO changes to defkeymap.map

Comment 13 Wormo (RETIRED) gentoo-dev 2010-01-06 20:23:27 UTC

The only real difference I see is that the arrays from loadkeys are filled up with 0xf200 where the shipped version allows the unused space at the end of the arrays to be padded out with 0's.

I can think of one more key piece of information needed to report this issue upstream: the oops output from running vanilla kernel with 'CONFIG_DEBUG_INFO=y' and using your generated keymap.
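
The padding difference described above can be checked mechanically. The following is a minimal Python sketch, not the method actually used in this comment. It assumes each map is declared in the style `u_short NAME[NR_KEYS] = { ... };` with hexadecimal entries, and it uses two tiny made-up arrays with a hypothetical NR_KEYS of 8 rather than the attached files. C zero-fills the elements that a shorter initializer list leaves out, and the sketch mimics that before comparing.

```python
import re

NR_KEYS = 8  # hypothetical size for this demo, not the kernel's real value

# Made-up fragments in the style of defkeymap.c, for illustration only.
SHIPPED = "u_short plain_map[NR_KEYS] = { 0xf200, 0xf01b, 0xf031, 0xf032, };"
GENERATED = ("u_short plain_map[NR_KEYS] = { 0xf200, 0xf01b, 0xf031, 0xf032, "
             "0xf200, 0xf200, 0xf200, 0xf200, };")


def parse_map(source, size=NR_KEYS):
    """Return the array values, zero-padded the way C pads a short initializer."""
    body = re.search(r"\{(.*?)\}", source, re.DOTALL).group(1)
    values = [int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]+", body)]
    if len(values) > size:
        raise ValueError("more initializers than array elements")
    return values + [0] * (size - len(values))


def diff_maps(a, b):
    """List (index, value_a, value_b) for every position where the maps differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


for index, old, new in diff_maps(parse_map(SHIPPED), parse_map(GENERATED)):
    print(f"[{index}] shipped={old:#06x} generated={new:#06x}")
```

With these sample arrays, only the trailing four slots differ: 0 in the short initializer versus 0xf200 in the fully written one.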

Comment 14 ta2002 2010-01-07 11:36:08 UTC

(In reply to comment #13)
> The only real difference I see is that the arrays from loadkeys are
> filled up with 0xf200 where the shipped version allows the unused 
> space at the end of the arrays to be padded out with 0's.

OK. Well, I admit I am pretty much in over my head on this.

> I can think of one more key piece of information needed to report this
> issue upstream: the oops output from running vanilla kernel with
> 'CONFIG_DEBUG_INFO=y' and using your generated keymap.

I set that (what a space hog to compile - over 1.1 GB), and generated the keymap from the default defkeymap.map (again, I usually make changes before generating the keymap, but I want to demonstrate that this bug has nothing to do with those changes).

These are the lines from /var/log/messages:


Jan  7 11:25:33 system kernel: BUG: unable to handle kernel NULL pointer dereference at 000000a2
Jan  7 11:25:33 system kernel: IP: [<c10e2a2a>] strlen+0x8/0x11
Jan  7 11:25:33 system kernel: *pde = 00000000
Jan  7 11:25:33 system kernel: Oops: 0000 [#1]
Jan  7 11:25:33 system kernel: last sysfs file: /sys/devices/virtual/misc/nvram/uevent
Jan  7 11:25:33 system kernel:
Jan  7 11:25:33 system kernel: Pid: 2360, comm: udevadm Not tainted (2.6.31.6 #2) 26472TA
Jan  7 11:25:33 system kernel: EIP: 0060:[<c10e2a2a>] EFLAGS: 00010246 CPU: 0
Jan  7 11:25:33 system kernel: EIP is at strlen+0x8/0x11
Jan  7 11:25:33 system kernel: EAX: 00000000 EBX: 00000000 ECX: ffffffff EDX: 000000d0
Jan  7 11:25:33 system kernel: ESI: 000000a2 EDI: 000000a2 EBP: efaad908 ESP: eec7dee4
Jan  7 11:25:33 system kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Jan  7 11:25:33 system kernel: Process udevadm (pid: 2360, ti=eec7c000 task=efb97130 task.ti=eec7c000)
Jan  7 11:25:33 system kernel: Stack:
Jan  7 11:25:33 system kernel: 000000d0 c104992b 00000000 eec7df24 efaad900 c1149cc7 efaad900 eee90000
Jan  7 11:25:33 system kernel: <0> eee90000 c1149d57 eee90000 c1362a93 00000090 eee90000 c1362a8a 0000000a
Jan  7 11:25:33 system kernel: <0> 00000000 c13b9ba8 00000000 c1149ebe eec78000 fffffffb c13b9bd4 c1149e5a
Jan  7 11:25:33 system kernel: Call Trace:
Jan  7 11:25:33 system kernel: [<c104992b>] ? kstrdup+0x14/0x41
Jan  7 11:25:33 system kernel: [<c1149cc7>] ? device_get_nodename+0x3c/0x89
Jan  7 11:25:33 system kernel: [<c1149d57>] ? dev_uevent+0x43/0xdc
Jan  7 11:25:33 system kernel: [<c1149ebe>] ? show_uevent+0x64/0xa5
Jan  7 11:25:33 system kernel: [<c1149e5a>] ? show_uevent+0x0/0xa5
Jan  7 11:25:33 system kernel: [<c1149b79>] ? dev_attr_show+0x16/0x32
Jan  7 11:25:33 system kernel: [<c108d61b>] ? sysfs_read_file+0x8b/0xea
Jan  7 11:25:33 system kernel: [<c108d590>] ? sysfs_read_file+0x0/0xea
Jan  7 11:25:33 system kernel: [<c105bf9a>] ? vfs_read+0x81/0x102
Jan  7 11:25:33 system kernel: [<c105c0b3>] ? sys_read+0x3c/0x63
Jan  7 11:25:33 system kernel: [<c1002728>] ? sysenter_do_call+0x12/0x26
Jan  7 11:25:33 system kernel: Code: eb 04 19 c0 0c 01 5e 5f c3 56 89 c6 89 d0 88 c4 ac 38 e0 74 09 84 c0 75 f7 be 01 00 00 00 89 f0 48 5e c3 57 83 c9 ff 89 c7 31 c0 <f2> ae f7 d1 49 89 c8 5f c3 57 31 ff 85 c9 74 0e 89 c7 89 d0 f2
Jan  7 11:25:33 system kernel: EIP: [<c10e2a2a>] strlen+0x8/0x11 SS:ESP 0068:eec7dee4
Jan  7 11:25:33 system kernel: CR2: 00000000000000a2
Jan  7 11:25:33 system kernel: ---[ end trace 423d5ba8ea6bd181 ]---

Let me know if you need anything else (I don't really see any additional information beyond what was there before).
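
For anyone comparing this trace with others, the key fields can be pulled out mechanically. This is a minimal Python sketch, assuming the syslog line format shown above and that all of the fields are present. `summarize_oops` is a hypothetical helper, and the sample is a handful of lines copied from this trace.

```python
import re

SAMPLE = """\
Jan  7 11:25:33 system kernel: BUG: unable to handle kernel NULL pointer dereference at 000000a2
Jan  7 11:25:33 system kernel: IP: [<c10e2a2a>] strlen+0x8/0x11
Jan  7 11:25:33 system kernel: Pid: 2360, comm: udevadm Not tainted (2.6.31.6 #2) 26472TA
Jan  7 11:25:33 system kernel: Call Trace:
Jan  7 11:25:33 system kernel: [<c104992b>] ? kstrdup+0x14/0x41
Jan  7 11:25:33 system kernel: [<c1149cc7>] ? device_get_nodename+0x3c/0x89
Jan  7 11:25:33 system kernel: [<c1149d57>] ? dev_uevent+0x43/0xdc
"""


def summarize_oops(log_text):
    """Extract fault address, faulting function, process and callers from an oops."""
    fault = re.search(r"dereference at ([0-9a-f]+)", log_text)
    ip = re.search(r"IP: \[<[0-9a-f]+>\] (\w+)\+0x", log_text)
    comm = re.search(r"comm: (\S+)", log_text)
    # Call trace entries in this log carry a "?" between address and symbol.
    callers = re.findall(r"\[<[0-9a-f]+>\] \? (\w+)\+0x", log_text)
    address = int(fault.group(1), 16)
    return {
        "address": address,
        # An address inside the first 4 KiB means the pointer that was read
        # was close to zero rather than a valid kernel address.
        "first_page": address < 4096,
        "function": ip.group(1),
        "process": comm.group(1),
        "callers": callers,
    }


print(summarize_oops(SAMPLE))
```

For the lines above, this reports a fault at 0xa2 inside strlen, called through kstrdup from device_get_nodename, in the udevadm process.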

Comment 15 Wormo (RETIRED) gentoo-dev 2010-01-10 00:49:57 UTC

OK, time to get this reported upstream, since you now have an oops from a vanilla kernel. Assigning to the kernel team, who can advise on opening a kernel bug.

Comment 16 Mike Pagano gentoo-dev 2010-01-15 00:24:27 UTC

Have you tested with gentoo-sources-2.6.32-r1?

Comment 17 ta2002 2010-01-15 07:49:49 UTC

(In reply to comment #16)
> Have you tested with gentoo-sources-2.6.32-r1?

No. I generally tend to run only "stable" packages. I am even a bit more leery of 2.6.32 than usual. I believe there are some significant changes in serial port handling that have the potential to break a lot of things.

Comment 18 Mike Pagano gentoo-dev 2010-02-04 17:59:07 UTC

I'd like to know whether this bug is fixed in later kernels. Please reopen if you can run this test and post the results.

Comment 19 ta2002 2010-03-24 09:26:48 UTC

Tested with latest stable kernels (gentoo-sources-2.6.31-r10 and vanilla-sources-2.6.31.12) with absolutely identical results.

Comment 20 Mike Pagano gentoo-dev 2010-04-08 23:16:54 UTC

Please test with gentoo-sources 2.6.32 and 2.6.33

Comment 21 ta2002 2010-04-10 11:53:39 UTC

(In reply to comment #20)
> Please test with gentoo-sources 2.6.32 and 2.6.33

Just tested with gentoo-sources 2.6.32-r7 (just marked stable a couple of days ago). No changes.

From /var/log/messages:

Apr 10 11:25:02 system kernel: BUG: unable to handle kernel NULL pointer dereference at 000000a2
Apr 10 11:25:02 system kernel: IP: [<c110500a>] strlen+0x8/0x11
Apr 10 11:25:02 system kernel: *pde = 00000000
Apr 10 11:25:02 system kernel: Oops: 0000 [#1]
Apr 10 11:25:02 system kernel: last sysfs file: /sys/devices/virtual/misc/hpet/uevent
Apr 10 11:25:02 system kernel:
Apr 10 11:25:02 system kernel: Pid: 2298, comm: udevadm Not tainted (2.6.32-gentoo-r7 #1) 26472TA
Apr 10 11:25:02 system kernel: EIP: 0060:[<c110500a>] EFLAGS: 00010246 CPU: 0
Apr 10 11:25:02 system kernel: EIP is at strlen+0x8/0x11
Apr 10 11:25:02 system kernel: EAX: 00000000 EBX: 00000000 ECX: ffffffff EDX: 000000d0
Apr 10 11:25:02 system kernel: ESI: 000000a2 EDI: 000000a2 EBP: ef363f26 ESP: ef363edc
Apr 10 11:25:02 system kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Apr 10 11:25:02 system kernel: Process udevadm (pid: 2298, ti=ef362000 task=ef074700 task.ti=ef362000)
Apr 10 11:25:02 system kernel: Stack:
Apr 10 11:25:02 system kernel: 000000d0 c104b843 00000000 ef363f20 ef9b4c00 c11860a3 ef9b4c00 ef1fb000
Apr 10 11:25:02 system kernel: <0> ef1fb000 ef9b4c08 c118613f ef1fb000 c13bc2a1 000000e4 ef1fb000 c13bc298
Apr 10 11:25:02 system kernel: <0> 0000000a 00000000 007c8280 c141a508 00000000 c11862c4 ef2f8000 fffffffb
Apr 10 11:25:02 system kernel: Call Trace:
Apr 10 11:25:02 system kernel: [<c104b843>] ? kstrdup+0x14/0x41
Apr 10 11:25:02 system kernel: [<c11860a3>] ? device_get_devnode+0x41/0x8f
Apr 10 11:25:02 system kernel: [<c118613f>] ? dev_uevent+0x4e/0x105
Apr 10 11:25:02 system kernel: [<c11862c4>] ? show_uevent+0x64/0xa5
Apr 10 11:25:02 system kernel: [<c1186260>] ? show_uevent+0x0/0xa5
Apr 10 11:25:02 system kernel: [<c1185f4d>] ? dev_attr_show+0x16/0x32
Apr 10 11:25:02 system kernel: [<c1091417>] ? sysfs_read_file+0x8b/0xea
Apr 10 11:25:02 system kernel: [<c109138c>] ? sysfs_read_file+0x0/0xea
Apr 10 11:25:02 system kernel: [<c105f069>] ? vfs_read+0x81/0x102
Apr 10 11:25:02 system kernel: [<c105f182>] ? sys_read+0x3c/0x63
Apr 10 11:25:02 system kernel: [<c10027a8>] ? sysenter_do_call+0x12/0x26
Apr 10 11:25:02 system kernel: Code: eb 04 19 c0 0c 01 5e 5f c3 56 89 c6 89 d0 88 c4 ac 38 e0 74 09 84 c0 75 f7 be 01 00 00 00 89 f0 48 5e c3 57 83 c9 ff 89 c7 31 c0 <f2> ae f7 d1 49 89 c8 5f c3 57 31 ff 85 c9 74 0e 89 c7 89 d0 f2
Apr 10 11:25:02 system kernel: EIP: [<c110500a>] strlen+0x8/0x11 SS:ESP 0068:ef363edc
Apr 10 11:25:02 system kernel: CR2: 00000000000000a2
Apr 10 11:25:02 system kernel: ---[ end trace d6fa32b3eb5107ce ]---
Apr 10 11:25:11 system kernel: BUG: unable to handle kernel NULL pointer dereference at 00000076
Apr 10 11:25:11 system kernel: IP: [<c111e425>] misc_open+0x35/0xb7
Apr 10 11:25:11 system kernel: *pde = 00000000
Apr 10 11:25:11 system kernel: Oops: 0000 [#2]
Apr 10 11:25:11 system kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/boot_vga
Apr 10 11:25:11 system kernel:
Apr 10 11:25:11 system kernel: Pid: 3627, comm: X Tainted: G      D    (2.6.32-gentoo-r7 #1) 26472TA
Apr 10 11:25:11 system kernel: EIP: 0060:[<c111e425>] EFLAGS: 00213212 CPU: 0
Apr 10 11:25:11 system kernel: EIP is at misc_open+0x35/0xb7
Apr 10 11:25:11 system kernel: EAX: 0000006a EBX: 0000003f ECX: c111e3f0 EDX: 00000076
Apr 10 11:25:11 system kernel: ESI: ef39eb80 EDI: 00000000 EBP: ef1c5adc ESP: ef10fe7c
Apr 10 11:25:11 system kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Apr 10 11:25:11 system kernel: Process X (pid: 3627, ti=ef10e000 task=ef35ee00 task.ti=ef10e000)
Apr 10 11:25:11 system kernel: Stack:
Apr 10 11:25:11 system kernel: 00000000 ef837bc0 00000000 ef1c5adc c106064e ef39eb80 0000003f ef39eb80
Apr 10 11:25:11 system kernel: <0> ef1c5adc 00000000 c1060599 c105d424 efbf6b00 ef4c5600 ef39eb80 ef10ff00
Apr 10 11:25:11 system kernel: <0> ef10ff00 00000003 c105d591 ef39eb80 00000000 ef32da80 00000000 ef10ff00
Apr 10 11:25:11 system kernel: Call Trace:
Apr 10 11:25:11 system kernel: [<c106064e>] ? chrdev_open+0xb5/0xcb
Apr 10 11:25:11 system kernel: [<c1060599>] ? chrdev_open+0x0/0xcb
Apr 10 11:25:11 system kdm[3598]: X server died during startup
Apr 10 11:25:11 system kdm[3598]: X server for display :0 cannot be started, session disabled
Apr 10 11:25:11 system kernel: [<c105d424>] ? __dentry_open+0xd5/0x1b2
Apr 10 11:25:11 system kernel: [<c105d591>] ? nameidata_to_filp+0x28/0x3b
Apr 10 11:25:11 system kernel: [<c1066fbd>] ? do_filp_open+0x417/0x7b0
Apr 10 11:25:11 system kernel: [<c104dbee>] ? __do_fault+0x2e2/0x319
Apr 10 11:25:11 system kernel: [<c106d86c>] ? alloc_fd+0x49/0xab
Apr 10 11:25:11 system kernel: [<c105d224>] ? do_sys_open+0x48/0x114
Apr 10 11:25:11 system kernel: [<c105d334>] ? sys_open+0x1e/0x23
Apr 10 11:25:11 system kernel: [<c10027a8>] ? sysenter_do_call+0x12/0x26
Apr 10 11:25:11 system kernel: Code: 34 b8 b4 e1 40 c1 e8 13 00 1e 00 a1 c0 e1 40 c1 81 e3 ff ff 0f 00 83 e8 0c eb 10 39 18 75 09 8b 40 08 85 c0 75 51 eb 11 8d 42 f4 <8b> 50 0c 0f 18 02 90 3d b4 e1 40 c1 75 e2 b8 b4 e1 40 c1 e8 f1
Apr 10 11:25:11 system kernel: EIP: [<c111e425>] misc_open+0x35/0xb7 SS:ESP 0068:ef10fe7c
Apr 10 11:25:11 system kernel: CR2: 0000000000000076
Apr 10 11:25:11 system kernel: ---[ end trace d6fa32b3eb5107cf ]---

I must admit I am completely baffled as to why you think that this bug might somehow disappear in later kernels without anyone actually bothering to fix it.
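
The `Code:` line of the first oops can be lined up with `strlen+0x8`. This is a minimal Python sketch, assuming the usual oops convention that the byte in angle brackets is the one at the faulting EIP. `split_code_line` is a hypothetical helper, and the byte string is copied from the strlen oops in this comment.

```python
CODE = ("eb 04 19 c0 0c 01 5e 5f c3 56 89 c6 89 d0 88 c4 ac 38 e0 74 09 84 c0 "
        "75 f7 be 01 00 00 00 89 f0 48 5e c3 57 83 c9 ff 89 c7 31 c0 <f2> ae "
        "f7 d1 49 89 c8 5f c3 57 31 ff 85 c9 74 0e 89 c7 89 d0 f2")


def split_code_line(code, offset_in_function):
    """Split an oops Code: dump at the <marked> byte.

    Returns (bytes from the function start up to the fault,
             bytes from the faulting instruction onward).
    """
    tokens = code.split()
    marked = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    values = [int(tok.strip("<>"), 16) for tok in tokens]
    start = marked - offset_in_function
    if start < 0:
        raise ValueError("dump does not reach back to the function start")
    return values[start:marked], values[marked:]


before, at_fault = split_code_line(CODE, 0x8)  # 0x8 from "strlen+0x8/0x11"
print(" ".join(f"{b:02x}" for b in before))
print(" ".join(f"{b:02x}" for b in at_fault[:2]))
```

The eight bytes before the marker are the start of strlen, and the marked instruction begins with `f2 ae`. On x86 that encodes `repnz scasb`, which reads bytes starting at the address in EDI. That matches EDI and CR2 both being 000000a2 in the register dump.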

Comment 22 Mike Pagano gentoo-dev 2010-04-10 16:54:47 UTC

Please submit upstream at http://bugzilla.kernel.org and post url here.