Summary: sys-kernel/gentoo-sources-3.0.6: Kernel oops then crash (via fsnotify_mark + auditd)
Product: Gentoo Linux
Reporter: Valentin Avram <valentin.avram>
Component: [OLD] Core system
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel>
Status: RESOLVED UPSTREAM
Severity: major
Priority: Normal
Version: 10.0
Hardware: x86
OS: Linux
Whiteboard:
Package list:
Runtime testing required: ---
Attachments:
Screenshot of crashed kernel
Config of running kernel
Screenshot of crashed debug kernel
Events timeframe
Syslog of the oops and warnings
Kernel 2.6.37 oops - 2 servers, same oops code
The 3.2.x kernel config used
Description
Valentin Avram
2011-11-03 10:46:13 UTC
Created attachment 291581 [details]
Config of running kernel
Attached the config of the 3.0.6-gentoo kernel.
Can you reproduce with a kernel with CONFIG_DEBUG_INFO=y and paste the oops here? Can you also enable CONFIG_DEBUG_LIST and tell me if it still occurs?

Hello.
I have recompiled the kernel starting from the config I already attached, enabling the 2 DEBUG options you specified. The differences between the "normal" kernel and the debug one are the following:
# diff config-3.0.6-gentoo-drbd-version3 config-3.0.6-gentoo-drbd-version3-debug | grep -v "is not set" | egrep '^<|^>'
> CONFIG_DEBUG_KERNEL=y
> CONFIG_SCHED_DEBUG=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_LIST=y
> CONFIG_DEBUG_RODATA=y
> CONFIG_DEBUG_RODATA_TEST=y
Booted the debug kernel, and then started the following in the command line:
while :; do /etc/init.d/auditd start ; sleep 10 ; /etc/init.d/auditd stop ; sleep 10 ; done
73 auditd starts, 72 auditd stops, 1 kernel oops and 59 kernel warnings later, the system became unresponsive, the console showing as in the screenshot i will attach (kernel_crash6.jpeg).
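The loop above runs until the machine hangs. A bounded variant that also records each completed cycle might look like the sketch below. The helper name restart_cycles, its arguments and the log file are assumptions made for illustration; they are not part of the original test.

```shell
# Sketch of a bounded auditd start/stop loop (hypothetical helper).
#   $1 = init script or command to call (e.g. /etc/init.d/auditd)
#   $2 = number of start/stop cycles
#   $3 = seconds to sleep after each start and each stop
#   $4 = file that receives one line per completed cycle
restart_cycles() {
    svc=$1
    cycles=$2
    delay=$3
    log=$4
    i=1
    while [ "$i" -le "$cycles" ]; do
        "$svc" start
        sleep "$delay"
        "$svc" stop
        sleep "$delay"
        # Record the cycle number and time so there is a record of how many
        # cycles completed before any hang.
        echo "cycle $i completed at $(date '+%Y-%m-%dT%H:%M:%S')" >> "$log"
        i=$((i + 1))
    done
}
```

For example, restart_cycles /etc/init.d/auditd 100 10 /root/auditd-cycles.log would mirror the 10 second sleeps used above (the log path is only an example).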
All the warnings (counted 59) are marked at:
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0()
All report this part:
list_add corruption. next->prev should be prev (c17b8ec0), but was POINTER1. (next=POINTER2).
POINTER1 is either (null) or one of several different values (some are preferred and show up a lot); POINTER2 has different values.
ALSO, what seems to point to the problem is the fact that all the traces go through fsnotify_destroy_mark. There are 60 of them (1 BUG + 59 WARNINGs).
I will also attach the timeframe of what happened (auditd start/stop, oops + warnings) as timeframe.txt, since for some reason the warnings did NOT happen on every auditd cycle. (Looks like a race condition somewhere?)
Also attached will be a file with the oops and all the warnings after it, with one auditd line before and after it. (I'm not sure the rsyslog timestamp is the same as auditd's timestamp, since sometimes on start the loading of the rules showed up in the log 5 or 10 seconds later, as if it started loading the rules when it received the kill signal; the same happened in reverse on stop.)
Hope all these logs will help. If there's anything more I can do to test, please tell me.
Thanks.
Created attachment 291637 [details]
Screenshot of crashed debug kernel
Created attachment 291641 [details]
Events timeframe
Grep-ing and sed-ing through the syslog to get all auditd start and stop events, the oops and the warnings.
It can be noticed the warnings did not happen every auditd cycle.
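The extraction described above could be done along these lines. This is only a sketch: auditd_timeline is a hypothetical name, and it assumes the rsyslog line layout seen in the logs pasted in this bug, with the timestamp in the first field.

```shell
# Sketch: reduce a syslog file to a timeline of auditd start/stop events,
# kernel oopses and kernel warnings (hypothetical helper).
#   $1 = syslog file to scan
auditd_timeline() {
    awk '
        /type=DAEMON_START/            { print $1, "auditd start"; next }
        /type=DAEMON_END/              { print $1, "auditd stop";  next }
        /BUG: unable to handle kernel/ { print $1, "oops";         next }
        /WARNING: at /                 { print $1, "warning";      next }
    ' "$1"
}
```

Piping the result through grep -c ' warning$' gives the number of warnings, which makes it easy to see which auditd cycles produced none.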
Created attachment 291645 [details]
Syslog of the oops and warnings
It can be noticed that there are two warning "types":
1. warning after the rules have been loaded (process "audit_prune_tre"(e?))
2. warning as soon as auditd starts (process "auditctl")
Any news, guys? I don't want to hurry things, but not being able to use auditd on servers with updated kernels is a frustrating problem. Is there anything else I can test or verify in order to help? Thx.

Update. I can confirm the problem exists at least since kernel 2.6.37 (gentoo-sources-2.6.37-r4 ebuild). Two more servers we have running this kernel are affected by the oopses triggered by auditd restarts, although I can't confirm the effects are the same over a long-running auditd restart cycle (those servers are critical and I can't risk crashing one just for testing purposes). I will attach the oops log from an auditd restart on those machines. If nobody answers in the next few days, I'll also post this issue to LKML; maybe somebody there will find some spare time to look into the matter. Thanks.

Created attachment 293613 [details]
Kernel 2.6.37 oops - 2 servers, same oops code
Both servers are Dell R610, both have the same Code in the oops data.
I just noticed that in the bug opening note I said there were 2 types of oops, but pasted the same oops twice (dumb me). Here is a copy-paste of the second type of oops:
2011-11-03T11:55:42.649341+02:00 SERVER_NAME auditd: type=DAEMON_END msg=audit(1320306837.541:4816): auditd normal halt, sending auid=0 pid=3714 subj= res=success
2011-11-03T11:55:42.649343+02:00 SERVER_NAME auditd: type=DAEMON_START msg=audit(1320314142.035:7415): auditd start, ver=2.1.3 format=raw kernel=3.0.6-gentoo-drbd-version3 auid=4294967295 pid=2083 res=success
2011-11-03T11:55:42.649345+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.160:2): auid=4294967295 ses=4294967295 op="add rule" key="etc-directory" list=4 res=1
2011-11-03T11:55:42.649348+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.160:3): auid=4294967295 ses=4294967295 op="add rule" key="sbin-directory" list=4 res=1
2011-11-03T11:55:42.649350+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.160:4): auid=4294967295 ses=4294967295 op="add rule" key="bin-directory" list=4 res=1
2011-11-03T11:55:42.649353+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:5): auid=4294967295 ses=4294967295 op="add rule" key="usr-sbin-directory" list=4 res=1
2011-11-03T11:55:42.649356+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:6): auid=4294967295 ses=4294967295 op="add rule" key="usr-bin-directory" list=4 res=1
2011-11-03T11:55:42.649358+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:7): auid=4294967295 ses=4294967295 op="add rule" key="skip-lib-rc" list=4 res=1
2011-11-03T11:55:42.649360+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:8): auid=4294967295 ses=4294967295 op="add rule" key="lib-directory" list=4 res=1
2011-11-03T11:55:42.649362+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:9): auid=4294967295 ses=4294967295 op="add rule" key="usr-lib-directory" list=4 res=1
2011-11-03T11:55:42.649364+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:10): auid=4294967295 ses=4294967295 op="add rule" key="excluded-syscalls" list=4 res=1
2011-11-03T11:55:42.649366+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:11): audit_backlog_limit=8192 old=64 auid=4294967295 ses=4294967295 res=1
2011-11-03T11:55:42.742869+02:00 SERVER_NAME kernel: BUG: unable to handle kernel NULL pointer dereference at 00000004
2011-11-03T11:55:42.742883+02:00 SERVER_NAME kernel: IP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:55:42.742888+02:00 SERVER_NAME kernel: *pdpt = 0000000000000000 *pde = f000def8f000def8
2011-11-03T11:55:42.742889+02:00 SERVER_NAME kernel: Oops: 0002 [#1] SMP
2011-11-03T11:55:42.742890+02:00 SERVER_NAME kernel:
2011-11-03T11:55:42.742892+02:00 SERVER_NAME kernel: Pid: 694, comm: fsnotify_mark Not tainted 3.0.6-gentoo-drbd-version3 #1 Dell Inc. PowerEdge R610/086HF8
2011-11-03T11:55:42.742893+02:00 SERVER_NAME kernel: EIP: 0060:[<c10f2f75>] EFLAGS: 00010212 CPU: 1
2011-11-03T11:55:42.742895+02:00 SERVER_NAME kernel: EIP is at fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:55:42.742896+02:00 SERVER_NAME kernel: EAX: f2d0cc88 EBX: f2725fa8 ECX: 00000000 EDX: f2d0ccc4
2011-11-03T11:55:42.742897+02:00 SERVER_NAME kernel: ESI: f2728000 EDI: ffffffc4 EBP: c1456380 ESP: f2725f90
2011-11-03T11:55:42.742900+02:00 SERVER_NAME kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
2011-11-03T11:55:42.742901+02:00 SERVER_NAME kernel: Process fsnotify_mark (pid: 694, ti=f2724000 task=f2728000 task.ti=f2724000)
2011-11-03T11:55:42.742902+02:00 SERVER_NAME kernel: Stack:
2011-11-03T11:55:42.742904+02:00 SERVER_NAME kernel: f2728000 00000000 f2728000 c10504f0 f2725fa0 f2725fa0 f2d0ccc4 f2d0ccc4
2011-11-03T11:55:42.742905+02:00 SERVER_NAME kernel: f2c47f68 00000000 c10f2ef0 00000000 c1050174 00000000 00000000 00000000
2011-11-03T11:55:42.742906+02:00 SERVER_NAME kernel: 00000000 f2725fd4 f2725fd4 00000000 c1050100 f2c47f68 c157b876 00000000
2011-11-03T11:55:42.742907+02:00 SERVER_NAME kernel: Call Trace:
2011-11-03T11:55:42.742908+02:00 SERVER_NAME kernel: [<c10504f0>] ? wake_up_bit+0x60/0x60
2011-11-03T11:55:42.742910+02:00 SERVER_NAME kernel: [<c10f2ef0>] ? fsnotify_set_mark_ignored_mask_locked+0x20/0x20
2011-11-03T11:55:42.742911+02:00 SERVER_NAME kernel: [<c1050174>] ? kthread+0x74/0x80
2011-11-03T11:55:42.742913+02:00 SERVER_NAME kernel: [<c1050100>] ? kthread_worker_fn+0x150/0x150
2011-11-03T11:55:42.742915+02:00 SERVER_NAME kernel: [<c157b876>] ? kernel_thread_helper+0x6/0xd
2011-11-03T11:55:42.742917+02:00 SERVER_NAME kernel: Code: c1 b8 f0 ba 8a c1 e8 bb 24 f6 ff 8b 54 24 18 8d 42 c4 39 da 8b 48 3c 74 32 8d 79 c4 eb 0a 90 8d b4 26 00 00 00 00 89 ef 8b 68 40
2011-11-03T11:55:42.742919+02:00 SERVER_NAME kernel: EIP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130 SS:ESP 0068:f2725f90
2011-11-03T11:55:42.742920+02:00 SERVER_NAME kernel: CR2: 0000000000000004
2011-11-03T11:55:42.742921+02:00 SERVER_NAME kernel: ---[ end trace 0cdac460a4b203e5 ]---
Hope this helps.

Hello.
It's been 2 months of silence since the last update on this bug. So far no fix, no comments, no nothing. I hope to be able to retry the crash on gentoo-sources-3.1.6; maybe the patches to fsnotify in the kernel in the meantime have fixed something. If not, and if it still crashes, it would be nice if somebody would at least have a look at this issue. Thx.

Hello.
Since it seems nobody has any spare time to have a look at this issue, I notified the audit developers in the meantime. Nobody had told them of this issue. Also, since I managed to get a bit of spare time and a spare server, I tested Gentoo's latest stable gentoo-sources-3.2.1-r2 with audit-2.1.3-r1 and the results are:
1. 3.2.1-gentoo-r2 does not have any Gentoo-specific patch to fix the oops triggered via auditd (which soon after crashes the machine completely) - SuSE bug: 689860 ( https://bugzilla.novell.com/show_bug.cgi?id=689860 ) - officially fixed in kernel.org's 3.2.2. But this is not the issue here.
2. 3.2.1-gentoo-r2 still gives the original oops (of the crashed kernel thread fsnotify_mark), after which it pours debug events of list_add corruption. Also, kernel.org's 3.2.9 (released yesterday) behaves the same.
The only good thing I noticed on 3.x (x>0) kernels, compared with 3.0.6 (on which I first noticed the issue), is that the CPU stall no longer happens. Or maybe I didn't give it enough time to get there.
Anyway, I will:
- attach logs of the initial BUG and the list_add corruption messages.
- notify the audit list that the problems are still there. Maybe the RedHat guys will find and/or confirm the issue.
Will keep this bug updated as soon as more information becomes available.
Thank you for your time.

As promised, this is the BUG which shows up in 3.2.1-gentoo-r2:
kernel: [ 1200.790009] BUG: unable to handle kernel NULL pointer dereference at (null)
kernel: [ 1200.790176] IP: [<c12379d0>] __list_del_entry+0x20/0xe0
kernel: [ 1200.790268] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
kernel: [ 1200.790357] Oops: 0000 [#1] SMP
kernel: [ 1200.790441]
kernel: [ 1200.790519] Pid: 642, comm: fsnotify_mark Not tainted 3.2.1-gentoo-r2-drbd-version3 #2 Dell Inc. PowerEdge 2950/0CX396
kernel: [ 1200.790690] EIP: 0060:[<c12379d0>] EFLAGS: 00010287 CPU: 6
kernel: [ 1200.790775] EIP is at __list_del_entry+0x20/0xe0
kernel: [ 1200.790858] EAX: f4d49ec4 EBX: f47d3fa4 ECX: 00000000 EDX: 00000000
kernel: [ 1200.790945] ESI: f4d49ec4 EDI: f4d49e88 EBP: f47d3f7c ESP: f47d3f64
kernel: [ 1200.791031] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
kernel: [ 1200.791116] Process fsnotify_mark (pid: 642, ti=f47d2000 task=f447fc00 task.ti=f47d2000)
kernel: [ 1200.791275] Stack:
kernel: [ 1200.791352] c10811d0 f47d3fa4 f447fc00 f3ca4e88 f47d3f7c f47d3fa4 f47d3fb8 c10f6636
kernel: [ 1200.791525] ffffffc4 f447fc00 f447fc00 00000000 f447fc00 c1052f90 f47d3f9c f47d3f9c
kernel: [ 1200.791698] f4d49ec4 f4d49ec4 f4c47f58 00000000 c10f65b0 f47d3fe4 c1052704 00000000
kernel: [ 1200.791870] Call Trace:
kernel: [ 1200.791953] [<c10811d0>] ? rcu_check_callbacks+0x110/0x110
kernel: [ 1200.792039] [<c10f6636>] fsnotify_mark_destroy+0x86/0x120
kernel: [ 1200.792126] [<c1052f90>] ? abort_exclusive_wait+0x80/0x80
kernel: [ 1200.792211] [<c10f65b0>] ? fsnotify_put_mark+0x30/0x30
kernel: [ 1200.792295] [<c1052704>] kthread+0x74/0x80
kernel: [ 1200.792379] [<c1052690>] ? kthread_flush_work_fn+0x10/0x10
kernel: [ 1200.792466] [<c1581eb6>] kernel_thread_helper+0x6/0xd
kernel: [ 1200.792550] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
kernel: [ 1200.792929] EIP: [<c12379d0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47d3f64
kernel: [ 1200.793020] CR2: 0000000000000000
kernel: [ 1200.793442] ---[ end trace b824ee2095d496c7 ]---
The BUG that shows up in kernel.org's 3.2.9 is the following:
kernel: [ 301.240011] BUG: unable to handle kernel NULL pointer dereference at (null)
kernel: [ 301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
kernel: [ 301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
kernel: [ 301.240698] Oops: 0000 [#1] SMP
kernel: [ 301.240910]
kernel: [ 301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
kernel: [ 301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
kernel: [ 301.241498] EIP is at __list_del_entry+0x20/0xe0
kernel: [ 301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
kernel: [ 301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
kernel: [ 301.241879] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
kernel: [ 301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
kernel: [ 301.242207] Stack:
kernel: [ 301.242327] c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
kernel: [ 301.242882] ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
kernel: [ 301.243438] f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
kernel: [ 301.243995] Call Trace:
kernel: [ 301.244119] [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
kernel: [ 301.244248] [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
kernel: [ 301.244377] [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
kernel: [ 301.244504] [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
kernel: [ 301.244631] [<c1052834>] kthread+0x74/0x80
kernel: [ 301.244756] [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
kernel: [ 301.244885] [<c1582ab6>] kernel_thread_helper+0x6/0xd
kernel: [ 301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
kernel: [ 301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
kernel: [ 301.248414] CR2: 0000000000000000
kernel: [ 301.248538] ---[ end trace 15082dbfb353f84c ]---
So it's basically the same. In both cases, the kernel thread fsnotify_mark crashes.
If need be, I can add the list_add corruption warnings the kernel logs after the BUG, but all of them are from 3.2.9, not 3.2.1-gentoo-r2. Just ask if they would be useful.

Created attachment 304275 [details]
The 3.2.x kernel config used
This is the 3.2.9 kernel config used to generate the BUG and the list_add corruption messages. I diff'ed it against the config used for 3.2.1-gentoo-r2 and they are identical except for the kernel version in the header.
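A comparison like this can be scripted so that comment lines (the version header and "is not set" entries) are ignored and only options that are actually set differently remain. The sketch below follows the same idea as the diff pipeline used earlier in this bug; config_set_diff is a hypothetical helper name, not an existing tool.

```shell
# Sketch: show only the set options that differ between two kernel configs.
#   $1 = old config file, $2 = new config file
config_set_diff() {
    old=$1
    new=$2
    # Keep diff's "<" / ">" lines, then drop every commented-out line
    # (version header, "# CONFIG_FOO is not set").
    diff "$old" "$new" | grep -E '^[<>]' | grep -v '^[<>] #'
}
```

An empty result means the two files differ only in comments, which is what one would expect for two configs that are identical apart from the version header.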
Please take this issue upstream at http://bugzilla.kernel.org and post the URL back here.

Reported to bugzilla.kernel.org: https://bugzilla.kernel.org/show_bug.cgi?id=42882
Reported to LKML: https://lkml.org/lkml/2012/3/13/200
Reported to audit/redhat: https://www.redhat.com/archives/linux-audit/2012-March/msg00004.html
No answer from anywhere yet, still waiting for someone to notice. Thanks.

We'll watch the upstream bug and work to backport any patches identified.