Bug 389405

Summary:	sys-kernel/gentoo-sources-3.0.6: Kernel oops then crash (via fsnotify_mark + auditd)
Product:	Gentoo Linux	Reporter:	Valentin Avram <valentin.avram>
Component:	[OLD] Core system	Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel>
Status:	RESOLVED UPSTREAM
Severity:	major
Priority:	Normal
Version:	10.0
Hardware:	x86
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	Screenshot of crashed kernel; Config of running kernel; Screenshot of crashed debug kernel; Events timeframe; Syslog of the oops and warnings; Kernel 2.6.37 oops - 2 servers, same oops code; The 3.2.x kernel config used

Description Valentin Avram 2011-11-03 10:46:13 UTC

Created attachment 291579 [details]
Screenshot of crashed kernel

Hello.

We have a Dell R610 system (dual Xeon X5560 (6 core), 8 GB RAM, no swap) that randomly gives a kernel oops (NULL pointer dereference) in fsnotify_mark_destroy, usually at boot, at reboot, or when restarting auditd. "Stress" tests caused the kernel to more or less crash, reporting stalls on CPUs.

Rebooting the server multiple times without the auditd service never caused any oops. I guess this is happening because the in-kernel auditing never gets activated.

With the auditd service enabled in the default runlevel, ps shows the kernel thread [fsnotify_mark] before the oops. After the oops, the thread is gone. The oops is shown below (2 variants).

OOPS 1:
[snip]
2011-11-03T11:25:32+02:00 SERVER_NAME auditd[3113]: Init complete, auditd 2.1.3 listening for events (startup state enable)
2011-11-03T11:25:33.147032+02:00 SERVER_NAME kernel: BUG: unable to handle kernel NULL pointer dereference at 00000003
2011-11-03T11:25:33.147104+02:00 SERVER_NAME kernel: IP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:25:33.147107+02:00 SERVER_NAME kernel: *pdpt = 0000000000000000 *pde = f000def8f000def8
2011-11-03T11:25:33.147108+02:00 SERVER_NAME kernel: Oops: 0002 [#1] SMP
2011-11-03T11:25:33.147110+02:00 SERVER_NAME kernel:
2011-11-03T11:25:33.147112+02:00 SERVER_NAME kernel: Pid: 694, comm: fsnotify_mark Not tainted 3.0.6-gentoo-drbd-version3 #1 Dell Inc. PowerEdge R610/086HF8
2011-11-03T11:25:33.147117+02:00 SERVER_NAME kernel: EIP: 0060:[<c10f2f75>] EFLAGS: 00010297 CPU: 5
2011-11-03T11:25:33.147129+02:00 SERVER_NAME kernel: EIP is at fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:25:33.147131+02:00 SERVER_NAME kernel: EAX: f23f5708 EBX: f2725fa8 ECX: ffffffff EDX: f23f5744
2011-11-03T11:25:33.147132+02:00 SERVER_NAME kernel: ESI: f2728000 EDI: ffffffc3 EBP: 00000000 ESP: f2725f90
2011-11-03T11:25:33.147133+02:00 SERVER_NAME kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
2011-11-03T11:25:33.147135+02:00 SERVER_NAME kernel: Process fsnotify_mark (pid: 694, ti=f2724000 task=f2728000 task.ti=f2724000)
2011-11-03T11:25:33.147136+02:00 SERVER_NAME kernel: Stack:
2011-11-03T11:25:33.147137+02:00 SERVER_NAME kernel: f2728000 00000000 f2728000 c10504f0 f2725fa0 f2725fa0 f23f5744 f23f5744
2011-11-03T11:25:33.147138+02:00 SERVER_NAME kernel: f2c47f68 00000000 c10f2ef0 00000000 c1050174 00000000 00000000 00000000
2011-11-03T11:25:33.147140+02:00 SERVER_NAME kernel: 00000000 f2725fd4 f2725fd4 00000000 c1050100 f2c47f68 c157b876 00000000
2011-11-03T11:25:33.147140+02:00 SERVER_NAME kernel: Call Trace:
2011-11-03T11:25:33.147144+02:00 SERVER_NAME kernel: [<c10504f0>] ? wake_up_bit+0x60/0x60
2011-11-03T11:25:33.147146+02:00 SERVER_NAME kernel: [<c10f2ef0>] ? fsnotify_set_mark_ignored_mask_locked+0x20/0x20
2011-11-03T11:25:33.147147+02:00 SERVER_NAME kernel: [<c1050174>] ? kthread+0x74/0x80
2011-11-03T11:25:33.147149+02:00 SERVER_NAME kernel: [<c1050100>] ? kthread_worker_fn+0x150/0x150
2011-11-03T11:25:33.147150+02:00 SERVER_NAME kernel: [<c157b876>] ? kernel_thread_helper+0x6/0xd
2011-11-03T11:25:33.147152+02:00 SERVER_NAME kernel: Code: c1 b8 f0 ba 8a c1 e8 bb 24 f6 ff 8b 54 24 18 8d 42 c4 39 da 8b 48 3c 74 32 8d 79 c4 eb 0a 90 8d b4 26 00 00 00 00 89 ef 8b 68 40
2011-11-03T11:25:33.147154+02:00 SERVER_NAME kernel: EIP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130 SS:ESP 0068:f2725f90
2011-11-03T11:25:33.147155+02:00 SERVER_NAME kernel: CR2: 0000000000000003
2011-11-03T11:25:33.147156+02:00 SERVER_NAME kernel: ---[ end trace 55d88741d8e7a76a ]---
2011-11-03T11:25:33.578485+02:00 SERVER_NAME auditd: type=DAEMON_START msg=audit(1320312332.509:1866): auditd start, ver=2.1.3 format=raw kernel=3.0.6-gentoo-drbd-version3 auid=1001 pid=3113 res=success
[snip]


OOPS 2:
[snip]
2011-11-03T11:25:32+02:00 SERVER_NAME auditd[3113]: Init complete, auditd 2.1.3 listening for events (startup state enable)
2011-11-03T11:25:33.147032+02:00 SERVER_NAME kernel: BUG: unable to handle kernel NULL pointer dereference at 00000003
2011-11-03T11:25:33.147104+02:00 SERVER_NAME kernel: IP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:25:33.147107+02:00 SERVER_NAME kernel: *pdpt = 0000000000000000 *pde = f000def8f000def8 
2011-11-03T11:25:33.147108+02:00 SERVER_NAME kernel: Oops: 0002 [#1] SMP 
2011-11-03T11:25:33.147110+02:00 SERVER_NAME kernel: 
2011-11-03T11:25:33.147112+02:00 SERVER_NAME kernel: Pid: 694, comm: fsnotify_mark Not tainted 3.0.6-gentoo-drbd-version3 #1 Dell Inc. PowerEdge R610/086HF8
2011-11-03T11:25:33.147117+02:00 SERVER_NAME kernel: EIP: 0060:[<c10f2f75>] EFLAGS: 00010297 CPU: 5
2011-11-03T11:25:33.147129+02:00 SERVER_NAME kernel: EIP is at fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:25:33.147131+02:00 SERVER_NAME kernel: EAX: f23f5708 EBX: f2725fa8 ECX: ffffffff EDX: f23f5744
2011-11-03T11:25:33.147132+02:00 SERVER_NAME kernel: ESI: f2728000 EDI: ffffffc3 EBP: 00000000 ESP: f2725f90
2011-11-03T11:25:33.147133+02:00 SERVER_NAME kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
2011-11-03T11:25:33.147135+02:00 SERVER_NAME kernel: Process fsnotify_mark (pid: 694, ti=f2724000 task=f2728000 task.ti=f2724000)
2011-11-03T11:25:33.147136+02:00 SERVER_NAME kernel: Stack:
2011-11-03T11:25:33.147137+02:00 SERVER_NAME kernel: f2728000 00000000 f2728000 c10504f0 f2725fa0 f2725fa0 f23f5744 f23f5744
2011-11-03T11:25:33.147138+02:00 SERVER_NAME kernel: f2c47f68 00000000 c10f2ef0 00000000 c1050174 00000000 00000000 00000000
2011-11-03T11:25:33.147140+02:00 SERVER_NAME kernel: 00000000 f2725fd4 f2725fd4 00000000 c1050100 f2c47f68 c157b876 00000000
2011-11-03T11:25:33.147140+02:00 SERVER_NAME kernel: Call Trace:
2011-11-03T11:25:33.147144+02:00 SERVER_NAME kernel: [<c10504f0>] ? wake_up_bit+0x60/0x60
2011-11-03T11:25:33.147146+02:00 SERVER_NAME kernel: [<c10f2ef0>] ? fsnotify_set_mark_ignored_mask_locked+0x20/0x20
2011-11-03T11:25:33.147147+02:00 SERVER_NAME kernel: [<c1050174>] ? kthread+0x74/0x80
2011-11-03T11:25:33.147149+02:00 SERVER_NAME kernel: [<c1050100>] ? kthread_worker_fn+0x150/0x150
2011-11-03T11:25:33.147150+02:00 SERVER_NAME kernel: [<c157b876>] ? kernel_thread_helper+0x6/0xd
2011-11-03T11:25:33.147152+02:00 SERVER_NAME kernel: Code: c1 b8 f0 ba 8a c1 e8 bb 24 f6 ff 8b 54 24 18 8d 42 c4 39 da 8b 48 3c 74 32 8d 79 c4 eb 0a 90 8d b4 26 00 00 00 00 89 ef 8b 68 40 
2011-11-03T11:25:33.147154+02:00 SERVER_NAME kernel: EIP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130 SS:ESP 0068:f2725f90
2011-11-03T11:25:33.147155+02:00 SERVER_NAME kernel: CR2: 0000000000000003
2011-11-03T11:25:33.147156+02:00 SERVER_NAME kernel: ---[ end trace 55d88741d8e7a76a ]---
2011-11-03T11:25:33.578485+02:00 SERVER_NAME auditd: type=DAEMON_START msg=audit(1320312332.509:1866): auditd start, ver=2.1.3 format=raw kernel=3.0.6-gentoo-drbd-version3 auid=1001 pid=3113 res=success
[snip]

As you can see, both crash the kernel thread fsnotify_mark when the execution reaches fsnotify_mark_destroy.
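
As a reading aid, the fault address can be related to the register values in the dump. The sketch below is only an observation about the numbers: in both oops variants in this report (ECX=ffffffff with fault at 00000003, and ECX=00000000 with fault at 00000004), EDI equals ECX minus 0x3c and the fault address equals EDI plus 0x40, modulo 2^32. That pattern would be consistent with a bogus list pointer being converted to its containing structure and a field of that structure then being accessed, but the two offsets and all names in the code are assumptions inferred from the dumps, not taken from kernel source.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide on this x86 kernel

# Offsets inferred from the register values (assumption, not from kernel source).
ENTRY_OFFSET = 0x3C  # EDI == ECX - 0x3c in both dumps
FIELD_OFFSET = 0x40  # CR2 == EDI + 0x40 (wrapping at 32 bits) in both dumps


def predicted_fault(ecx):
    """Return (EDI, fault address) predicted from the pointer value held in ECX."""
    edi = (ecx - ENTRY_OFFSET) & MASK32
    fault = (edi + FIELD_OFFSET) & MASK32
    return edi, fault


# ECX values copied from the two oops variants in this report.
for ecx in (0xFFFFFFFF, 0x00000000):
    edi, fault = predicted_fault(ecx)
    print(f"ECX={ecx:08x} -> EDI={edi:08x}, fault at {fault:08x}")
```

If this reading is right, the "NULL pointer dereference" at address 3 or 4 is a small wrapped offset from an invalid pointer (all ones or NULL) rather than a direct access through NULL.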

Also, starting the system without the auditd service in the default runlevel and then repeating the sequence stop, sleep 10 sec, start, sleep 10 sec causes the oops. Most interestingly, after repeating that for a while the system became unresponsive, and the console showed the lines in the attached screenshot (sorry, I noticed too late to see the lines above, and Shift-PgUp did not work).

I will attach the config file for the kernel.

The audit ebuild is 2.1.3 from portage. The linux-headers package is 2.6.39. Auditd has been recompiled after switching to the new 3.0.6 kernel (/usr/src/linux points to 3.0.6)

We have to use kernel 3.0.6 because kernel 2.6.39 throws weird apic errors (see bug #387047 comment 2 - https://bugs.gentoo.org/show_bug.cgi?id=387047 ).

The server's profile is 10.0/server (we just need a bare profile without preactivated flags). emerge --info reports:
# emerge --info
Portage 2.1.10.11 (default/linux/x86/10.0/server, gcc-4.4.5, glibc-2.12.2-r0, 3.0.6-gentoo-drbd-version3 i686)
=================================================================
System uname: Linux-3.0.6-gentoo-drbd-version3-i686-Intel-R-_Xeon-R-_CPU_X5560_@_2.80GHz-with-gentoo-2.0.3
Timestamp of tree: Wed, 02 Nov 2011 00:45:01 +0000
app-shells/bash:          4.1_p9-r839::<unknown repository>
dev-lang/python:          2.7.2-r3, 3.1.4-r3
dev-util/cmake:           2.8.4-r1
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.0.3
sys-apps/openrc:          0.8.3-r1
sys-apps/sandbox:         2.4
sys-devel/autoconf:       2.68
sys-devel/automake:       1.11.1
sys-devel/binutils:       2.20.1-r1
sys-devel/gcc:            4.4.5
sys-devel/gcc-config:     1.4.1-r1
sys-devel/libtool:        2.4-r1
sys-devel/make:           3.82-r1
sys-kernel/linux-headers: 2.6.39 (virtual/os-headers)
sys-libs/glibc:           2.12.2
Repositories: gentoo x-portage
ACCEPT_KEYWORDS="x86"
ACCEPT_LICENSE="* -@EULA"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=core2 -pipe -msse4 -mcx16 -msahf -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -march=core2 -pipe -msse4 -mcx16 -msahf -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="assume-digests binpkg-logs distlocks ebuild-locks fixlafiles fixpackages news parallel-fetch protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch usersandbox"
FFLAGS=""
GENTOO_MIRRORS="http://10.5.1.237:8080 "
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j9"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://10.5.1.237/gentoo-portage"
USE="acl bashlogger berkdb bzip2 caps cli cracklib crypt cups cxx dri gdbm iconv modules mudflap ncurses nls nptl nptlonly openmp pam pcre pie pppd random readline session snmp ssl sysfs tcpd truetype unicode x86 xml xorg zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga neomagic nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LINGUAS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

If there is any other information I can provide, please ask.

Thank you for your time.

Comment 1 Valentin Avram 2011-11-03 10:49:00 UTC

Created attachment 291581 [details]
Config of running kernel

Attached the config of the 3.0.6-gentoo kernel.

Comment 2 Mike Pagano gentoo-dev 2011-11-03 20:03:28 UTC

Can you reproduce with a kernel with CONFIG_DEBUG_INFO=y and paste the oops here?

Comment 3 Mike Pagano gentoo-dev 2011-11-03 20:38:48 UTC

Can you also enable CONFIG_DEBUG_LIST and tell me if it still occurs?

Comment 4 Valentin Avram 2011-11-04 10:15:57 UTC

Hello.

I have recompiled the kernel starting from the config I already attached, enabling the 2 DEBUG options you specified. The differences between the "normal" kernel and the debug one are the following:
# diff config-3.0.6-gentoo-drbd-version3 config-3.0.6-gentoo-drbd-version3-debug | grep -v "is not set" | egrep '^<|^>'
> CONFIG_DEBUG_KERNEL=y
> CONFIG_SCHED_DEBUG=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_LIST=y
> CONFIG_DEBUG_RODATA=y
> CONFIG_DEBUG_RODATA_TEST=y
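
The same comparison can be sketched in Python for cases where the shell pipeline is not convenient. This is a minimal illustration, assuming the usual kernel config layout where enabled options appear as `CONFIG_NAME=value` and disabled ones as `# CONFIG_NAME is not set`; the function names are hypothetical.

```python
def enabled_options(config_text):
    """Return {option: value} for every CONFIG_ line that sets a value."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Comment lines such as "# CONFIG_FOO is not set" are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def added_options(base_text, debug_text):
    """Options set in the debug config that are absent or different in the base config."""
    base = enabled_options(base_text)
    debug = enabled_options(debug_text)
    return {name: value for name, value in debug.items() if base.get(name) != value}
```

Calling `added_options` with the contents of the two config files would be expected to list the same kind of entries as the `>` lines above.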

Booted the debug kernel, and then ran the following loop on the command line:
while :; do /etc/init.d/auditd start ; sleep 10 ; /etc/init.d/auditd stop ; sleep 10 ; done
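
For anyone repeating the test, a bounded variant of this loop that records failed start/stop commands may be easier to correlate with the logs. The sketch below is an illustration only: the function name and the `run` parameter are hypothetical, and the init script path is the one used in the loop above.

```python
import time


def cycle_service(run, cycles, pause=10.0, script="/etc/init.d/auditd"):
    """Start and stop the service `cycles` times; return the (cycle, action) pairs that failed."""
    failures = []
    for i in range(cycles):
        for action in ("start", "stop"):
            cmd = [script, action]
            # `run` takes the command list and returns its exit status.
            if run(cmd) != 0:
                failures.append((i, action))
            time.sleep(pause)
    return failures
```

Passing `subprocess.call` as `run` executes the real init script; a stand-in function can be passed instead to try the loop logic without touching auditd.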

73 auditd starts, 72 auditd stops, 1 kernel oops and 59 kernel warnings later, the system became unresponsive, with the console as shown in the screenshot I will attach (kernel_crash6.jpeg).

All the warnings (counted 59) are marked at:
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0()

All report this part:
list_add corruption. next->prev should be prev (c17b8ec0), but was POINTER1. (next=POINTER2).

POINTER1 is either (null) or one of several different values (some are preferred and show up a lot); POINTER2 has different values.
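
To see which POINTER1 values recur, the warnings can be tallied from the syslog. The sketch below assumes the message layout quoted above; the regular expression and function name are hypothetical and written to tolerate extra spaces or dots between the two halves of the message.

```python
import re
from collections import Counter

# Matches the list_add corruption message quoted in this comment.
LIST_ADD_RE = re.compile(
    r"list_add corruption\. next->prev should be prev \((?P<expected>[^)]+)\), "
    r"but was\s+(?P<found>\(null\)|[0-9a-f]+)\.[ .]*\(next=(?P<next>[^)]+)\)"
)


def tally_found_pointers(lines):
    """Count how often each 'but was' value (POINTER1) appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LIST_ADD_RE.search(line)
        if match:
            counts[match.group("found")] += 1
    return counts
```

Feeding it the lines of the attached syslog and calling `most_common()` on the result would show the preferred values first.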

ALSO, what seems to point to the problem is the fact that all the traces go through fsnotify_destroy_mark. There are 60 of them (1 BUG + 59 WARNINGs).

I will also attach the timeframe of what happened (auditd start/stop, oops + warnings), since for some reason the warnings did NOT happen on every auditd cycle (timeframe.txt). (Looks like a race condition somewhere?)

Also attached will be a file with the oops and all the warnings after it, plus one line of auditd output before and after. I'm not sure the rsyslog timestamp is the same as auditd's timestamp: sometimes on start, the loading of the rules showed up in the log 5 or 10 seconds later, as if it started loading the rules when it received the kill signal; the same happened, reversed, on stop.

Hope all these logs will help. If there's anything more I can do to test, please tell me.

Thanks.

Comment 5 Valentin Avram 2011-11-04 10:17:11 UTC

Created attachment 291637 [details]
Screenshot of crashed debug kernel

Comment 6 Valentin Avram 2011-11-04 10:20:07 UTC

Created attachment 291641 [details]
Events timeframe

Produced by grepping and sed-ing through the syslog to get all auditd start and stop events, the oops and the warnings.

Note that the warnings did not happen on every auditd cycle.
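
The exact grep and sed commands are not shown here, so the following is only a sketch of how such a timeframe could be rebuilt. It assumes the marker strings that appear in the logs pasted in this bug and that the timestamp is the first space-separated field of each syslog line; the names are hypothetical.

```python
# (label, substring) pairs; the substrings are taken from the logs in this bug.
MARKERS = [
    ("start", "type=DAEMON_START"),
    ("stop", "type=DAEMON_END"),
    ("oops", "BUG: unable to handle kernel NULL pointer dereference"),
    ("warning", "WARNING: at lib/list_debug.c"),
]


def classify(line):
    """Return the event label for a syslog line, or None if it matches no marker."""
    for label, marker in MARKERS:
        if marker in line:
            return label
    return None


def timeline(lines):
    """Return (timestamp, label) pairs for the lines that match a marker, in input order."""
    events = []
    for line in lines:
        label = classify(line)
        if label is not None:
            events.append((line.split(" ", 1)[0], label))
    return events
```

Printing the pairs in order makes it easy to see which start/stop cycles produced a warning and which did not.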

Comment 7 Valentin Avram 2011-11-04 10:43:40 UTC

Created attachment 291645 [details]
Syslog of the oops and warnings

Note that there are 2 warning "types":
1. warning after the rules have been loaded (process "audit_prune_tre"(e?))
2. warning as soon as auditd starts (process "auditctl")

Comment 8 Valentin Avram 2011-11-09 15:57:05 UTC

Any news guys? I don't want to hurry things, but not being able to use auditd on servers with updated kernels is a frustrating problem.

Anything else I can test or verify in order to help?

Thx.

Comment 9 Valentin Avram 2011-11-24 11:19:18 UTC

Update.

I can confirm the problem exists at least since kernel 2.6.37 (gentoo-sources-2.6.37-r4 ebuild).

Two other servers of ours running this kernel are affected by the oopses generated on auditd restart, although I can't confirm the effects are the same over a long-running auditd restart cycle (those servers are critical and I can't risk crashing one just for testing purposes).

I will attach the oops log from an auditd restart on those machines.

If nobody answers in the next few days, I'll also post this issue to LKML; maybe somebody there will find some spare time to look into the matter.

Thanks.

Comment 10 Valentin Avram 2011-11-24 11:30:32 UTC

Created attachment 293613 [details]
Kernel 2.6.37 oops - 2 servers, same oops code

Both servers are Dell R610, both have the same Code in the oops data.

Comment 11 Valentin Avram 2011-11-24 11:47:52 UTC

I just noticed that in the opening note of this bug I said there were 2 types of oops, but pasted the same oops twice (dumb me).

Here is a copy-paste of the second type of oops:
2011-11-03T11:55:42.649341+02:00 SERVER_NAME auditd: type=DAEMON_END msg=audit(1320306837.541:4816): auditd normal halt, sending auid=0 pid=3714 subj= res=success
2011-11-03T11:55:42.649343+02:00 SERVER_NAME auditd: type=DAEMON_START msg=audit(1320314142.035:7415): auditd start, ver=2.1.3 format=raw kernel=3.0.6-gentoo-drbd-version3 auid=4294967295 pid=2083 res=success
2011-11-03T11:55:42.649345+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.160:2): auid=4294967295 ses=4294967295 op="add rule" key="etc-directory" list=4 res=1
2011-11-03T11:55:42.649348+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.160:3): auid=4294967295 ses=4294967295 op="add rule" key="sbin-directory" list=4 res=1
2011-11-03T11:55:42.649350+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.160:4): auid=4294967295 ses=4294967295 op="add rule" key="bin-directory" list=4 res=1
2011-11-03T11:55:42.649353+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:5): auid=4294967295 ses=4294967295 op="add rule" key="usr-sbin-directory" list=4 res=1
2011-11-03T11:55:42.649356+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:6): auid=4294967295 ses=4294967295 op="add rule" key="usr-bin-directory" list=4 res=1
2011-11-03T11:55:42.649358+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:7): auid=4294967295 ses=4294967295 op="add rule" key="skip-lib-rc" list=4 res=1
2011-11-03T11:55:42.649360+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:8): auid=4294967295 ses=4294967295 op="add rule" key="lib-directory" list=4 res=1
2011-11-03T11:55:42.649362+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:9): auid=4294967295 ses=4294967295 op="add rule" key="usr-lib-directory" list=4 res=1
2011-11-03T11:55:42.649364+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:10): auid=4294967295 ses=4294967295 op="add rule" key="excluded-syscalls" list=4 res=1
2011-11-03T11:55:42.649366+02:00 SERVER_NAME auditd: type=CONFIG_CHANGE msg=audit(1320314142.170:11): audit_backlog_limit=8192 old=64 auid=4294967295 ses=4294967295 res=1
2011-11-03T11:55:42.742869+02:00 SERVER_NAME kernel: BUG: unable to handle kernel NULL pointer dereference at 00000004
2011-11-03T11:55:42.742883+02:00 SERVER_NAME kernel: IP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:55:42.742888+02:00 SERVER_NAME kernel: *pdpt = 0000000000000000 *pde = f000def8f000def8 
2011-11-03T11:55:42.742889+02:00 SERVER_NAME kernel: Oops: 0002 [#1] SMP 
2011-11-03T11:55:42.742890+02:00 SERVER_NAME kernel: 
2011-11-03T11:55:42.742892+02:00 SERVER_NAME kernel: Pid: 694, comm: fsnotify_mark Not tainted 3.0.6-gentoo-drbd-version3 #1 Dell Inc. PowerEdge R610/086HF8
2011-11-03T11:55:42.742893+02:00 SERVER_NAME kernel: EIP: 0060:[<c10f2f75>] EFLAGS: 00010212 CPU: 1
2011-11-03T11:55:42.742895+02:00 SERVER_NAME kernel: EIP is at fsnotify_mark_destroy+0x85/0x130
2011-11-03T11:55:42.742896+02:00 SERVER_NAME kernel: EAX: f2d0cc88 EBX: f2725fa8 ECX: 00000000 EDX: f2d0ccc4
2011-11-03T11:55:42.742897+02:00 SERVER_NAME kernel: ESI: f2728000 EDI: ffffffc4 EBP: c1456380 ESP: f2725f90
2011-11-03T11:55:42.742900+02:00 SERVER_NAME kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
2011-11-03T11:55:42.742901+02:00 SERVER_NAME kernel: Process fsnotify_mark (pid: 694, ti=f2724000 task=f2728000 task.ti=f2724000)
2011-11-03T11:55:42.742902+02:00 SERVER_NAME kernel: Stack:
2011-11-03T11:55:42.742904+02:00 SERVER_NAME kernel: f2728000 00000000 f2728000 c10504f0 f2725fa0 f2725fa0 f2d0ccc4 f2d0ccc4
2011-11-03T11:55:42.742905+02:00 SERVER_NAME kernel: f2c47f68 00000000 c10f2ef0 00000000 c1050174 00000000 00000000 00000000
2011-11-03T11:55:42.742906+02:00 SERVER_NAME kernel: 00000000 f2725fd4 f2725fd4 00000000 c1050100 f2c47f68 c157b876 00000000
2011-11-03T11:55:42.742907+02:00 SERVER_NAME kernel: Call Trace:
2011-11-03T11:55:42.742908+02:00 SERVER_NAME kernel: [<c10504f0>] ? wake_up_bit+0x60/0x60
2011-11-03T11:55:42.742910+02:00 SERVER_NAME kernel: [<c10f2ef0>] ? fsnotify_set_mark_ignored_mask_locked+0x20/0x20
2011-11-03T11:55:42.742911+02:00 SERVER_NAME kernel: [<c1050174>] ? kthread+0x74/0x80
2011-11-03T11:55:42.742913+02:00 SERVER_NAME kernel: [<c1050100>] ? kthread_worker_fn+0x150/0x150
2011-11-03T11:55:42.742915+02:00 SERVER_NAME kernel: [<c157b876>] ? kernel_thread_helper+0x6/0xd
2011-11-03T11:55:42.742917+02:00 SERVER_NAME kernel: Code: c1 b8 f0 ba 8a c1 e8 bb 24 f6 ff 8b 54 24 18 8d 42 c4 39 da 8b 48 3c 74 32 8d 79 c4 eb 0a 90 8d b4 26 00 00 00 00 89 ef 8b 68 40 
2011-11-03T11:55:42.742919+02:00 SERVER_NAME kernel: EIP: [<c10f2f75>] fsnotify_mark_destroy+0x85/0x130 SS:ESP 0068:f2725f90
2011-11-03T11:55:42.742920+02:00 SERVER_NAME kernel: CR2: 0000000000000004
2011-11-03T11:55:42.742921+02:00 SERVER_NAME kernel: ---[ end trace 0cdac460a4b203e5 ]---

Hope this helps.

Comment 12 Valentin Avram 2012-01-24 14:44:21 UTC

Hello.

It's been 2 months of silence since the last update on this bug. So far no fix, no comments, no nothing.

I hope to be able to retry the crash on gentoo-sources-3.1.6; maybe the fsnotify patches that went into the kernel in the meantime have fixed something.

If not, and it still crashes, it would be nice if somebody at least had a look at this issue.

Thx.

Comment 13 Valentin Avram 2012-03-02 07:47:03 UTC

Hello.

Since it seems nobody has any spare time to have a look at this issue, I notified the audit developers in the meantime. Nobody had told them of this issue.

Also, since I managed to get a bit of spare time and a spare server, I tested Gentoo's latest stable gentoo-sources-3.2.1-r2 with audit-2.1.3-r1, and the results are:

1. 3.2.1-gentoo-r2 does not carry any Gentoo-specific patch for the other oops triggered via auditd (which soon afterwards crashes the machine completely), tracked as SuSE bug 689860 ( https://bugzilla.novell.com/show_bug.cgi?id=689860 ) and officially fixed in kernel.org's 3.2.2. But that is not the issue here.

2. 3.2.1-gentoo-r2 still gives the original oops (crashed kernel thread fsnotify_mark), after which list_add corruption debug events pour in. Also, kernel.org's 3.2.9 (released yesterday) behaves the same.

The only good thing I noticed on 3.x (x>0) kernels, compared with 3.0.6 (on which I first noticed the issue), is that CPU stalls no longer happen. Or maybe I didn't give it enough time to get there.

Anyway, I will:
- attach logs of the initial BUG and the list_add corruption messages.
- notify the audit list that the problems are still there. Maybe the RedHat guys will find and/or confirm the issue.

Will keep this bug updated as soon as more information becomes available.

Thank you for your time.

Comment 14 Valentin Avram 2012-03-05 08:15:36 UTC

As promised, this is the BUG which shows up in 3.2.1-gentoo-r2:

kernel: [ 1200.790009] BUG: unable to handle kernel NULL pointer dereference at   (null)
kernel: [ 1200.790176] IP: [<c12379d0>] __list_del_entry+0x20/0xe0
kernel: [ 1200.790268] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8  
kernel: [ 1200.790357] Oops: 0000 [#1] SMP 
kernel: [ 1200.790441] 
kernel: [ 1200.790519] Pid: 642, comm: fsnotify_mark Not tainted 3.2.1-gentoo-r2-drbd-version3 #2 Dell Inc. PowerEdge 2950/0CX396
kernel: [ 1200.790690] EIP: 0060:[<c12379d0>] EFLAGS: 00010287 CPU: 6 
kernel: [ 1200.790775] EIP is at __list_del_entry+0x20/0xe0
kernel: [ 1200.790858] EAX: f4d49ec4 EBX: f47d3fa4 ECX: 00000000 EDX: 00000000 
kernel: [ 1200.790945] ESI: f4d49ec4 EDI: f4d49e88 EBP: f47d3f7c ESP: f47d3f64 
kernel: [ 1200.791031]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
kernel: [ 1200.791116] Process fsnotify_mark (pid: 642, ti=f47d2000 task=f447fc00 task.ti=f47d2000)
kernel: [ 1200.791275] Stack:
kernel: [ 1200.791352]  c10811d0 f47d3fa4 f447fc00 f3ca4e88 f47d3f7c f47d3fa4 f47d3fb8 c10f6636
kernel: [ 1200.791525]  ffffffc4 f447fc00 f447fc00 00000000 f447fc00 c1052f90 f47d3f9c f47d3f9c
kernel: [ 1200.791698]  f4d49ec4 f4d49ec4 f4c47f58 00000000 c10f65b0 f47d3fe4 c1052704 00000000
kernel: [ 1200.791870] Call Trace:
kernel: [ 1200.791953]  [<c10811d0>] ? rcu_check_callbacks+0x110/0x110
kernel: [ 1200.792039]  [<c10f6636>] fsnotify_mark_destroy+0x86/0x120
kernel: [ 1200.792126]  [<c1052f90>] ? abort_exclusive_wait+0x80/0x80
kernel: [ 1200.792211]  [<c10f65b0>] ? fsnotify_put_mark+0x30/0x30
kernel: [ 1200.792295]  [<c1052704>] kthread+0x74/0x80
kernel: [ 1200.792379]  [<c1052690>] ? kthread_flush_work_fn+0x10/0x10
kernel: [ 1200.792466]  [<c1581eb6>] kernel_thread_helper+0x6/0xd 
kernel: [ 1200.792550] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
kernel: [ 1200.792929] EIP: [<c12379d0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47d3f64
kernel: [ 1200.793020] CR2: 0000000000000000
kernel: [ 1200.793442] ---[ end trace b824ee2095d496c7 ]---

The BUG that shows up in kernel.org's 3.2.9 is the following:
kernel: [  301.240011] BUG: unable to handle kernel NULL pointer dereference at   (null)
kernel: [  301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
kernel: [  301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8 
kernel: [  301.240698] Oops: 0000 [#1] SMP 
kernel: [  301.240910] 
kernel: [  301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
kernel: [  301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
kernel: [  301.241498] EIP is at __list_del_entry+0x20/0xe0
kernel: [  301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000 
kernel: [  301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64 
kernel: [  301.241879]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
kernel: [  301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
kernel: [  301.242207] Stack:
kernel: [  301.242327]  c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
kernel: [  301.242882]  ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
kernel: [  301.243438]  f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
kernel: [  301.243995] Call Trace:
kernel: [  301.244119]  [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
kernel: [  301.244248]  [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
kernel: [  301.244377]  [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
kernel: [  301.244504]  [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
kernel: [  301.244631]  [<c1052834>] kthread+0x74/0x80
kernel: [  301.244756]  [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
kernel: [  301.244885]  [<c1582ab6>] kernel_thread_helper+0x6/0xd
kernel: [  301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
kernel: [  301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
kernel: [  301.248414] CR2: 0000000000000000
kernel: [  301.248538] ---[ end trace 15082dbfb353f84c ]---

So it's basically the same. In both cases, the kernel thread fsnotify_mark crashes.
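
The "basically the same" comparison can be made mechanical by pulling the faulting symbol and offset out of each oops. The sketch below assumes the `IP: [<address>] symbol+offset/size` line format seen in the dumps above; the regular expression and function name are hypothetical.

```python
import re

# Matches e.g. "IP: [<c12379d0>] __list_del_entry+0x20/0xe0".
IP_RE = re.compile(
    r"IP: \[<(?P<addr>[0-9a-f]+)>\] "
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def fault_site(oops_lines):
    """Return (symbol, offset) from the first IP line of an oops, or None if there is none."""
    for line in oops_lines:
        match = IP_RE.search(line)
        if match:
            return match.group("symbol"), match.group("offset")
    return None
```

Two oopses with the same `fault_site` result died at the same instruction offset within the same function, even when the absolute addresses differ between kernel builds, as they do between the two dumps above.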

If need be, I can add the list_add corruption warnings the kernel logs after the BUG, but all of them are from 3.2.9, not 3.2.1-gentoo-r2.

Just ask if they would be useful.

Comment 15 Valentin Avram 2012-03-05 08:33:15 UTC

Created attachment 304275 [details]
The 3.2.x kernel config used

This is the 3.2.9 kernel config used to generate the BUG and the list_add corruption messages. I diffed it against the config used for 3.2.1-gentoo-r2 and they are identical except for the kernel version in the header.

Comment 16 Mike Pagano gentoo-dev 2012-03-06 19:07:07 UTC

Please take this issue upstream at http://bugzilla.kernel.org and post the url back here.

Comment 17 Valentin Avram 2012-03-22 07:33:34 UTC

Reported to bugzilla.kernel.org:
https://bugzilla.kernel.org/show_bug.cgi?id=42882

Reported to LKML:
https://lkml.org/lkml/2012/3/13/200

Reported to audit/redhat:
https://www.redhat.com/archives/linux-audit/2012-March/msg00004.html

No answer from anywhere yet, still waiting for someone to notice.

Comment 18 Mike Pagano gentoo-dev 2012-05-02 23:08:43 UTC

Thanks. We'll watch the upstream bug and work to backport any patches identified.