I've been using rasdaemon for a while to monitor the ECC memory in my machine and I noticed that it recently started crashing on startup. I'm unsure which version was the first one affected, but the latest one definitely is. The crash happens with stable versions of all the package's dependencies, everything compiled with sys-devel/gcc-11.3.1_p20221209, which is also the current stable compiler. The build CFLAGS are just "-O2" and nothing else. The crash only happens when the `--record` flag is passed and yields the following stack trace:

#0 ___pthread_mutex_lock (mutex=0x7473656d6974202c) at pthread_mutex_lock.c:80
#1 0x00007ffff7ed379c in sqlite3_finalize (pStmt=0x7fff9800e9b0) at sqlite3.c:87444
#2 0x00005555555687f2 in ras_mc_event_closedb (cpu=27, ras=<optimized out>) at ras-record.c:923
#3 0x0000555555564698 in handle_ras_events_cpu (priv=0x5555555c4e30) at ras-events.c:608
#4 0x00007ffff7ce337a in start_thread (arg=<optimized out>) at pthread_create.c:442
#5 0x00007ffff7d6422c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I've tried compiling with CFLAGS="-O0" to better narrow down the crash, but it disappears with optimizations disabled.
Are you sure it's crashing in that thread and not another one? Could you share the output of "bt full"?
(In reply to Sam James from comment #1)
> Are you sure it's crashing in that thread and not another one?
>
> Could you share the output of "bt full"?

(+ emerge --info please)
Created attachment 848018: `emerge --info` output
I've double-checked and this is indeed the crashing thread, this is the output of `bt full`:

#0 ___pthread_mutex_lock (mutex=0x7473656d6974202c) at pthread_mutex_lock.c:80
        type = <optimized out>
        __PRETTY_FUNCTION__ = "___pthread_mutex_lock"
        id = <optimized out>
#1 0x00007ffff7ed379c in sqlite3_finalize (pStmt=0x7fffa400e9b0) at sqlite3.c:87444
        v = 0x7fffa400e9b0
        db = 0x7fffa400f310
        rc = <optimized out>
#2 0x00005555555687f2 in ras_mc_event_closedb (cpu=17, ras=<optimized out>) at ras-record.c:923
        rc = <optimized out>
        db = 0x7fffa4001c40
        priv = 0x7fffa4001bf0
        __func__ = "ras_mc_event_closedb"
#3 0x0000555555564698 in handle_ras_events_cpu (priv=0x5555555c4cf0) at ras-events.c:608
        fd = 38
        kbuf = 0x7fffa4001b80
        page = 0x7fffa4000b70
        pipe_raw = "per_cpu/cpu17/trace_pipe_raw", '\000' <repeats 4067 times>
        pdata = <optimized out>
#4 0x00007ffff7ce337a in start_thread (arg=<optimized out>) at pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737488347696, -2861193018973524210, 140736204891840, 2, 140737350873264, 140736196501504, 2861024794797764366, 2861175029158997774}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5 0x00007ffff7d6422c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.
Also this:

(gdb) p $_siginfo._sifields._sigfault.si_addr
$7 = (void *) 0x0

Looks like a NULL pointer access.
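A note on the faulting address: the `mutex` argument in frame #0 of the backtrace is 0x7473656d6974202c, not NULL. Here is a minimal sketch, in Python purely for illustration, that decodes that value as little-endian bytes (the x86-64 byte order). The variable names are made up for this example and the printable-ASCII check is only a heuristic.

```python
# Pointer value copied from frame #0 of the backtrace.
mutex_ptr = 0x7473656d6974202c

# x86-64 is little-endian, so the lowest-order byte comes first in memory.
raw = mutex_ptr.to_bytes(8, byteorder="little")

# Heuristic (an assumption of this sketch): treat the value as text
# if every byte falls in the printable ASCII range.
looks_like_text = all(0x20 <= b <= 0x7E for b in raw)

print(raw, looks_like_text)  # prints: b', timest' True
```

The bytes spell `, timest`, so the pointer appears to hold string data rather than an address. That would be consistent with some form of memory corruption, although the trace alone does not say where the string came from. The value is also a non-canonical address on x86-64, and as far as I know the kernel reports si_addr as 0 for faults on such addresses, so the 0x0 above may not indicate a genuine NULL dereference.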
An extra bit of information: I was wrong about the crash not presenting itself when compiling the package with -O0. It still happens, it just takes a while longer. This issue might be timing-dependent, and it doesn't look like it's specific to Gentoo: I've tried a plain build of the upstream sources and I can still reproduce it. I'll bring this to the bug tracker of the upstream package.
Thanks, please throw a link here when you do. My guess is https://github.com/mchehab/rasdaemon/issues/77.
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=c5bc82ad10a33da634522bae36d22966485ffbb3

commit c5bc82ad10a33da634522bae36d22966485ffbb3
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2023-02-19 18:37:38 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-02-19 18:37:47 +0000

    app-admin/rasdaemon: add 0.8.0

    Closes: https://bugs.gentoo.org/890286
    Signed-off-by: Sam James <sam@gentoo.org>

 app-admin/rasdaemon/Manifest                       |  1 +
 .../files/rasdaemon-0.8.0-bashisms-configure.patch | 40 +++++++++++
 app-admin/rasdaemon/rasdaemon-0.8.0.ebuild         | 83 ++++++++++++++++++++++
 3 files changed, 124 insertions(+)
Sorry, I mixed up libtracefs/libtraceevent. The new version uses an unbundled, much newer copy of libtraceevent. As for your bug, see https://github.com/mchehab/rasdaemon/issues/77#issuecomment-1399202752.
I'm getting the same bug (identical stack trace) on app-admin/rasdaemon-0.8.0 as well, with an underlying configuration listed as problematic in the linked bug (AMD CPU with _SC_NPROCESSORS_CONF != _SC_NPROCESSORS_ONLN). I'm attaching a log from running rasdaemon under Valgrind; it seems to come down to a use-after-free bug. I'll comment on the GitHub issue as well.
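For anyone wanting to check whether their machine matches the configuration mentioned above, here is a small sketch in Python that compares the two values via `os.sysconf`. It only reports the counts and says nothing about rasdaemon itself. It assumes a Linux system where both sysconf names are available, and the variable names are illustrative.

```python
import os

# Number of processors configured in the system
# (may include offline or merely possible CPUs).
conf = os.sysconf("SC_NPROCESSORS_CONF")

# Number of processors currently online.
onln = os.sysconf("SC_NPROCESSORS_ONLN")

# True when the machine is in the configuration described in the report.
mismatch = conf != onln

print(f"configured={conf} online={onln} mismatch={mismatch}")
```

If `mismatch` is true, the system has the same CONF/ONLN difference described in the comment above.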
Created attachment 858929: Output of a rasdaemon-0.8.0 crash under Valgrind
Might also be related to https://github.com/mchehab/rasdaemon/pull/93.