MemoryDenyWriteExecute is working as intended, but is impossible to successfully use because everything in stage3-armv5tel-systemd-mergedusr appears to have W+X mappings:

armv5 ~ # grep libc.so.6 /proc/1/maps
b692b000-b6a9c000 r-xp 00000000 00:0d 13950 /usr/lib/libc.so.6
b6a9c000-b6a9e000 r-xp 00170000 00:0d 13950 /usr/lib/libc.so.6
b6a9e000-b6a9f000 rwxp 00172000 00:0d 13950 /usr/lib/libc.so.6

...so prctl(PR_SET_MDWE), correctly, prevents all subsequent attempts to execve():

prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0, 0, 0) = 0
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=11, filter=0x1ca7e28}) = 0
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=42, filter=0x1ca9508}) = 0
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=34, filter=0x1ca8a70}) = 0
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=12, filter=0x1ca0990}) = 0
personality(0xffffffff) = 0xc00000 (PER_LINUX|READ_IMPLIES_EXEC|ADDR_LIMIT_32BIT)
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=8, filter=0x1ca82b8}) = 0
brk(0x1cde000) = 0x1cde000
brk(0x1cff000) = 0x1cff000
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=517, filter=0x1cfa348}) = 0
execve("/usr/lib/systemd/systemd-journald", ["/usr/lib/systemd/systemd-journal"...], 0x1c9c790 /* 14 vars */) = -1 EACCES (Permission denied)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=NULL} ---
+++ killed by SIGSEGV +++

Applying debug patch [0] to the kernel yields this backtrace from the failed execve():

[ 6.212802][ T68] ------------[ cut here ]------------
[ 6.213002][ T68] WARNING: CPU: 0 PID: 68 at include/linux/mman.h:200 mmap_region+0x60c/0x794
[ 6.213278][ T68] CPU: 0 PID: 68 Comm: systemd-journal Not tainted 6.8.0-rc7-arm-00239-g08d1972cbe6f #4
[ 6.213475][ T68] Hardware name: ARM-Versatile (Device Tree Support)
[ 6.214022][ T68] [<c010ae54>] (unwind_backtrace) from [<c0108ac4>] (show_stack+0x18/0x1c)
[ 6.214245][ T68] [<c0108ac4>] (show_stack) from [<c0732398>] (dump_stack_lvl+0x38/0x5c)
[ 6.214377][ T68] [<c0732398>] (dump_stack_lvl) from [<c01198cc>] (__warn+0x7c/0xf4)
[ 6.214507][ T68] [<c01198cc>] (__warn) from [<c0727a10>] (warn_slowpath_fmt+0x70/0x90)
[ 6.214663][ T68] [<c0727a10>] (warn_slowpath_fmt) from [<c026c474>] (mmap_region+0x60c/0x794)
[ 6.214790][ T68] [<c026c474>] (mmap_region) from [<c026c97c>] (do_mmap+0x380/0x3cc)
[ 6.214954][ T68] [<c026c97c>] (do_mmap) from [<c024eac8>] (vm_mmap_pgoff+0xb8/0xf4)
[ 6.215113][ T68] [<c024eac8>] (vm_mmap_pgoff) from [<c02e4fa4>] (elf_load+0x18c/0x1f0)
[ 6.215280][ T68] [<c02e4fa4>] (elf_load) from [<c02e5580>] (load_elf_binary+0x578/0xf1c)
[ 6.215448][ T68] [<c02e5580>] (load_elf_binary) from [<c02a0ea0>] (bprm_execve+0x1a8/0x364)
[ 6.215600][ T68] [<c02a0ea0>] (bprm_execve) from [<c02a1640>] (do_execveat_common+0x18c/0x1b0)
[ 6.215724][ T68] [<c02a1640>] (do_execveat_common) from [<c02a2180>] (sys_execve+0x34/0x3c)
[ 6.215848][ T68] [<c02a2180>] (sys_execve) from [<c01001d0>] (__sys_trace_return+0x0/0x10)
[ 6.216003][ T68] Exception stack(0xd0b15fa8 to 0xd0b15ff0)
[ 6.216163][ T68] 5fa0: bea44a60 bea449a8 00bba528 00bc5798 00bba770 00bba770
[ 6.216295][ T68] 5fc0: bea44a60 bea449a8 00bc5798 0000000b 00bba770 bea44798 bea44788 bea4472c
[ 6.216411][ T68] 5fe0: b6da0cfc bea4448c b6b4f938 b69a1dcc
[ 6.216545][ T68] ---[ end trace 0000000000000000 ]---

Applying debug patch [1] to the kernel makes everything work, proving PR_SET_MDWE is the problem.

See [2] and [3] and [4] for discussion about a similar MDWE problem on parisc.

I guess this is because pre-ARMv6 lack NX support? But:

armv5 ~ # grep heap /proc/1/maps
01ec7000-01fab000 rwxp 00000000 00:00 0 [heap]
armv5 ~ # grep stack /proc/1/maps
bee62000-bee83000 rw-p 00000000 00:00 0 [stack]

...so clearly, at least the stack can be mapped RW without being executable? I don't get it.

[0] https://gist.github.com/jcalvinowens/d5b46c707284d8ec2bac6e9ac7a07140
[1] https://gist.github.com/jcalvinowens/50ca950d26d1f9a453fe39b98bb1941a
[2] https://bugs.gentoo.org/916469
[3] https://github.com/systemd/systemd/issues/29775
[4] https://lore.kernel.org/linux-parisc/87lebjz9z6.fsf@gentoo.org/T/#u

Reproducible: Always

Steps to Reproduce:
1. Create VM image with stage3-armv5tel-systemd-mergedusr-20240309T100447Z.tar.xz
2. Boot image using qemu-system-arm -M versatilepb
3. Observe failures

Actual Results:
[ 6.533025][ T1] systemd[1]: Starting Journal Service...
[ 7.832046][ T1] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=11/SEGV
[ 7.834339][ T1] systemd[1]: systemd-journald.service: Failed with result 'signal'.
[ 7.836415][ T1] systemd[1]: Failed to start Journal Service.
[FAILED] Failed to start Journal Service.
See 'systemctl status systemd-journald.service' for details.
[ 7.895879][ T1] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2.
[ 8.012296][ T1] systemd[1]: Starting Journal Service...
[ 9.024597][ T1] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=11/SEGV
[ 9.026651][ T1] systemd[1]: systemd-journald.service: Failed with result 'signal'.
[ 9.044774][ T1] systemd[1]: Failed to start Journal Service.
[FAILED] Failed to start Journal Service.
See 'systemctl status systemd-journald.service' for details.
[ 9.075509][ T1] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 3.
[ 9.207493][ T1] systemd[1]: Starting Journal Service...

Expected Results:
System should work as normal

Workaround is simple, just remove MemoryDenyWriteExecute= from system units
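
For anyone who wants to check a process for the W+X mappings described in the report without grepping by hand, here is a minimal sketch in C. It only looks at the permission field of each /proc/<pid>/maps line ("rwxp" and so on). The helper names maps_line_is_wx() and count_wx_mappings() are made up for this illustration and are not part of systemd or the kernel.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if a /proc/<pid>/maps line describes a mapping that is both
 * writable and executable, 0 otherwise. The second whitespace-separated
 * field is the permission string, for example "rwxp".
 */
int maps_line_is_wx(const char *line)
{
	char perms[5] = { 0 };

	/* Skip the "start-end" address range, then read up to 4 permission chars. */
	if (sscanf(line, "%*s %4s", perms) != 1)
		return 0;

	return perms[1] == 'w' && perms[2] == 'x';
}

/*
 * Count the W+X mappings listed in a maps file such as "/proc/self/maps".
 * Returns -1 if the file cannot be opened.
 */
int count_wx_mappings(const char *path)
{
	char line[512];
	int at_line_start = 1;
	int count = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Only parse the first chunk of each line; skip overflow chunks. */
		if (at_line_start)
			count += maps_line_is_wx(line);
		at_line_start = strchr(line, '\n') != NULL;
	}

	fclose(f);
	return count;
}
```

Calling count_wx_mappings("/proc/1/maps") on the affected image should return a non-zero count, given the rwxp lines for libc.so.6 and the heap shown above.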
There's a fix on its way for older ARM: https://lore.kernel.org/linux-parisc/20240227013546.15769-4-zev@bewilderbeest.net/T/#t.
Thanks Sam, I'll just mark this as fixed when that hits upstream.

Answering my own question as to how READ_IMPLIES_EXEC can be set while the stack still shows up as non-executable:

armv5 ~ # grep stack /proc/1/maps
bee62000-bee83000 rw-p 00000000 00:00 0 [stack]

...it's here: the stack setup effectively skips the check in do_mmap() that everything else hits:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/exec.c#n750
I'm a little surprised there wasn't more pushback on that kernel patch. Unlike with parisc, systemd is arguably at fault: it could check for READ_IMPLIES_EXEC in the return from sys_personality() and not issue the PR_SET_MDWE prctl() in that case.

Maybe it should anyway. Something like:

https://gist.github.com/jcalvinowens/cdbddd7749c390b723146e7b4c9c9f2f
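
To make the idea of checking READ_IMPLIES_EXEC before enabling MDWE concrete, here is a rough sketch in C. It is not the content of the linked gist and not systemd's actual code. The function names persona_implies_exec() and apply_mdwe_unless_read_implies_exec() are hypothetical, and the fallback #defines are assumptions for headers that predate PR_SET_MDWE.

```c
#include <sys/personality.h>
#include <sys/prctl.h>

/* Fallback values, assumed for headers that predate PR_SET_MDWE. */
#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#endif
#ifndef PR_MDWE_REFUSE_EXEC_GAIN
#define PR_MDWE_REFUSE_EXEC_GAIN 1
#endif

/* Return 1 if the given personality value has READ_IMPLIES_EXEC set. */
int persona_implies_exec(unsigned long persona)
{
	return (persona & READ_IMPLIES_EXEC) != 0;
}

/*
 * Enable MDWE only when readable mappings are not implicitly executable.
 * Returns 0 if MDWE was enabled, 1 if it was skipped because
 * READ_IMPLIES_EXEC is set, and -1 on error.
 */
int apply_mdwe_unless_read_implies_exec(void)
{
	/* personality(0xffffffff) queries the current value without changing it. */
	int persona = personality(0xffffffff);

	if (persona == -1)
		return -1;

	if (persona_implies_exec((unsigned long)persona))
		return 1;

	if (prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0, 0, 0) != 0)
		return -1;

	return 0;
}
```

With the personality value from the strace output in the report (0xc00000), persona_implies_exec() returns 1, so this sketch would skip the prctl() and the later execve() would not be refused. Note that MDWE cannot be turned off again once it has been enabled in a process, so only run the second function somewhere that is acceptable.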
The fix is upstream now:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5aad4c2ca05
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=166ce846dc59