This is a very in-depth analysis of mine, with a small reproducer program that segfaults with glibc-2.34, while with glibc-2.33 it gives a "Permission denied" error. My reproducer (hopefully) performs the same steps that lead to a crash in firefox-94.0.2 (and probably many other versions too).

Let me start with an strace of firefox when I get a segmentation fault in libc. These are the last lines of the strace:

getuid() = 1000
newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|012545400720, st_size=0, ...}, 0) = 262
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_errno=EINVAL, si_call_addr=0x7f5c85cbaaaa, si_syscall=__NR_newfstatat, si_arch=AUDIT_ARCH_X86_64} ---
socketpair(AF_UNIX, SOCK_SEQPACKET, 0, [33, 35]) = 0
sendmsg(28, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\2\0\0\0\0\0\0\0\220\0\0\0\0\0\0\0", iov_len=16}, {iov_base="/etc/nsswitch.conf\0", iov_len=19}, {iov_base=NULL, iov_len=0}], msg_iovlen=3, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[35]}], msg_controllen=24, msg_flags=0}, MSG_NOSIGNAL) = 35
close(35) = 0
recvmsg(33, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0", iov_len=4}, {iov_base="\1\374\0\0\0\0\0\0\255p@\0\0\0\0\0\1\0\0\0\0\0\0\0\244\201\0\0\0\0\0\0"..., iov_len=144}], msg_iovlen=2, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 148
close(33) = 0
rt_sigreturn({mask=[]}) = 0
newfstatat(AT_FDCWD, "/", {st_mode=S_IFBLK|014535000562, st_rdev=makedev(0, 0), ...}, 0) = 262
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_errno=EINVAL, si_call_addr=0x7f5c85cbaaaa, si_syscall=__NR_newfstatat, si_arch=AUDIT_ARCH_X86_64} ---
socketpair(AF_UNIX, SOCK_SEQPACKET, 0, [33, 35]) = 0
sendmsg(28, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\2\0\0\0\0\0\0\0\220\0\0\0\0\0\0\0", iov_len=16}, {iov_base="/\0", iov_len=2}, {iov_base=NULL, iov_len=0}], msg_iovlen=3, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[35]}], msg_controllen=24, msg_flags=0}, MSG_NOSIGNAL) = 18
close(35) = 0
recvmsg(33, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\363\377\377\377", iov_len=4}, {iov_base="", iov_len=144}], msg_iovlen=2, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 4
close(33) = 0
writev(2, [{iov_base="Sandbox: ", iov_len=9}, {iov_base="Failed errno -13 op stat flags 0"..., iov_len=40}, {iov_base="\n", iov_len=1}], 3) = 50
rt_sigreturn({mask=[]}) = -1 EACCES (Permission denied)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f5c85bffb40}, NULL, 8) = 0
rt_sigreturn({mask=[]}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV +++

Studying the firefox sandbox a little, it seems that all filesystem calls are blocked by seccomp, so that SIGSYS is raised. The signal is caught by a handler in which firefox decides whether the file may be accessed, and it then sets the value that the interrupted syscall returns after rt_sigreturn. As you can see, there is this syscall:

newfstatat(AT_FDCWD, "/", {st_mode=S_IFBLK|014535000562, st_rdev=makedev(0, 0), ...}, 0) = 262

It is directly followed by SIGSYS, and then some messages are sent between processes to decide whether access should be granted. The decision is -13 == -EACCES (it shows up as -1 EACCES on rt_sigreturn, presumably because strace decodes the raw return value -13 as an error return). My reproducer skips the whole socketpair/sendmsg/recvmsg part and just sets the rt_sigreturn values as observed in firefox, i.e.
I end up with this strace:

getuid() = 1000
newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=01200367467, st_size=1000, ...}, 0) = 262
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x7f040a040aaa, si_syscall=__NR_newfstatat, si_arch=AUDIT_ARCH_X86_64} ---
rt_sigreturn({mask=[]}) = 0
newfstatat(AT_FDCWD, "/", {st_mode=000, st_size=0, ...}, 0) = 262
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x7f040a040aaa, si_syscall=__NR_newfstatat, si_arch=AUDIT_ARCH_X86_64} ---
rt_sigreturn({mask=[]}) = -1 EACCES (Permission denied)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV +++
Segmentation fault

This strace is similar to the one above apart from the messaging, which seems irrelevant to the crash anyway. Running the reproducer with glibc-2.33 I cannot observe a segmentation fault (but I do get "Permission denied").

In my humble opinion there are two problems:

1) glibc should not segfault (see my patch in the referenced bug, which would at least avoid the crash).
2) The operation denied by the firefox sandbox is an fstatat on `/`. The manpage is hard to apply here, but in my opinion this call should always be allowed: fstatat needs no permissions on the file itself, only search (execute) permission on the directories leading to it, which is hard to interpret for the root directory since no directories lead to it.

What I'm trying to say: even when glibc does not segfault, I think firefox is behaving incorrectly with its sandbox rules.

I also want to note that I do not know whether setting the registers in the sigsys_handler is allowed and leads to defined behaviour, but this is a copy of what firefox does. If it is undefined behaviour, it should be reported to firefox that fiddling with the return values of syscalls is not allowed.
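To illustrate point 2: outside any sandbox, an unprivileged fstatat() on "/" is expected to succeed. A minimal check along those lines (the helper name root_is_statable is made up for this sketch):

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD */
#include <sys/stat.h>  /* fstatat, struct stat, S_ISDIR */

/* Hypothetical helper: returns 1 if fstatat() on "/" succeeds and reports
   a directory, 0 otherwise. stat(2) needs no permission on the target
   itself, only search permission on the directories leading to it, and
   "/" has none, so without a sandbox this is expected to succeed for an
   unprivileged user. */
static int root_is_statable(void)
{
    struct stat st;
    if (fstatat(AT_FDCWD, "/", &st, 0) != 0)
        return 0;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

Under the firefox content sandbox the same call came back with EACCES instead, as the strace above shows.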
Reproducible: Always

Steps to Reproduce:
1. Compile the reproducer test.c: gcc test.c -lseccomp
2. Execute: ./a.out (or strace ./a.out)
3. Watch it crash with glibc-2.34
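The attached test.c is not reproduced here. As a rough sketch of the same mechanism, under these assumptions: Linux on x86_64 or aarch64, a raw seccomp BPF filter installed via prctl() instead of libseccomp, and a handler that denies every newfstatat rather than only "/". The filter traps newfstatat with SECCOMP_RET_TRAP, and the SIGSYS handler fakes the syscall's return value by editing the saved register context, which is what the firefox sandbox does after asking its broker. The names install_fstatat_trap, sigsys_handler and raw_fstatat are hypothetical:

```c
#define _GNU_SOURCE             /* REG_RAX, si_syscall */
#include <errno.h>
#include <fcntl.h>              /* AT_FDCWD */
#include <linux/audit.h>        /* AUDIT_ARCH_* */
#include <linux/filter.h>       /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>      /* SECCOMP_MODE_FILTER, SECCOMP_RET_* */
#include <signal.h>
#include <stddef.h>             /* offsetof */
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>        /* SYS_newfstatat */
#include <ucontext.h>
#include <unistd.h>             /* syscall */

#if defined(__x86_64__)
#  define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#  define SET_SYSCALL_RET(uc, v) ((uc)->uc_mcontext.gregs[REG_RAX] = (v))
#elif defined(__aarch64__)
#  define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#  define SET_SYSCALL_RET(uc, v) ((uc)->uc_mcontext.regs[0] = (v))
#else
#  error "this sketch only covers x86_64 and aarch64"
#endif

/* Stand-in for the broker decision: every trapped newfstatat is denied by
   writing -EACCES into the return register of the interrupted context.
   rt_sigreturn then restores that context, so the syscall appears to have
   returned -EACCES. */
static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    (void)sig;
    if (info->si_syscall == SYS_newfstatat)
        SET_SYSCALL_RET(uc, -EACCES);
}

/* Installs the SIGSYS handler, then a seccomp filter that traps newfstatat
   and allows everything else (including any other architecture, which is
   fine for a sketch but not for a real sandbox). Returns 0 on success. */
static int install_fstatat_trap(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigsys_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSYS, &sa, NULL) != 0)
        return -1;

    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_newfstatat, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof filter / sizeof filter[0]),
        .filter = filter,
    };

    /* Required to install a filter without CAP_SYS_ADMIN. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Issues newfstatat directly, independent of how the libc stat() wrappers
   are implemented. */
static long raw_fstatat(const char *path, struct stat *st)
{
    return syscall(SYS_newfstatat, AT_FDCWD, path, st, 0);
}
```

After install_fstatat_trap() succeeds, raw_fstatat("/", &st) is expected to return -1 with errno set to EACCES, and strace should show a SIGSYS followed by rt_sigreturn(...) = -1 EACCES, like the traces above.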
Created attachment 757256: small reproducer
Your reproducer doesn't crash with glibc older than 2.34 but crashes as expected on systems using glibc-2.34 (I also verified this on Fedora, for example). I am assigning this bug to toolchain for now; they should tell us whether this is a problem in glibc or working as designed. Since you mentioned that you experienced this crash when using Firefox: could you please describe how to reproduce it with Firefox? I haven't seen this crash in Firefox yet, and I also fail to find crash reports for Fedora at https://crash-stats.mozilla.org/.
I have described two ways to consistently crash it here: https://bugs.gentoo.org/803950#c17
I added a backtrace in this comment: https://bugs.gentoo.org/803950#c14

The backtrace involves the functions `g_app_info_get_default_for_type` and `FAMOpen`. I have no insight into why the two crashing actions (dragging an image / opening the context menu of uBlock Origin) at some point enter the function `g_app_info_get_default_for_type`, which ultimately leads to a call to `getpwuid(getuid())` (this happens in /usr/lib64/libfam.so.0 as a result of a call to `FAMOpen`). Maybe I have a non-standard setup where libfam is involved, unlike on Fedora; no idea. But I'm not the only one who sees this exact crash; there is also a discussion about it in the forum: https://forums.gentoo.org/viewtopic-p-8650991.html?sid=30f593c8b7d401d9b8a22662ebed0cb3
Forgot to mention that I also sent it to the libc-help mailing list: https://sourceware.org/pipermail/libc-help/2021-December/006061.html
[Note that there was some discussion first in the other bug and an experimental patch was posted there too: https://bugs.gentoo.org/803950#c15].
Compiling dev-libs/glib with USE=-fam solves the issue too. Maybe glib is not compiled with FAM support on Fedora, which would explain why no bug report shows up on Fedora's bug tracker. So far it seems that to experience the crash you need to compile dev-libs/glib with +fam, use glibc-2.34, and NOT use the patch that I posted in the other bug report; then start dragging an image or open the context menu of the uBlock Origin add-on.
I experienced the described UI behaviour wrt uBlock Origin (and similar for Multi-Account Containers) on updating to glibc-2.34 and firefox-bin-96. I can confirm rebuilding dev-libs/glib USE="-fam" prevents it. Andreas, is this the best workaround for the time being, or would it be better instead to apply the glibc patch from the other bug?
I also use "dev-libs/glib -fam" as a workaround. I'm not going to enable FAM support even once the bug is fixed in glibc; after finding out about FAM and reading about it, I got the impression that it is not really useful. There is also the glibc patch in the other bug report, and Sam reported the bug on the glibc Bugzilla (see URL above) with a different (and more appropriate) patch.
*** Bug 832504 has been marked as a duplicate of this bug. ***
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/toolchain/glibc-patches.git/commit/?id=6415ce699bf1dafc403be7464df662b2879687e8

commit 6415ce699bf1dafc403be7464df662b2879687e8
Author:     Andreas K. Hüttel <dilfridge@gentoo.org>
AuthorDate: 2022-02-12 18:44:46 +0000
Commit:     Andreas K. Hüttel <dilfridge@gentoo.org>
CommitDate: 2022-02-12 18:44:46 +0000

    Add patch for bug 828070

    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28752
    Bug: https://bugs.gentoo.org/828070
    Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>

 ...ault-in-getpwuid-when-stat-fails-BZ-28752.patch | 39 ++++++++++++++++++++++
 1 file changed, 39 insertions(+)
Any plans for new sys-libs/glibc release with updated glibc patchset containing the above fix?
(In reply to Maciej S. Szmigiero from comment #11)
> Any plans for new sys-libs/glibc release with updated glibc patchset
> containing the above fix?

We tend to batch them up with stable backports once they're posted.
> We tend to batch them up with stable backports once they're posted.

So we are waiting for upstream to apply the patch, then backport it to glibc 2.34, correct?
(In reply to Maciej S. Szmigiero from comment #13)
> > We tend to batch them up with stable backports once they're posted.
>
> So we are waiting for upstream to apply the patch, then backport it to glibc
> 2.34, correct?

No, we don't need to wait for this one in particular. We just usually don't do a new patchset release until a few fixes are queued up. I'll check what the situation is when I'm not on mobile.
(In reply to Sam James from comment #14)
> (In reply to Maciej S. Szmigiero from comment #13)
> > > We tend to batch them up with stable backports once they're posted.
> >
> > So we are waiting for upstream to apply the patch, then backport it to glibc
> > 2.34, correct?
>
> No, we don't need to wait for this one in particular. We just usually don't
> do a new patchset release until a few fixes are queued up. I'll check what
> the situation is when I'm not on mobile.

https://gitweb.gentoo.org/proj/toolchain/glibc-patches.git/commit/?id=6415ce699bf1dafc403be7464df662b2879687e8

This is in the (currently unkeyworded) =sys-libs/glibc-2.34-r9. Please test and give feedback. Thanks!
(In reply to Sam James from comment #15)
> This is in the (currently unkeyworded) =sys-libs/glibc-2.34-r9. Please test
> and give feedback.

Thanks for quickly providing an updated glibc, Sam. With this version the Firefox web page process indeed no longer crashes. Instead, it hangs, with a spinner rotating endlessly.

To be sure that it isn't something specific to my setup, it would be great if somebody else could confirm this finding. Andreas?

My test picture: https://en.wikichip.org/wiki/File:intel_nehalem_lynfield_die_shot.jpg

My test system package versions:
* www-client/firefox
  Latest version available: 97.0.1
  Latest version installed: 97.0.1
* sys-devel/clang
  Latest version available: 13.0.1
  Latest version installed: 13.0.1
* sys-devel/gcc
  Latest version available: 11.2.1_p20220115
  Latest version installed: 11.2.1_p20211127
(In reply to Maciej S. Szmigiero from comment #16)
>
> To be sure that it isn't something specific to my setup it would be great if
> somebody else had confirmed this finding.
> Andreas?
>

I am seeing this behavior too. As before with bug #803950, starting Firefox with MOZ_DISABLE_CONTENT_SANDBOX=1 got it going.

* www-client/firefox
  Latest version available: 97.0.1
  Latest version installed: 97.0.1
* sys-devel/clang
  Latest version available: 13.0.1
  Latest version installed: 13.0.1
* sys-devel/gcc
  Latest version available: 11.2.1_p20220115
  Latest version installed: 11.2.1_p20220115 (11.2.1 is active)
Same behaviour here (no tabs will load).

firefox-bin-97.0.1
glibc-2.34-r9

Resolved by rebuilding dev-libs/glib with USE="-fam".
I think we need someone hitting this to report it to Firefox upstream (Mozilla) then. I'm pretty confident the glibc patch is right (although it could still be an issue on that side) though.
(In reply to Sam James from comment #19)
> I think we need someone hitting this to report it to Firefox upstream
> (Mozilla) then. I'm pretty confident the glibc patch is right (although it
> could still be an issue on that side) though.

... although that said, I can't explain why -fam still fixes it. So maybe I'm missing something.
(In reply to Sam James from comment #20)
> (In reply to Sam James from comment #19)
> > I think we need someone hitting this to report it to Firefox upstream
> > (Mozilla) then. I'm pretty confident the glibc patch is right (although it
> > could still be an issue on that side) though.
>
> ... although that said, I can't explain why -fam still fixes it. So maybe
> I'm missing something.

waaait. We're taking a lock in nss_database_check_reload_and_get() and not yielding it when we return.

Can someone try the attached patch please?
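A lock that is taken and then not released on an early-return path would be consistent with the hang reported above: the first caller gets its error back, and every later caller blocks forever. A hypothetical sketch of that bug class using a pthread mutex (struct db_state and both functions are invented for illustration; this is not glibc's nss_database code):

```c
#include <pthread.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical cached state guarded by a lock. */
struct db_state {
    pthread_mutex_t lock;
    const char *cached;   /* last known-good value */
};

/* Buggy shape: the early return on stat() failure leaves the lock held,
   so the next caller blocks in pthread_mutex_lock() forever (a hang, not
   a crash). */
static bool check_reload_and_get_buggy(struct db_state *s, const char *path,
                                       const char **result)
{
    struct stat st;
    pthread_mutex_lock(&s->lock);
    if (stat(path, &st) != 0)
        return false;        /* BUG: lock still held and *result untouched */
    *result = s->cached;
    pthread_mutex_unlock(&s->lock);
    return true;
}

/* Fixed shape: a single exit path always sets *result and releases the
   lock, whether or not stat() succeeded. */
static bool check_reload_and_get_fixed(struct db_state *s, const char *path,
                                       const char **result)
{
    struct stat st;
    bool ok = true;
    pthread_mutex_lock(&s->lock);
    if (stat(path, &st) != 0)
        ok = false;          /* cannot check for changes: keep current state */
    /* A real implementation might reload here when the file changed. */
    *result = s->cached;     /* always hand back a usable value */
    pthread_mutex_unlock(&s->lock);
    return ok;
}
```

The single-exit style in the fixed version is one common way to make sure every path unlocks; an unlock call before each early return works just as well.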
Created attachment 766418: nss-database-yield-lock.patch

(Note this patch is *on top* of the previous changes, so just add it in /etc/portage/patches and emerge -v1 glibc).
(In reply to Sam James from comment #21)
> (In reply to Sam James from comment #20)
> > (In reply to Sam James from comment #19)
> > > I think we need someone hitting this to report it to Firefox upstream
> > > (Mozilla) then. I'm pretty confident the glibc patch is right (although it
> > > could still be an issue on that side) though.
> >
> > ... although that said, I can't explain why -fam still fixes it. So maybe
> > I'm missing something.
>
> waaait. We're taking a lock in nss_database_check_reload_and_get() and not
> yielding it when we return.
>
> Can someone try the attached patch please?

I can confirm that the attached patch indeed fixes the issue for me, thanks Sam! (The patch needs a small whitespace change to apply to sys-libs/glibc-2.34-r9, though.)
Sorry I haven't found the time to reproduce this yet, but I'm glad to see the progress that's happening in here.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/toolchain/glibc-patches.git/commit/?id=6002612f230d2b8d88fefba6c6477a20e77efc23

commit 6002612f230d2b8d88fefba6c6477a20e77efc23
Author:     Andreas K. Hüttel <dilfridge@gentoo.org>
AuthorDate: 2022-03-07 01:03:46 +0000
Commit:     Andreas K. Hüttel <dilfridge@gentoo.org>
CommitDate: 2022-03-07 01:03:46 +0000

    Additional fixup for the glibc/firefox/seccomp interaction

    Bug: https://bugs.gentoo.org/828070
    Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>

 ...0302-Drop-glibc-lock-when-returning-early.patch | 36 ++++++++++++++++++++++
 1 file changed, 36 insertions(+)
(In reply to Sam James from comment #22)
> Created attachment 766418
> nss-database-yield-lock.patch
>
> (Note this patch is *on top* of the previous changes, so just add it in
> /etc/portage/patches and emerge -v1 glibc).

Patch is in 2.34-r10.
Don't forget to update the proposed patch upstream at ${URL}.
(In reply to Maciej S. Szmigiero from comment #27)
> Don't forget to update the proposed patch upstream at ${URL}.

Done & submitted: https://sourceware.org/bugzilla/show_bug.cgi?id=28752#c2.
Thanks to all who helped out in this bug, we've got there!

Fix landed upstream:
- https://sourceware.org/git/?p=glibc.git;a=commit;h=3fdf0a205b622e40fa7e3c4ed1e4ed4d5c6c5380
- https://sourceware.org/git/?p=glibc.git;a=commit;h=ace9e3edbca62d978b1e8f392d8a5d78500272d9

It'll be in the next patchset for glibc-2.35 in Gentoo (possibly 2.34, but we're looking to move to 2.35 completely soon).
Fixed upstream and in stable now via 2.35.