Updated to glibc-2.36-r3 and noticed that nscd immediately started crashing and keeps doing so (it was rock solid forever on 2.35). Restarting, limiting the number of threads etc. does not help. Running nscd -d in gdb and resolving a host entry gives:

(gdb) file nscd
Reading symbols from nscd...
(No debugging symbols found in nscd)
(gdb) set args -d
(gdb) r
Starting program: /usr/sbin/nscd -d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database passwd
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database group
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database hosts
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database services
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database netgroup
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
[New Thread 0x7fffed7ff6c0 (LWP 19231)]
[New Thread 0x7fffed5fe6c0 (LWP 19232)]
[New Thread 0x7fffed3fd6c0 (LWP 19233)]
[New Thread 0x7fffed1fc6c0 (LWP 19234)]
[New Thread 0x7fffecffb6c0 (LWP 19235)]
[New Thread 0x7fffecdfa6c0 (LWP 19236)]
[New Thread 0x7fffecbf96c0 (LWP 19237)]
[New Thread 0x7fffec9f86c0 (LWP 19238)]
[New Thread 0x7fffec7f76c0 (LWP 19239)]
[New Thread 0x7fffec5f66c0 (LWP 19240)]
Thu Sep 22 18:39:51 2022 - 19228: handle_request: request received (Version = 2) from PID 19241
Thu Sep 22 18:39:51 2022 - 19228: GETFDHST
Thu Sep 22 18:39:51 2022 - 19228: handle_request: request received (Version = 2) from PID 19241
Thu Sep 22 18:39:51 2022 - 19228: GETAI (www.apple.com)
Thu Sep 22 18:39:51 2022 - 19228: Haven't found "www.apple.com" in hosts cache!

Thread 8 "nscd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffecbf96c0 (LWP 19237)]
0x00007ffff7f20999 in __strlen_avx2 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff7f20999 in __strlen_avx2 () from /lib64/libc.so.6
#1 0x00005555555672b4 in addhstai ()
#2 0x0000555555567c0e in addhstai ()
#3 0x000055555555c1fa in nscd_run_worker ()
#4 0x00007ffff7e50933 in start_thread () from /lib64/libc.so.6
#5 0x00007ffff7ed21e0 in clone () from /lib64/libc.so.6

My CPU is a Zen2 and has AVX2, compiler is gcc-12.2.0.

Reproducible: Always

Steps to Reproduce:
1. update to glibc-2.36-r3
2. start nscd with hosts cache
3. resolve a host

Actual Results:
Immediate crash

Expected Results:
No crash
Doing the same on a non-AVX2 machine (old SandyBridge) crashes as well:

Thu Sep 22 18:44:25 2022 - 11641: handle_request: request received (Version = 2) from PID 11650
Thu Sep 22 18:44:25 2022 - 11641: GETFDHST
Thu Sep 22 18:44:25 2022 - 11641: provide access to FD 9, for hosts
Thu Sep 22 18:44:25 2022 - 11641: handle_request: request received (Version = 2) from PID 11650
Thu Sep 22 18:44:25 2022 - 11641: GETAI (ntp1.sda.t-online.de)
Thu Sep 22 18:44:25 2022 - 11641: Haven't found "ntp1.sda.t-online.de" in hosts cache!

Thread 7 "nscd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffecdfa6c0 (LWP 11649)]
0x00007ffff7e7edf6 in __strlen_sse2 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff7e7edf6 in __strlen_sse2 () from /lib64/libc.so.6
#1 0x0000555555566ca6 in addhstai ()
#2 0x000055555556796e in addhstai ()
#3 0x000055555555c0cc in nscd_run_worker ()
#4 0x00007ffff7e5e3c5 in start_thread () from /lib64/libc.so.6
#5 0x00007ffff7edecbc in clone3 () from /lib64/libc.so.6

Interestingly, if I run 'perf top' I see __strchr_sse2, __strlen_sse2 and others being used in a live system and by perf itself, so I guess things seem to be working; otherwise I'd see many more crashes in other apps (everything else works fine so far).
Another observation: disabling the hosts cache only and enabling all others seems to work fine, no crashes.
Created attachment 813715: emerge --info
Using the glibc-2.35 nscd binary has the same problem, so it's not a new bug in nscd.
Rebuilt glibc with debug symbols/nostrip and also extra -fno-tree-vectorize, because why not. Didn't help but we can go deeper:

Thread 11 "nscd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffec5f66c0 (LWP 1228)]
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
76 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
#1 0x0000555555566fc4 in addhstaiX (db=db@entry=0x555555577340 <dbs+704>, fd=fd@entry=17, req=req@entry=0x7fffec5f5804, key=key@entry=0x7fffec5f5a90, uid=uid@entry=4294967295, he=he@entry=0x0, dh=<optimized out>) at aicache.c:153
#2 0x0000555555567cde in addhstai (db=db@entry=0x555555577340 <dbs+704>, fd=fd@entry=17, req=req@entry=0x7fffec5f5804, key=key@entry=0x7fffec5f5a90, uid=uid@entry=4294967295) at aicache.c:526
#3 0x000055555555c1dd in handle_request (uid=4294967295, pid=<optimized out>, key=0x7fffec5f5a90, req=0x7fffec5f5804, fd=17) at connections.c:1202
#4 nscd_run_worker (p=<optimized out>) at connections.c:1702
#5 0x00007ffff7e51255 in start_thread (arg=<optimized out>) at pthread_create.c:442

aicache.c:153 is here:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nscd/aicache.c;hb=HEAD#l153
Also it doesn't seem to crash with single requests done manually, only with multiple requests in rapid succession, e.g. via mtr. nthreads=1 does not help.
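For anyone trying to reproduce the "rapid succession" condition without mtr, a minimal sketch follows. It is not from the report: using getent ahosts as the client is an assumption (it should resolve through getaddrinfo and therefore reach nscd's GETAI path when the hosts cache is enabled), and the function name, the LOOKUP_CMD override and the counts are made up for illustration.

```shell
#!/bin/sh
# Sketch only: fire many host lookups in quick succession.
# Assumptions (not from the report): "getent ahosts" is used as the
# lookup client; LOOKUP_CMD is a hypothetical override for testing.

# hammer_lookups COUNT HOST...
# Starts COUNT background lookups per HOST, waits for all of them,
# then prints how many lookups were issued.
hammer_lookups() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        for host in "$@"; do
            # Intentionally unquoted so "getent ahosts" splits into
            # command and argument.
            ${LOOKUP_CMD:-getent ahosts} "$host" >/dev/null 2>&1 &
        done
        i=$((i + 1))
    done
    wait
    echo "issued $((count * $#)) lookups"
}
```

Something like "hammer_lookups 50 www.apple.com ntp1.sda.t-online.de" while nscd -d runs under gdb in another terminal would then show whether the burst triggers the segfault; whether it does is untested here.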
Looks like a check that at->name is not NULL is missing, even though it can be NULL:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nss/nss_files/files-hosts.c;hb=HEAD#l459

strlen(NULL) is undefined.
Could you report this upstream at https://www.gnu.org/software/libc/bugs.html?
(In reply to Stephan Hartmann from comment #8)
> Looks like there is a missing check for at->name is not NULL, but it can be
> NULL:
> https://sourceware.org/git/?p=glibc.git;a=blob;f=nss/nss_files/files-hosts.c;hb=HEAD#l459
>
> strlen(NULL) is undefined.

Good find! It's also checked later on at line 324, but unfortunately doing the same check at line 153 doesn't help; it either crashes or breaks name resolution in funky ways, e.g.:

mtr: Packet type unsupported: Invalid argument

I'll file a bug upstream tomorrow or over the weekend.
Discussion on a related bug (https://sourceware.org/bugzilla/show_bug.cgi?id=29605) has so far only shown that nobody there can reproduce the crash.

Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see these crashes? If so it might be something with our patch list.

Today I tried nscd with enabled hosts cache on SuSE Tumbleweed (which also has 2.36) and everything worked.
(In reply to Holger Hoffstätte from comment #11)
> Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> these crashes? If so it might be something with our patch list.

As per comment on other side, please try USE=vanilla.
(In reply to Sam James from comment #12)
> (In reply to Holger Hoffstätte from comment #11)
> > Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> > these crashes? If so it might be something with our patch list.
>
> As per comment on other side, please try USE=vanilla.

The patches you linked to are all from upstream's backport branch -- i.e. we're supposed to use them.

The only Gentoo-specific stuff is at https://gitweb.gentoo.org/proj/toolchain/glibc-patches.git/tree/9999 and the biggest lot there are from a glibc developer but not yet merged, but shouldn't affect this.

So I'm without ideas for now.
I can confirm I've been experiencing these crashes (on 2 machines, both servers... and "weird" behaviour from nscd on my local workstation; otherwise inexplicable lookup failures)... I've switched to unscd in the short-term if only to confirm I don't have other issues. I did manage to get coredump and stack trace from `coredumpctl gdb` which on the face are the same as previous comments; segv somewhere in/around addhstai. I've failed to install with USE=vanilla, firstly because sandbox violations (presumably that's what the first Gentoo specific patch sorts)... and then again with sandbox disabled landed me with collisions on tzselect and zdump with timezone-data package... didn't see that one coming and no idea what's caused it. Going to have to put this down for a bit, maybe tomorrow, but I'll take another crack at it next chance I get.
(In reply to Sam James from comment #12)
> (In reply to Holger Hoffstätte from comment #11)
> > Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> > these crashes? If so it might be something with our patch list.
>
> As per comment on other side, please try USE=vanilla.

So after unmasking +vanilla, removing colliding tzdata stuff from /usr/bin and several attempts to build with +vanilla but without sandboxes to please ldconfig, the verdict is in:

nscd with enabled host cache is stable and works just fine, incl. a hammering with mtr.

Something something notable patches.. :)
(In reply to Dan Goodliffe from comment #14)
> I can confirm I've been experiencing these crashes (on 2 machines, both
> servers... and "weird" behaviour from nscd on my local workstation;
> otherwise inexplicable lookup failures)... I've switched to unscd in the
> short-term if only to confirm I don't have other issues.
>
> I did manage to get coredump and stack trace from `coredumpctl gdb` which on
> the face are the same as previous comments; segv somewhere in/around
> addhstai.
>
> I've failed to install with USE=vanilla, firstly because sandbox violations
> (presumably that's what the first Gentoo specific patch sorts)... and then
> again with sandbox disabled landed me with collisions on tzselect and zdump
> with timezone-data package... didn't see that one coming and no idea what's
> caused it.
>
> Going to have to put this down for a bit, maybe tomorrow, but I'll take
> another crack at it next chance I get.

Welcome to the club! You can remove the conflicting tzbla files from /usr/bin and then it's hammertime with:

FEATURES="-sandbox -usersandbox" USE=vanilla emerge -v1 --nodeps glibc

...highwaaayy to the danger zone...
(In reply to Holger Hoffstätte from comment #15)
> (In reply to Sam James from comment #12)
> > (In reply to Holger Hoffstätte from comment #11)
> > > Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> > > these crashes? If so it might be something with our patch list.
> >
> > As per comment on other side, please try USE=vanilla.
>
> So after unmasking +vanilla, removing colliding tzdata stuff from /usr/bin
> and several attempts to build with +vanilla but without sandboxes to please
> ldconfig, the verdict is in:
>
> nscd with enabled host cache is stable and works just fine, incl. a
> hammering with mtr.
>
> Something something notable patches.. :)

:D

Now, the next question is, is it a patch from that small set I linked, or the branch you did? The former is our fault and there's no obvious candidate there IMO, the latter is upstream and I'd guess opensuse isn't on the same commit as us.

I'd say bisect using 9999+vanilla and the EGIT override variables for the stable/2.36 branch from release and up?
(tbh, it might be easiest to just grab all the commits, apply to vanilla, then rm each newer .patch until you hit it, as there's not that many)
(In reply to Sam James from comment #17)
> > Something something notable patches.. :)
>
> :D
>
> Now, the next question is, is it a patch from that small set I linked, or
> the branch you did? The former is our fault and there's no obvious candidate
> there IMO, the latter is upstream and I'd guess opensuse isn't on the same
> commit as us.

The 9999 patches seem harmless and have nothing to do with the resolver. I was just hunting down what exactly TuMbLeWeEd builds, but I strongly suspect the reason for the problem is in the patches for the complete resolver rewrite by Florian Weimer (resolv, nss_dns commit prefixes). Maybe we're missing a followup. Anyway I'll try to remove them from the -r3 patchset and see what happens.
It reproduces with vanilla -9999 so there's that.
(In reply to Holger Hoffstätte from comment #20)
> It reproduces with vanilla -9999 so there's that.

I did have a cheesy script which I've lost, but it was a git bisect run script which used a local git checkout but emerged glibc-9999 on the right branch and commit using the EGIT_* vars.

If you can make it crash reliably, you might want to do that and just let it run overnight.
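Since the original script is lost, here is a rough sketch of what such a git bisect run helper could look like. Everything in it is an assumption: the function name, the RUN_BISECT_STEP switch, and the OVERRIDE_VAR and REPRO_CMD variables are hypothetical, and the real name of the EGIT override variable for the glibc repository has to be looked up rather than taken from here. Only the exit-code convention (0 = good, 1 = bad, 125 = skip) is standard git bisect run behaviour.

```shell
#!/bin/sh
# Sketch of a "git bisect run" helper; not the lost original.
# Hypothetical names: bisect_step, RUN_BISECT_STEP, OVERRIDE_VAR, REPRO_CMD.

# bisect_step BUILD_CMD TEST_CMD
# Maps the outcome onto the exit codes git bisect run understands:
#   0   = good (build worked, reproducer passed)
#   1   = bad  (build worked, reproducer failed)
#   125 = skip (this commit could not be built, so it cannot be judged)
bisect_step() {
    build_cmd=$1
    test_cmd=$2
    if ! sh -c "$build_cmd"; then
        return 125
    fi
    if sh -c "$test_cmd"; then
        return 0
    fi
    return 1
}

# Only do real work when explicitly asked, so sourcing this file is harmless.
if [ "${RUN_BISECT_STEP:-0}" = 1 ]; then
    # Commit that git bisect currently has checked out in the local clone.
    commit=$(git rev-parse HEAD)
    # OVERRIDE_VAR must hold the EGIT override variable name for glibc;
    # REPRO_CMD must be a command that fails when nscd crashes.
    bisect_step \
        "env USE=vanilla ${OVERRIDE_VAR:?set to the EGIT override variable name}=$commit emerge -1 =sys-libs/glibc-9999" \
        "${REPRO_CMD:?set to a command that fails when nscd crashes}"
    exit $?
fi
```

It would be driven from the local checkout with something like "RUN_BISECT_STEP=1 OVERRIDE_VAR=... REPRO_CMD=... git bisect run /path/to/this-script", after marking one good and one bad commit.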
As expected the offending patch is: 0048-nss_dns-Rewrite-_nss_dns_gethostbyname4_r-using-curr.patch which just activates stuff added/modified in the previous patches. When I remove this from the patch set and rebuild, the crash is gone.
Thanks for investigating, Holger. I've just done a build/install without 0048-...patch and 5 minutes in, solid as a rock... so yeah, seems to be that one.
(In reply to Dan Goodliffe from comment #23)
> Thanks for investigating Holger, I've just done an build/install without
> 0048-...patch and 5 minutes in, solid as a rock... so yeah, seems to be that
> one.

Great to know, thanks for verifying! Since that patch just enables code added/modified in previous patches the actual bug could also be somewhere else, however I'll leave that for upstream to figure out. I've added a comment on the upstream bug.
For those who want to run 2.36 but also like a working nscd, the easiest fix is:

- emerge patchutils
- mkdir -p /etc/portage/patches/sys-libs/glibc
- unpack the glibc-2.36-rX patches from your $DISTDIR
- run: 'interdiff 0048-nss_dns-Rewrite-_nss_dns_gethostbyname4_r-using-curr.patch /dev/null > /etc/portage/patches/sys-libs/glibc/revert-rewrite-resolver.patch'
- emerge -v1 glibc

This will create a reverse patch and apply it after everything else.
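The interdiff step above can be wrapped in a small function, sketched here under assumptions: the function name and the INTERDIFF override are hypothetical, the patch has already been unpacked from the patch tarball by hand, and the default output directory is the one named in the steps.

```shell
#!/bin/sh
# Sketch of the "create a reverse patch" step; names are illustrative.
# Assumptions: patchutils (interdiff) is installed and the 0048 patch has
# already been unpacked. INTERDIFF is a hypothetical override for testing.

# make_revert_patch PATCH_FILE [PATCH_DIR]
# Writes the reverse of PATCH_FILE to PATCH_DIR/revert-rewrite-resolver.patch.
make_revert_patch() {
    patch_file=$1
    patch_dir=${2:-/etc/portage/patches/sys-libs/glibc}
    if [ ! -f "$patch_file" ]; then
        echo "no such patch: $patch_file" >&2
        return 1
    fi
    mkdir -p "$patch_dir" || return 1
    # Diffing the patch against /dev/null yields its reverse.
    ${INTERDIFF:-interdiff} "$patch_file" /dev/null \
        > "$patch_dir/revert-rewrite-resolver.patch"
}
```

After calling it with the path to the unpacked 0048 patch, the remaining step is still "emerge -v1 glibc" as listed above.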
Rejoice! The bug is fixed in glibc, see the linked bug at Sourceware. I've been running with the patch on several machines and it works reliably again.
Fixed in 2.36-r5 aka commit 961b6054cf5f