Bug 872401 - sys-libs/glibc-2.36-r3: nscd crashes in __strlen_avx2 when hosts cache is enabled
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal critical
Assignee: Gentoo Toolchain Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: glibc-2.36
Reported: 2022-09-22 16:42 UTC by Holger Hoffstätte
Modified: 2022-10-16 08:55 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
emerge --info (emerge-info.log,6.76 KB, text/x-log)
2022-09-22 17:04 UTC, Holger Hoffstätte

Description Holger Hoffstätte 2022-09-22 16:42:55 UTC
Updated to glibc-2.36-r3 and noticed that nscd immediately started crashing over and over (it was rock solid forever on 2.35). Restarting, limiting the number of threads etc. does not help. Running nscd -d in gdb and resolving a host entry gives:

(gdb) file nscd
Reading symbols from nscd...
(No debugging symbols found in nscd)
(gdb) set args -d
(gdb) r
Starting program: /usr/sbin/nscd -d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database passwd
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database group
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database hosts
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database services
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
Thu Sep 22 18:39:47 2022 - 19228: monitoring file /etc/nsswitch.conf for database netgroup
Thu Sep 22 18:39:47 2022 - 19228: monitoring file `/etc/nsswitch.conf` (1)
Thu Sep 22 18:39:47 2022 - 19228: monitoring directory `/etc` (2)
[New Thread 0x7fffed7ff6c0 (LWP 19231)]
[New Thread 0x7fffed5fe6c0 (LWP 19232)]
[New Thread 0x7fffed3fd6c0 (LWP 19233)]
[New Thread 0x7fffed1fc6c0 (LWP 19234)]
[New Thread 0x7fffecffb6c0 (LWP 19235)]
[New Thread 0x7fffecdfa6c0 (LWP 19236)]
[New Thread 0x7fffecbf96c0 (LWP 19237)]
[New Thread 0x7fffec9f86c0 (LWP 19238)]
[New Thread 0x7fffec7f76c0 (LWP 19239)]
[New Thread 0x7fffec5f66c0 (LWP 19240)]
Thu Sep 22 18:39:51 2022 - 19228: handle_request: request received (Version = 2) from PID 19241
Thu Sep 22 18:39:51 2022 - 19228: 	GETFDHST
Thu Sep 22 18:39:51 2022 - 19228: handle_request: request received (Version = 2) from PID 19241
Thu Sep 22 18:39:51 2022 - 19228: 	GETAI (www.apple.com)
Thu Sep 22 18:39:51 2022 - 19228: Haven't found "www.apple.com" in hosts cache!

Thread 8 "nscd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffecbf96c0 (LWP 19237)]
0x00007ffff7f20999 in __strlen_avx2 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7f20999 in __strlen_avx2 () from /lib64/libc.so.6
#1  0x00005555555672b4 in addhstai ()
#2  0x0000555555567c0e in addhstai ()
#3  0x000055555555c1fa in nscd_run_worker ()
#4  0x00007ffff7e50933 in start_thread () from /lib64/libc.so.6
#5  0x00007ffff7ed21e0 in clone () from /lib64/libc.so.6

My CPU is a Zen2 and has AVX2, compiler is gcc-12.2.0.


Reproducible: Always

Steps to Reproduce:
1. update to glibc-2.36-r3
2. start nscd with hosts cache
3. resolve a host

Actual Results:  
Immediate crash


Expected Results:  
No crash
Comment 1 Holger Hoffstätte 2022-09-22 16:53:03 UTC
Doing the same on a non-AVX2 machine (old SandyBridge) crashes as well:

Thu Sep 22 18:44:25 2022 - 11641: handle_request: request received (Version = 2) from PID 11650
Thu Sep 22 18:44:25 2022 - 11641: 	GETFDHST
Thu Sep 22 18:44:25 2022 - 11641: provide access to FD 9, for hosts
Thu Sep 22 18:44:25 2022 - 11641: handle_request: request received (Version = 2) from PID 11650
Thu Sep 22 18:44:25 2022 - 11641: 	GETAI (ntp1.sda.t-online.de)
Thu Sep 22 18:44:25 2022 - 11641: Haven't found "ntp1.sda.t-online.de" in hosts cache!

Thread 7 "nscd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffecdfa6c0 (LWP 11649)]
0x00007ffff7e7edf6 in __strlen_sse2 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7e7edf6 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x0000555555566ca6 in addhstai ()
#2  0x000055555556796e in addhstai ()
#3  0x000055555555c0cc in nscd_run_worker ()
#4  0x00007ffff7e5e3c5 in start_thread () from /lib64/libc.so.6
#5  0x00007ffff7edecbc in clone3 () from /lib64/libc.so.6

Interestingly if I run 'perf top' I see __strchr_sse2, __strlen_sse2 and others being used in a live system and by perf itself, so I guess things seem to be working, otherwise I'd see many more crashes in other apps (everything else works fine so far).
Comment 2 Holger Hoffstätte 2022-09-22 16:57:14 UTC
Another observation: disabling the hosts cache only and enabling all others seems to work fine, no crashes.
Comment 3 Holger Hoffstätte 2022-09-22 17:04:42 UTC
Created attachment 813715 [details]
emerge --info
Comment 4 Holger Hoffstätte 2022-09-22 17:17:46 UTC
Using the glibc-2.35 nscd binary has the same problem, so it's not a new bug in nscd.
Comment 5 Holger Hoffstätte 2022-09-22 19:23:03 UTC
Rebuilt glibc with debug symbols/nostrip and also extra -fno-tree-vectorize, because why not. Didn't help but we can go deeper:

Thread 11 "nscd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffec5f66c0 (LWP 1228)]
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
76	../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
#1  0x0000555555566fc4 in addhstaiX (db=db@entry=0x555555577340 <dbs+704>, fd=fd@entry=17, req=req@entry=0x7fffec5f5804, key=key@entry=0x7fffec5f5a90, uid=uid@entry=4294967295, 
    he=he@entry=0x0, dh=<optimized out>) at aicache.c:153
#2  0x0000555555567cde in addhstai (db=db@entry=0x555555577340 <dbs+704>, fd=fd@entry=17, req=req@entry=0x7fffec5f5804, key=key@entry=0x7fffec5f5a90, uid=uid@entry=4294967295)
    at aicache.c:526
#3  0x000055555555c1dd in handle_request (uid=4294967295, pid=<optimized out>, key=0x7fffec5f5a90, req=0x7fffec5f5804, fd=17) at connections.c:1202
#4  nscd_run_worker (p=<optimized out>) at connections.c:1702
#5  0x00007ffff7e51255 in start_thread (arg=<optimized out>) at pthread_create.c:442

aicache.c:153 is here:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nscd/aicache.c;hb=HEAD#l153
Comment 6 Holger Hoffstätte 2022-09-22 19:24:58 UTC
Also it doesn't seem to crash with single requests done manually, only with multiple requests in rapid succession, e.g. via mtr. nthreads=1 does not help.
Comment 7 Holger Hoffstätte 2022-09-22 19:25:18 UTC
Also it doesn't seem to crash with single requests, only with multiple requests in rapid succession, e.g. via mtr. nthreads=1 does not help.
Comment 8 Stephan Hartmann (RETIRED) gentoo-dev 2022-09-22 19:38:58 UTC
Looks like there is no check that at->name is not NULL, but it can be NULL:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nss/nss_files/files-hosts.c;hb=HEAD#l459

strlen(NULL) is undefined.
Comment 9 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-22 19:46:19 UTC
Could you report this upstream at https://www.gnu.org/software/libc/bugs.html?
Comment 10 Holger Hoffstätte 2022-09-22 20:05:32 UTC
(In reply to Stephan Hartmann from comment #8)
> Looks like there is no check that at->name is not NULL, but it can be
> NULL:
> https://sourceware.org/git/?p=glibc.git;a=blob;f=nss/nss_files/files-hosts.c;
> hb=HEAD#l459
> 
> strlen(NULL) is undefined.

Good find! It's also checked later on at line 324, but unfortunately adding the same check at line 153 doesn't help: it either crashes or breaks name resolution in funky ways, e.g.: mtr: Packet type unsupported: Invalid argument

I'll file a bug upstream tomorrow or over the weekend.
Comment 11 Holger Hoffstätte 2022-09-26 18:19:04 UTC
Discussion on a related bug (https://sourceware.org/bugzilla/show_bug.cgi?id=29605) only resulted in a lack of reproducibility.

Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see these crashes? If so it might be something with our patch list.

Today I tried nscd with enabled hosts cache on SuSE Tumbleweed (which also has 2.36) and everything worked.
Comment 12 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-26 18:20:26 UTC
(In reply to Holger Hoffstätte from comment #11)
> Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> these crashes? If so it might be something with our patch list.
> 

As per comment on other side, please try USE=vanilla.
Comment 13 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-26 18:21:36 UTC
(In reply to Sam James from comment #12)
> (In reply to Holger Hoffstätte from comment #11)
> > Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> > these crashes? If so it might be something with our patch list.
> > 
> 
> As per comment on other side, please try USE=vanilla.

The patches you linked to are all from upstream's backport branch -- i.e. we're supposed to use them.

The only Gentoo-specific stuff is at https://gitweb.gentoo.org/proj/toolchain/glibc-patches.git/tree/9999. The biggest lot there is from a glibc developer and not yet merged, but it shouldn't affect this. So I'm out of ideas for now.
Comment 14 Dan Goodliffe 2022-09-26 19:06:14 UTC
I can confirm I've been experiencing these crashes (on 2 machines, both servers... and "weird" behaviour from nscd on my local workstation; otherwise inexplicable lookup failures)... I've switched to unscd in the short-term if only to confirm I don't have other issues.

I did manage to get coredump and stack trace from `coredumpctl gdb` which on the face are the same as previous comments; segv somewhere in/around addhstai.

I've failed to install with USE=vanilla, firstly because sandbox violations (presumably that's what the first Gentoo specific patch sorts)... and then again with sandbox disabled landed me with collisions on tzselect and zdump with timezone-data package... didn't see that one coming and no idea what's caused it.

Going to have to put this down for a bit, maybe tomorrow, but I'll take another crack at it next chance I get.
Comment 15 Holger Hoffstätte 2022-09-26 19:06:56 UTC
(In reply to Sam James from comment #12)
> (In reply to Holger Hoffstätte from comment #11)
> > Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> > these crashes? If so it might be something with our patch list.
> > 
> 
> As per comment on other side, please try USE=vanilla.

So after unmasking +vanilla, removing colliding tzdata stuff from /usr/bin and several attempts to build with +vanilla but without sandboxes to please ldconfig, the verdict is in:

nscd with enabled host cache is stable and works just fine, incl. a hammering with mtr.

Something something notable patches.. :)
Comment 16 Holger Hoffstätte 2022-09-26 19:08:58 UTC
(In reply to Dan Goodliffe from comment #14)
> I can confirm I've been experiencing these crashes (on 2 machines, both
> servers... and "weird" behaviour from nscd on my local workstation;
> otherwise inexplicable lookup failures)... I've switched to unscd in the
> short-term if only to confirm I don't have other issues.
> 
> I did manage to get coredump and stack trace from `coredumpctl gdb` which on
> the face are the same as previous comments; segv somewhere in/around
> addhstai.
> 
> I've failed to install with USE=vanilla, firstly because sandbox violations
> (presumably that's what the first Gentoo specific patch sorts)... and then
> again with sandbox disabled landed me with collisions on tzselect and zdump
> with timezone-data package... didn't see that one coming and no idea what's
> caused it.
> 
> Going to have to put this down for a bit, maybe tomorrow, but I'll take
> another crack at it next chance I get.

Welcome to the club! You can remove the conflicting tzbla files from /usr/bin and then it's hammertime with:

FEATURES="-sandbox -usersandbox" USE=vanilla emerge -v1 --nodeps glibc

...highwaaayy to the danger zone...
Comment 17 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-26 19:10:39 UTC
(In reply to Holger Hoffstätte from comment #15)
> (In reply to Sam James from comment #12)
> > (In reply to Holger Hoffstätte from comment #11)
> > > Does anybody else running nscd with enabled hosts cache on glibc-2.36-r3 see
> > > these crashes? If so it might be something with our patch list.
> > > 
> > 
> > As per comment on other side, please try USE=vanilla.
> 
> So after unmasking +vanilla, removing colliding tzdata stuff from /usr/bin
> and several attempts to build with +vanilla but without sandboxes to please
> ldconfig, the verdict is in:
> 
> nscd with enabled host cache is stable and works just fine, incl. a
> hammering with mtr.
> 
> Something something notable patches.. :)

:D

Now, the next question is, is it a patch from that small set I linked, or the branch you did? The former is our fault and there's no obvious candidate there IMO, the latter is upstream and I'd guess opensuse isn't on the same commit as us.

I'd say bisect using 9999+vanilla and the EGIT override variables for the stable/2.36 branch from release and up?
Comment 18 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-26 19:16:57 UTC
(tbh, it might be easiest to just grab all the commits, apply to vanilla, then rm each newer .patch until you hit it, as there's not that many)
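A rough sketch of that "rm each newer .patch" loop, in case it helps. The function name, the directory argument and the test command are all hypothetical; the test command is assumed to rebuild glibc with the remaining patches and exit 0 only when nscd no longer crashes. It also assumes a single patch is responsible and that patch file names contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper: drop patches newest-first until the test command
# passes, then report the patch whose removal made it pass.
#
# $1: directory holding the numbered *.patch files (0001-..., 0002-...)
# $2: command string that exits 0 when the crash is gone (assumption:
#     it rebuilds glibc from the remaining patches and exercises nscd)
find_offending_patch() {
    patch_dir="$1"
    test_cmd="$2"

    # Numbered prefixes sort lexically, so a reverse sort is newest-first.
    for p in $(ls "$patch_dir" | grep '\.patch$' | sort -r); do
        rm -- "$patch_dir/$p" || return 2
        if sh -c "$test_cmd"; then
            printf '%s\n' "$p"
            return 0
        fi
    done

    # Every patch removed and the test still fails.
    return 1
}
```

Work on a copy of the patch directory, since the loop deletes files as it goes.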
Comment 19 Holger Hoffstätte 2022-09-26 19:31:27 UTC
(In reply to Sam James from comment #17)
> > Something something notable patches.. :)
> 
> :D
> 
> Now, the next question is, is it a patch from that small set I linked, or
> the branch you did? The former is our fault and there's no obvious candidate
> there IMO, the latter is upstream and I'd guess opensuse isn't on the same
> commit as us.

The 9999 patches seem harmless and have nothing to do with the resolver. I was just hunting down what exactly TuMbLeWeEd builds, but I strongly suspect the reason for the problem is in the patches for the complete resolver rewrite by Florian Weimer (resolv, nss_dns commit prefixes). Maybe we're missing a followup.
Anyway I'll try to remove them from the -r3 patchset and see what happens.
Comment 20 Holger Hoffstätte 2022-09-26 20:08:18 UTC
It reproduces with vanilla -9999 so there's that.
Comment 21 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-26 20:09:25 UTC
(In reply to Holger Hoffstätte from comment #20)
> It reproduces with vanilla -9999 so there's that.

I did have a cheesy script which I've lost, but it was a git bisect run script which used a local git checkout but emerged glibc-9999 on the right branch and commit using the EGIT_* vars. If you can make it crash reliably, you might want to do that and just let it run overnight.
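Not the lost script, just a minimal sketch of what such a helper could look like. Everything named here is an assumption: OVERRIDE_VAR must hold the name of the EGIT_* commit override variable that applies to the glibc repository (check the eclass output for the exact name), and REPRO_CMD is a hypothetical command that restarts nscd, hammers it, and exits non-zero on a crash. The FEATURES and USE settings are the ones used elsewhere in this bug. The exit codes follow the git bisect run convention: 0 is good, 1 is bad, 125 means skip, and anything above 127 aborts the bisect.

```shell
#!/bin/sh
# Hypothetical bisect helper; source it and call bisect_step from
# `git bisect run` inside a local glibc checkout.

# Map build and reproducer results to a git-bisect exit code.
bisect_verdict() {
    build_status="$1"
    repro_status="$2"
    if [ "$build_status" -ne 0 ]; then
        echo 125    # could not build this commit: skip it
    elif [ "$repro_status" -ne 0 ]; then
        echo 1      # reproducer failed: commit is bad
    else
        echo 0      # reproducer passed: commit is good
    fi
}

bisect_step() {
    # Abort the whole bisect (code > 127) instead of mislabeling commits.
    if [ -z "${OVERRIDE_VAR:-}" ] || [ -z "${REPRO_CMD:-}" ]; then
        echo "set OVERRIDE_VAR and REPRO_CMD first" >&2
        return 128
    fi

    commit=$(git rev-parse HEAD) || return 128

    build_status=0
    env FEATURES="-sandbox -usersandbox" USE=vanilla \
        "${OVERRIDE_VAR}=${commit}" \
        emerge -v1 --nodeps =sys-libs/glibc-9999 || build_status=$?

    repro_status=0
    if [ "$build_status" -eq 0 ]; then
        sh -c "$REPRO_CMD" || repro_status=$?
    fi

    return "$(bisect_verdict "$build_status" "$repro_status")"
}
```

With the file saved under a name of your choice (nscd-bisect.sh here is made up), it could be driven with: git bisect run sh -c '. ./nscd-bisect.sh && bisect_step'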
Comment 22 Holger Hoffstätte 2022-09-27 05:29:58 UTC
As expected the offending patch is:

0048-nss_dns-Rewrite-_nss_dns_gethostbyname4_r-using-curr.patch

which just activates stuff added/modified in the previous patches.

When I remove this from the patch set and rebuild, the crash is gone.
Comment 23 Dan Goodliffe 2022-09-27 08:33:43 UTC
Thanks for investigating Holger, I've just done a build/install without 0048-...patch and 5 minutes in, solid as a rock... so yeah, seems to be that one.
Comment 24 Holger Hoffstätte 2022-09-27 08:46:40 UTC
(In reply to Dan Goodliffe from comment #23)
> Thanks for investigating Holger, I've just done a build/install without
> 0048-...patch and 5 minutes in, solid as a rock... so yeah, seems to be that
> one.

Great to know, thanks for verifying!
Since that patch just enables code added/modified in previous patches, the actual bug could also be somewhere else; however, I'll leave that for upstream to figure out. I've added a comment on the upstream bug.
Comment 25 Holger Hoffstätte 2022-10-03 11:54:26 UTC
For those who want to run 2.36 but also like a working nscd, the easiest fix is:

- emerge patchutils
- mkdir -p /etc/portage/patches/sys-libs/glibc
- unpack the glibc-2.36-rX patches from your $DISTDIR
- run: 'interdiff 0048-nss_dns-Rewrite-_nss_dns_gethostbyname4_r-using-curr.patch /dev/null > /etc/portage/patches/sys-libs/glibc/revert-rewrite-resolver.patch'
- emerge -v1 glibc

This will create a reverse patch and apply it after everything else.
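The steps above could be wrapped up roughly like this. It is only a sketch: the function names and the DRY_RUN switch are made up for illustration, and it assumes the glibc patch tarball from $DISTDIR has already been unpacked into the directory passed as the first argument.

```shell
#!/bin/sh
# Sketch of the workaround as one function. Assumption: $1 is the
# directory containing the already unpacked glibc-2.36-rX patches.

PATCH_NAME="0048-nss_dns-Rewrite-_nss_dns_gethostbyname4_r-using-curr.patch"
USER_PATCH_DIR="/etc/portage/patches/sys-libs/glibc"

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

revert_resolver_rewrite() {
    src="$1/$PATCH_NAME"
    dst="$USER_PATCH_DIR/revert-rewrite-resolver.patch"

    run emerge patchutils || return 1            # provides interdiff
    run mkdir -p "$USER_PATCH_DIR" || return 1

    # Diffing the patch against /dev/null yields its reverse, which
    # portage then applies as a user patch after everything else.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "interdiff $src /dev/null > $dst"
    else
        interdiff "$src" /dev/null > "$dst" || return 1
    fi

    run emerge -v1 glibc
}
```

Running it once with DRY_RUN=1 prints the commands without touching the system, e.g.: DRY_RUN=1 revert_resolver_rewrite /path/to/unpacked/patches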
Comment 26 Holger Hoffstätte 2022-10-05 06:10:19 UTC
Rejoice! The bug is fixed in glibc, see the linked bug at Sourceware.
I've been running with the patch on several machines and it works reliably again.
Comment 27 Holger Hoffstätte 2022-10-16 08:55:35 UTC
Fixed in 2.36-r5 aka commit 961b6054cf5f