I am not entirely certain what happened, but after an apparently successful merge of sys-libs/glibc-2.35 on my Octane build machine for MIPS, the locale setup was completely broken. This system uses the N32 ABI, so LIBDIR is /lib32 and /usr/lib32, and I wonder whether there is a link to Bug #753740 here in some form.

Here's the output after the filesystem merge phase completed:

> >>> Installing (1 of 1) sys-libs/glibc-2.35-r2::gentoo
> * Defaulting /etc/host.conf:multi to on
> * Last-minute run tests with ./ld.so.1 in /lib32 ...
> locale-gen --jobs 3 --config /etc/locale.gen --destdir /
> * Building locales in DESTDIR '/'
> * Generating 3 locales (this might take a while) with 3 jobs
> * (1/3) Generating en_US.ISO-8859-15 ...failed to set locale!
> [error] character map file `ISO-8859-15' not found: No such file or directory
> failed to set locale!
> [error] default character map file `ANSI_X3.4-1968' not found: No such file or directory
> [ !! ]
> * (2/3) Generating en_US.UTF-8 ...failed to set locale!
> [error] character map file `UTF-8' not found: No such file or directory
> failed to set locale!
> [error] default character map file `ANSI_X3.4-1968' not found: No such file or directory
> [ !! ]
> * (3/3) Generating C.UTF-8 ...failed to set locale!
> [error] character map file `UTF-8' not found: No such file or directory
> failed to set locale!
> [error] default character map file `ANSI_X3.4-1968' not found: No such file or directory
> [ !! ]
> * Generation complete
> * Adding locales to archive ..."//usr/lib/locale/C.utf8" is no directory; ignored
> "//usr/lib/locale/en_US.utf8" is no directory; ignored
> "//usr/lib/locale/en_US" is no directory; ignored
> [ !! ]
> * After upgrading glibc, please restart all running processes.
> * Be sure to include init (telinit u) or systemd (systemctl daemon-reexec).
> * Alternatively, reboot your system.
> * (See bug #660556, bug #741116, bug #823756, etc)

I tried some manual fixes and searched Google, but couldn't pin down a cause or find a solution. The locale files for 2.35 were either missing or not installed in the right location, which broke both locale-gen and localedef. I ended up having to boot from a netboot image and manually untar the previous glibc binpkg (2.34) over root to get the system back into working order. As such, I don't have a build log available, but I do have the contents of /var/db/pkg/sys-libs/glibc-2.35-r2/ if any files in there may be relevant to debugging this.
Created attachment 769931 [details] emerge --info from IP30
Per bug 753740, we always use /usr/lib/locale, not /usr/LIBDIR/locale. Does the /usr/lib/locale directory exist? If not, could you try creating an empty directory there, and then run locale-gen? If that resolves the issue, we might just need to add a keepdir to the glibc ebuild.
Also, it would be helpful to know whether /usr/share/i18n/charmap/{ISO-8859-1,UTF-8}.gz exist when the "bad" version of glibc is installed.
That should be /usr/share/i18n/charmaps/{ISO-8859-1,UTF-8}.gz
(In reply to Mike Gilbert from comment #2)
> Per bug 753740, we always use /usr/lib/locale, not /usr/LIBDIR/locale.
>
> Does the /usr/lib/locale directory exist? If not, could you try creating an
> empty directory there, and then run locale-gen?
>
> If that resolves the issue, we might just need to add a keepdir to the glibc
> ebuild.

I unfortunately had to restore the machine to a working state, so unpacking the older glibc will most likely have covered up the cause. However, I DO recall that /usr/lib/locale/ existed and contained a single file, locale-archive, which was 102KB in size. I had an older /usr/lib32/locale directory sitting around from Oct 2020, with a locale-archive in it that was 4.7MB. Copying that to /usr/lib/locale/ briefly got localedef (I think) to spit out the configured locales. Attempting to re-run locale-gen failed, which truncated the file back to 102KB.

(In reply to Mike Gilbert from comment #4)
> That should be /usr/share/i18n/charmaps/{ISO-8859-1,UTF-8}.gz

Per /var/db/pkg/sys-libs/glibc-2.35-r2/CONTENTS:

obj /usr/share/i18n/charmaps/ISO-8859-1.gz 21d6f776f2e0c4619960d57bd0ede2e3 1649622164
obj /usr/share/i18n/charmaps/UTF-8.gz 03ca795f88045e90ba6b03895f087180 1649622172

So yes, they existed.
Created attachment 769934 [details] locale-gen -x output Managed to pull the full output of `locale-gen -x` from my scrollback buffer in my PuTTY client.
The filesystem layout looks ok to me. This would require further debugging. I would probably strace the localedef calls to see exactly what paths it is attempting to open.
(In reply to Mike Gilbert from comment #7)
> The filesystem layout looks ok to me. This would require further debugging.
>
> I would probably strace the localedef calls to see exactly what paths it is
> attempting to open.

I've locally masked 2.35-rX for now and am letting it update to 2.34-r11 instead. If I get time tonight, I'll unmask 2.35, step through the ebuild phases up to 'install' on 2.35-r2, and then inspect the layout under the image directory to see if anything sticks out compared to 2.34. Maybe that will shed some light on the cause.
Created attachment 769952 [details] Output of 'qmerge' phase of manual ebuild run So here's the output of the 'qmerge' phase of manually rebuilding sys-libs/glibc-2.35-r2. Nothing seems out of the ordinary. Only libSegfault.so and catchsegv are removed, so I don't think those are libraries linked-to by anything outside of glibc. I'll post some info about localedef next.
Created attachment 769955 [details] strace of localedef

This is the output of the command:

/usr/bin/localedef --no-archive -i en_US -f ISO-8859-15 -A /usr/share/locale/locale.alias --prefix / en_US

Copied from my earlier run of `locale-gen -x`. It seems the failure point is that localedef is looking for a *.bz2 filename of the locale data, but only the *.gz version exists:

openat(AT_FDCWD, "/usr/share/i18n/charmaps/ISO-8859-15.bz2", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "failed to set locale!", 21failed to set locale!) = 21
write(2, "\n", 1) = 1
write(2, "[error] character map file `ISO-"..., 50[error] character map file `ISO-8859-15' not found) = 50
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1) = 1

# ls -l /usr/share/i18n/charmaps/ISO-8859-15.bz2
ls: cannot access '/usr/share/i18n/charmaps/ISO-8859-15.bz2': No such file or directory
# ls -l /usr/share/i18n/charmaps/ISO-8859-15.gz
-rw-r--r-- 1 root root 2.9K Apr 11 00:47 /usr/share/i18n/charmaps/ISO-8859-15.gz

It seems that localedef fails to stop looking when it finds a matching filename: it checks for the gz form first and finds it, but keeps going and then tries the bz2 form, which does not exist, and it treats that as a failure.
(In reply to Joshua Kinard from comment #10)
> Created attachment 769955 [details]
> strace of localedef
>
> This is the output of the command:
> /usr/bin/localedef --no-archive -i en_US -f ISO-8859-15 -A
> /usr/share/locale/locale.alias --prefix / en_US

[snip]

Okay, I didn't copy that snippet of the strace correctly into my comment -- it's missing the successful openat() against the gz form of the file:

access("//usr/lib/locale/en_US/", W_OK) = 0
openat(AT_FDCWD, "ISO-8859-15", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/i18n/charmaps/ISO-8859-15", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/i18n/charmaps/ISO-8859-15.gz", O_RDONLY|O_LARGEFILE) = 3
^---- Success; should have stopped at this point, but it doesn't
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2966, ...}) = 0
close(3) = 0
openat(AT_FDCWD, "/usr/share/i18n/charmaps/ISO-8859-15.bz2", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
^---- Fail; file doesn't exist, it thinks this is a real error and gives up
write(2, "failed to set locale!", 21failed to set locale!) = 21
write(2, "\n", 1) = 1
write(2, "[error] character map file `ISO-"..., 50[error] character map file `ISO-8859-15' not found) = 50
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1) = 1
^---- Printing out the error to the console
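For reference, the "stop at the first match" behavior expected in the comment above can be sketched as follows. This is only an illustration, not glibc's code: the helper name `open_first_match` is hypothetical, and the suffix order ("", ".gz", ".bz2") is simply the order seen in the strace.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not glibc code): try each suffix in the order seen
   in the strace and return an open descriptor for the first candidate that
   can be opened.  On success *used points at the suffix that matched; if
   nothing can be opened, -1 is returned.  */
static int open_first_match(const char *base, const char **used)
{
    static const char *const suffixes[] = { "", ".gz", ".bz2" };
    char path[4096];

    for (size_t i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
        int n = snprintf(path, sizeof path, "%s%s", base, suffixes[i]);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;               /* path too long, skip this candidate */

        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            if (used)
                *used = suffixes[i];
            return fd;              /* stop at the first hit */
        }
    }
    return -1;
}
```

With only ISO-8859-15.gz present, a loop like this would return after the second candidate. The trace instead closes descriptor 3 and moves on to .bz2, which suggests that something after the successful openat() rejected the file.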
It looks like charmap_open should create a new process (posix_spawn/clone) with a pipe to gzip to decompress the data. I'm not seeing anything like that in the log. https://sourceware.org/git/?p=glibc.git;a=blob;f=locale/programs/charmap-dir.c;h=396a0d76c0e6de0b83c24a4fe5dd6ea0c08c23ef;hb=refs/heads/release/2.35/master#l206
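A minimal sketch of that open / stat / pipe / spawn sequence, for readers who want to see which syscalls to look for in a trace. This is not the glibc implementation: the function name `open_via_gzip` is hypothetical, error handling is reduced to the bare minimum, and it assumes descriptors 0-2 are already open in the caller.

```c
#include <fcntl.h>
#include <spawn.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: return a stream yielding the decompressed contents
   of `path`, produced by a child running `gzip -d -c`.  The child's pid is
   stored in *child so the caller can waitpid() on it later.  */
static FILE *open_via_gzip(const char *path, pid_t *child)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    int pfd[2];
    /* Only regular files are piped through gzip; if the mode check fails,
       no pipe() or clone() ever shows up in the trace.  */
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode) || pipe(pfd) != 0) {
        close(fd);
        return NULL;
    }

    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_adddup2(&fa, fd, STDIN_FILENO);      /* file -> stdin  */
    posix_spawn_file_actions_adddup2(&fa, pfd[1], STDOUT_FILENO); /* stdout -> pipe */
    posix_spawn_file_actions_addclose(&fa, pfd[0]);
    posix_spawn_file_actions_addclose(&fa, pfd[1]);
    posix_spawn_file_actions_addclose(&fa, fd);

    char *const argv[] = { "gzip", "-d", "-c", NULL };
    int rc = posix_spawnp(child, "gzip", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    close(fd);
    close(pfd[1]);
    if (rc != 0) {
        close(pfd[0]);
        return NULL;
    }

    FILE *fp = fdopen(pfd[0], "r");
    if (fp == NULL)
        close(pfd[0]);
    return fp;
}
```

A caller would read from the returned stream, fclose() it, and then waitpid() on the child (from <sys/wait.h>). The point for this bug is the early return: if the stat result does not look like a regular file, the function gives up before pipe() is ever called.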
Here's my own strace from amd64/glibc-2.34-r10.

https://gist.github.com/floppym/15aadc83e88784e7ebe2c06dc7938124

At line 46 we see a successful openat() call, followed by newfstatat() and pipe().
On line 58, we have a clone3() call to start a new process.
On line 193, we execve() gzip.

In your log, I see a successful openat() followed by statx(). I don't see the pipe() call.

In fopen_uncompressed(), the code calls fstat64(), which is likely a wrapper around statx() on mips. It then uses the S_ISREG () macro to check for a regular file.

Perhaps the code is getting some unexpected result/data from statx()?
(In reply to Mike Gilbert from comment #13)
> Here's my own strace from amd64/glibc-2.34-r10.
>
> https://gist.github.com/floppym/15aadc83e88784e7ebe2c06dc7938124
>
> At line 46 we see a successful openat() call, followed by newfstatat() and
> pipe().
> On line 58, we have a clone3() call to start a new process.
> On line 193, we execve() gzip.
>
> In your log, I see a successful openat() followed by statx(). I don't see the
> pipe() call.
>
> In fopen_uncompressed(), the code calls fstat64() which is likely a wrapper
> around statx() on mips. It then uses the S_ISREG () macro to check for a
> regular file.
>
> Perhaps the code is getting some unexpected result/data from statx()?

Entirely possible. The stat() family of syscalls on MIPS is a bit on the weird side, as I discovered some time ago when debugging an issue in musl related to its thin wrapper around stat(). There are some weird, I believe legacy, things the MIPS kernel code does that haven't been entirely eliminated yet. Glibc should have that handled, though. I don't routinely follow libc-alpha to keep abreast of changes, so I may have to dig through that, as well as the changelogs, and see if I can spot any MIPS-specific changes that may be to blame.
I diff'ed a prepared extract of the glibc-2.34-r11 and glibc-2.35-r2 source directories in the 'locale/programs' subdir, and the only hunk that stands out is this one:

diff -Naurp glibc-2.34/locale/programs/locarchive.c glibc-2.35/locale/programs/locarchive.c
--- glibc-2.34/locale/programs/locarchive.c 2022-04-11 08:42:28.571380117 -0400
+++ glibc-2.35/locale/programs/locarchive.c 2022-04-11 08:42:58.202064401 -0400
@@ -655,6 +654,13 @@ open_archive (struct locarhandle *ah, bo
       error (EXIT_FAILURE, errno, _("cannot read archive header"));
     }
+  /* Check the magic value */
+  if (GET (head.magic) != AR_MAGIC)
+    {
+      (void) lockf64 (fd, F_ULOCK, sizeof (struct locarhead));
+      error (EXIT_FAILURE, 0, _("bad magic value in archive header"));
+    }
+
   ah->fd = fd;
   ah->mmaped = st.st_size;

I am not 100% sure about that, though. I literally just brought the machine online again and need to set up gdb, then rebuild glibc-2.35-r2 w/ debugging so I can step through localedef's logic, better pinpoint what it's failing on, and see whether it aligns with the above hunk in any way.
(In reply to Mike Gilbert from comment #13)
[snip]
>
> Perhaps the code is getting some unexpected result/data from statx()?

FWIW, the strace output of localedef on glibc-2.34 looks correct: there are the calls to pipe() and then the invocation of gzip -dc to decompress the file. So it's definitely something 2.35 does differently, and it does not seem to be related to anything in the kernel at the moment (hopefully):

> access("//usr/lib/locale/en_US/", W_OK) = 0
> openat(AT_FDCWD, "ISO-8859-15", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/i18n/charmaps/ISO-8859-15", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/i18n/charmaps/ISO-8859-15.gz", O_RDONLY) = 3
> statx(3, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2966, ...}) = 0
> pipe([4, 5]) = 4
> getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0
> <...>
> mmap(NULL, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x77960000
> rt_sigprocmask(SIG_BLOCK, ~[], [], 16) = 0
> clone(child_stack=0x7796ffe0, flags=CLONE_VM|CLONE_VFORK|SIGCHLD <unfinished ...>
> rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP], 16) = 0
> rt_sigaction(SIGHUP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 16) = 0
> rt_sigaction(SIGHUP, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, NULL, 16) = 0
> <...>
> dup2(5, 1) = 1
> close(5) = 0
> close(4) = 0
> dup2(3, 0) = 0
> close(3) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 16) = 0
> execve("/root/bin/gzip", ["gzip", "-d", "-c"], 0x7f8c0ba0 /* 28 vars */) = -1 ENOENT (No such file or directory)
> execve("/sbin/gzip", ["gzip", "-d", "-c"], 0x7f8c0ba0 /* 28 vars */) = -1 ENOENT (No such file or directory)
> execve("/bin/gzip", ["gzip", "-d", "-c"], 0x7f8c0ba0 /* 28 vars */ <unfinished ...>
I would guess that something changed in the fstat64() libc wrapper function, or something it calls in libc.
Maybe related to this 64-bit time change? https://sourceware.org/git/?p=glibc.git;a=commit;h=a6d2f948b71adcb5ea395cb04833bc645eab45e6
(In reply to Mike Gilbert from comment #17)
> Maybe related to this 64-bit time change?
>
> https://sourceware.org/git/?p=glibc.git;a=commit;
> h=a6d2f948b71adcb5ea395cb04833bc645eab45e6

That change looks too generic. If it were the culprit, multiple archs should have broken and been reported by now. The only standout MIPS-specific change I can glean from the 2.35 changelog is BZ#28223 "mips: clone does not align stack", which leads to the change below to the clone() syscall wrapper on MIPS:

> --- a/sysdeps/unix/sysv/linux/mips/clone.S
> +++ b/sysdeps/unix/sysv/linux/mips/clone.S
> @@ -55,6 +55,13 @@ NESTED(__clone,4*SZREG,sp)
>  .set at
>  #endif
>
> + /* Align stack to 4/8 bytes per the ABI. */
> +#if _MIPS_SIM == _ABIO32
> + li t0,-4
> +#else
> + li t0,-8
> +#endif
> + and a1,a1,t0
>
>  /* Sanity check arguments. */
>  li v0,EINVAL

https://sourceware.org/git/?p=glibc.git;a=patch;h=7af07fe795f43e53d31be1c6f9adba7e05f87b0b

But even that feels too generic for MIPS itself. If that were the root cause, I would expect something more complex in the userland to have tripped all over itself during boot-up and crashed to single user. But the system boots to multiuser on glibc-2.35. Missing locales are really just an annoyance, because you get spurious warnings in the console from setlocale failing. The only program that, so far, refuses to execute is tmux.

It'll be a few hours before I get gdb built and rebuild glibc-2.35 w/ debugging to see if I can pin down a specific fault. Assuming my friendly kernel race condition/NULL deref in __update_load_avg_se() doesn't stop by for a friendly visit...
(In reply to Mike Gilbert from comment #17)
> Maybe related to this 64-bit time change?
>
> https://sourceware.org/git/?p=glibc.git;a=commit;
> h=a6d2f948b71adcb5ea395cb04833bc645eab45e6

I stand corrected. It does look like this is the fault, somehow. I traced the execution flow to `fstatat64_time64_statx` in sysdeps/unix/sysv/linux/fstatat64.c:50, where execution reaches this code:

50 int r = INTERNAL_SYSCALL_CALL (statx, fd, file, AT_NO_AUTOMOUNT | flag,
51 STATX_BASIC_STATS, &tmp);
52 if (r != 0)
53 return r;

It attempts the syscall four times before somehow failing, setting 'r' to 0, which causes the function to bail out. The vars at this point are all optimized out, so I can't see what is actually being passed, but this smells like the culprit.

If I am reading the original bug right, the trigger for that change was some inode numbers coming back bigger than a 32-bit integer. I checked the XFS inode number for "/usr/share/i18n/charmaps/ISO-8859-15.gz", and it returns 113690568, which is small enough to fit into a 32-bit value, so it must be something else.

Not sure at this point if this is related to the MIPS ABI or not. N32 is the hybrid ABI that acts like it's 32-bit in places but is really 64-bit, and in generic code, that can sometimes lead to surprises.

@vapier: Any clues? Your name is in BZ#15333, specifically for locale stuff.
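The inode-size check mentioned above is easy to repeat for other files. A small sketch, with hypothetical helper names (`fits_in_u32`, `inode_fits_in_u32`), assuming only that the value of interest is `st_ino` as reported by stat():

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical helper: true if the value is representable in 32 bits.  */
static bool fits_in_u32(uint64_t value)
{
    return value <= UINT32_MAX;
}

/* Hypothetical helper: 1 if the file's inode number fits in 32 bits,
   0 if it does not, -1 if stat() itself fails.  */
static int inode_fits_in_u32(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return fits_in_u32((uint64_t) st.st_ino) ? 1 : 0;
}
```

The inode number quoted above, 113690568, is far below 2^32 = 4294967296, which is why an oversized inode number was ruled out as the trigger here.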
I can reproduce the issue by emulating MIPS with qemu. I'll do some poking to see if I can debug further.
I wrote a simple test program to output the value of st_mode for a given file using the fstat64 function.

> mips-n32 ~ # cat stat64.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/stat.h>
>
> int main(int argc, char **argv)
> {
>     int fd;
>     struct stat64 st;
>
>     if (argc < 2)
>         return EXIT_FAILURE;
>
>     fd = open(argv[1], O_PATH);
>     if (fd < 0)
>     {
>         perror("open");
>         return EXIT_FAILURE;
>     }
>
>     if (fstat64(fd, &st) < 0)
>     {
>         perror("fstat64");
>         return EXIT_FAILURE;
>     }
>
>     printf("st_mode = %08x\n", st.st_mode);
>
>     return 0;
> }

And ran the following tests:

> mips-n32 ~ # gcc -o stat64 stat64.c
> mips-n32 ~ # ./stat64 /bin/ls
> st_mode = 000081ed
>
> mips-n32 ~ # gcc -D_FILE_OFFSET_BITS=64 -o stat64_offset64 stat64.c
> mips-n32 ~ # ./stat64_offset64 /bin/ls
> st_mode = 000081ed
>
> mips-n32 ~ # gcc -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 -o stat64_time64 stat64.c
> mips-n32 ~ # ./stat64_time64 /bin/ls
> st_mode = 00000000

The st_mode field gets set to zero when _TIME_BITS=64.

This test was performed with glibc-2.34-r11. With glibc-2.35, localedef gets compiled with -D_TIME_BITS=64, so the existing problem surfaces.
I think /usr/include/bits/struct_stat.h needs to be changed. It looks like it was wired up for time64 in the O32 ABI (_MIPS_SIM == _ABIO32) section at the top of the file, but the N32 ABI section at the bottom of the file was never updated.
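One way to see what the installed header actually declares is to print a few layout facts about `struct stat` and compare a default build against one built with -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64. The sketch below is only a diagnostic aid; the function name `print_stat_layout` is made up, and I have not run it on N32, so no expected values are given.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical diagnostic: print where a few fields sit in this build's
   struct stat, and how wide the modification-time seconds field is.  */
static void print_stat_layout(FILE *out)
{
    fprintf(out, "sizeof(struct stat) = %zu\n", sizeof(struct stat));
    fprintf(out, "offsetof(st_mode)   = %zu\n",
            offsetof(struct stat, st_mode));
    fprintf(out, "offsetof(st_size)   = %zu\n",
            offsetof(struct stat, st_size));
    fprintf(out, "sizeof(st_mtime)    = %zu\n",
            sizeof(((struct stat *) 0)->st_mtime));
}
```

Calling `print_stat_layout(stdout)` from a small main() in both builds gives two listings to diff. If the header honors _TIME_BITS=64 for the ABI in question, the time field should be 8 bytes wide in the second build. If the listings are identical while libc fills in a time64 layout, the fields the program reads (st_mode included) would not line up with what libc wrote.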
Created attachment 770960 [details, diff] Possible fix for <bits/struct_stat.h> on N32 w/ -D_TIME_BITS=64

(In reply to Mike Gilbert from comment #22)
> I think /usr/include/bits/struct_stat.h needs to be changed.
>
> It looks like it was wired up for time64 in the O32 ABI (_MIPS_SIM ==
> _ABIO32) section at the top of the file, but the N32 ABI section at the
> bottom of the file was never updated.

Thanks for the testcase program! Confirmed on hardware that the issue only manifests itself on an N32 userland under glibc. An O32 userland and the N64 target in a multilib userland are unaffected.

I took a crack at a test patch for the N32 section of <bits/struct_stat.h>, and it seems to work (st_mode = 000081ed with -D_TIME_BITS=64), but I am not terribly confident that it is how the problem should be solved. Still, it might be enough to open a bug with upstream to get some eyes on things. I definitely think they just overlooked the N32 section entirely, because even the comments on the N32 versions of `struct stat` and `struct stat64` are out of date with their O32 counterparts.

I'll see about opening a bug tomorrow (Saturday my time) on glibc's bug tracker and post the testcase and fix and see what they say.
Created attachment 771710 [details, diff] Patch from libc-alpha that resolves bz#29069 Upstream has accepted the patch I submitted and, I think, expanded on it to fix this issue. I've attached the cherrypick that I was CC'ed on. Can we have this added to 2.34 and 2.35 when the next -rX is cut?
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=1e9f2d420dbf8be174aef307ea10bce4739a7e8f

commit 1e9f2d420dbf8be174aef307ea10bce4739a7e8f
Author: Andreas K. Hüttel <dilfridge@gentoo.org>
AuthorDate: 2022-04-20 21:42:48 +0000
Commit: Andreas K. Hüttel <dilfridge@gentoo.org>
CommitDate: 2022-04-20 22:25:31 +0000

sys-libs/glibc: 2.35 patchset bump

Mostly fixes issues with m68k and mips-n32

Bug: https://bugs.gentoo.org/837734
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29069
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29071
Package-Manager: Portage-3.0.30, Repoman-3.0.3
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>

sys-libs/glibc/Manifest | 1 +
sys-libs/glibc/glibc-2.35-r3.ebuild | 1593 +++++++++++++++++++++++++++++++++++
2 files changed, 1594 insertions(+)
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=410541fae25da99b5dfc45a386cc9d30db184840

commit 410541fae25da99b5dfc45a386cc9d30db184840
Author: Andreas K. Hüttel <dilfridge@gentoo.org>
AuthorDate: 2022-04-22 10:45:42 +0000
Commit: Andreas K. Hüttel <dilfridge@gentoo.org>
CommitDate: 2022-04-22 10:45:42 +0000

sys-libs/glibc: Rekeyword 2.35-r3

Bug: https://bugs.gentoo.org/837734
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29069
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29071
Package-Manager: Portage-3.0.30, Repoman-3.0.3
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>

sys-libs/glibc/glibc-2.35-r3.ebuild | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Fixed in 2.35-r3 and 2.34-r13 (which I'll rekeyword sometime today or tomorrow).