Summary: | sys-libs/glibc-2.34-r2: fails to install with debug info (ln: failed to create hard link '/var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib64/misc/glibc/getconf/getconf.debug': File exists) | ||
---|---|---|---|
Product: | Portage Development | Reporter: | Sam James <sam> |
Component: | Core | Assignee: | Portage team <dev-portage> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | mgorny, toolchain, whissi |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
See Also: | https://github.com/gentoo/portage/pull/770 | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Bug Depends on: | |||
Bug Blocks: | 820107 | ||
Attachments: |
build.log.bz2
ln: failed to create symbolic link debug log |
Description
Sam James
2021-11-15 06:58:07 UTC
Created attachment 751218 [details]
build.log.bz2
Whissi reported the same error with current tip of portage.git. I am unable to reproduce this. Please confirm the exact portage commit you are using. Also, please create a tarball with the contents of /var/tmp/portage/sys-libs/glibc-2.34-r2 and I will take a look. Portage at 704bcd1581e49432f363f0eda648d58411775d7f. However, after tree update today, glibc merged. This appears to be a regression in commit 84206c6200eb003314cf4f2d640bf73f04654012. With commit 84206c6200eb003314cf4f2d640bf73f04654012 getconf.debug is missing when I build glibc-2.34-r2 with splitdebug, as shown by this find command: > $ find /var/tmp/portage/sys-libs/glibc-2.34-r2/image -name 'getconf*' > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/bin/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib64/misc/glibc/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib/misc/glibc/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/misc/glibc/getconf If I revert commit 84206c6200eb003314cf4f2d640bf73f04654012, then getconf.debug appears as expected: > $ find /var/tmp/portage/sys-libs/glibc-2.34-r2/image -name 'getconf*' > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/bin/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib64/misc/glibc/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/bin/getconf.debug > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib/misc/glibc/getconf > /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/misc/glibc/getconf Apparently 84206c6200eb003314cf4f2d640bf73f04654012 triggers this getconf problem because of getconf hardlinks created in /usr/lib64/misc/glibc/getconf/, as shown by the identical inode numbers here:
> $ stat /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/bin/getconf
> File: /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/bin/getconf
> Size: 34952 Blocks: 72 IO Block: 4096 regular file
> Device: 40h/64d Inode: 376622 Links: 4
> Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-11-16 15:08:46.581141261 -0800
> Modify: 2021-11-16 15:08:45.465156116 -0800
> Change: 2021-11-16 15:08:46.498142366 -0800
> Birth: -
> $ stat /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf/POSIX_V6_LP64_OFF64
> File: /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf/POSIX_V6_LP64_OFF64
> Size: 34952 Blocks: 72 IO Block: 4096 regular file
> Device: 40h/64d Inode: 376622 Links: 4
> Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-11-16 15:08:46.581141261 -0800
> Modify: 2021-11-16 15:08:45.465156116 -0800
> Change: 2021-11-16 15:08:46.498142366 -0800
> Birth: -
> $ stat /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf/POSIX_V7_LP64_OFF64
> File: /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf/POSIX_V7_LP64_OFF64
> Size: 34952 Blocks: 72 IO Block: 4096 regular file
> Device: 40h/64d Inode: 376622 Links: 4
> Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-11-16 15:08:46.581141261 -0800
> Modify: 2021-11-16 15:08:45.465156116 -0800
> Change: 2021-11-16 15:08:46.498142366 -0800
> Birth: -
> $ stat /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf/XBS5_LP64_OFF64
> File: /var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib64/misc/glibc/getconf/XBS5_LP64_OFF64
> Size: 34952 Blocks: 72 IO Block: 4096 regular file
> Device: 40h/64d Inode: 376622 Links: 4
> Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-11-16 15:08:46.581141261 -0800
> Modify: 2021-11-16 15:08:45.465156116 -0800
> Change: 2021-11-16 15:08:46.498142366 -0800
> Birth: -
It looks like the dst_basename variable has the wrong value. (In reply to Zac Medico from comment #5) > This appears to be a regression in commit > 84206c6200eb003314cf4f2d640bf73f04654012. With commit > 84206c6200eb003314cf4f2d640bf73f04654012 getconf.debug is missing when I > build glibc-2.34-r2 with splitdebug, as shown by this find command: That commit changes the way the debug file names are generated. This is not "regression", but an intentional change. (In reply to Zac Medico from comment #7) > It looks like the dst_basename variable has the wrong value. Please elaborate. I would like to add some additional messaging to help further diagnose this. https://github.com/gentoo/portage/pull/770 Created attachment 751742 [details] ln: failed to create symbolic link (In reply to Mike Gilbert from comment #9) > (In reply to Zac Medico from comment #7) > > It looks like the dst_basename variable has the wrong value. > > Please elaborate. I assumed that the missing getconf.debug file was due to some bug in the dst_basename logic, but maybe I was wrong. I've just observed a different failure: strip: x86_64-pc-linux-gnu-strip --strip-unneeded -N __gentoo_check_ldflags__ -R .comment -R .GCC.command.line -R .note.gnu.gold-version /usr/lib/crti.o /usr/lib/crt1.o /usr/lib/crtn.o /usr/lib/libBrokenLocale.a /usr/lib/Mcrt1.o /usr/lib/Scrt1.o /lib/libBrokenLocale.so.1 /usr/lib/gcrt1.o /usr/lib/gconv/ISO8859-1.so ln: failed to create symbolic link '/var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/.build-id/a4/8463e4bad2e0b9367bb9913a9fb3c63f4ae897': File exists /usr/lib/gconv/ISO8859-9.so * ERROR: sys-libs/glibc-2.34-r2::gentoo failed: * (no error message) (In reply to Zac Medico from comment #11) This feels like some race condition when creating links with the same name. I was able to reproduce the problem by moving PORTAGE_TMPDIR from a tmpfs to a btrfs filesystem. Created attachment 751746 [details]
debug log
From my build log:
ln: failed to create hard link '/var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib64/misc/glibc/getconf/getconf.debug': File exists
ln: failed to create hard link '/var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib64/misc/glibc/getconf/getconf.debug': File exists
* ERROR: sys-libs/glibc-2.34-r2::gentoo failed:
* failed to create hard link '/var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib64/misc/glibc/getconf/getconf.debug'
* ERROR: sys-libs/glibc-2.34-r2::gentoo failed:
*
* failed to create hard link '/var/tmp/portage/sys-libs/glibc-2.34-r2/image/usr/lib/debug/usr/lib64/misc/glibc/getconf/getconf.debug'
It looks like estrip is trying to process these links simultaneously:
image/usr/lib64/misc/glibc/getconf/POSIX_V7_LP64_OFF64 22471252_debug
image/usr/lib64/misc/glibc/getconf/POSIX_V6_LP64_OFF64 22471252_debug
image/usr/lib64/misc/glibc/getconf/XBS5_LP64_OFF64 22471252_debug
These all share the same inode (22471252), and therefore they must share the same debug file (getconf.debug).
estrip attempts to create image/usr/lib64/misc/glibc/getconf/getconf.debug in 3 concurrent processes, and 2 of processes fail.
A simple solution would be to just make the ln calls non-fatal. Alternatively, we could serialize processing of these links somehow.
We could also make the whole thing serial and remove some ugly code in the process. Ok, so the locking in estrip was faulty. I pushed a fix to the linked PR. Please review/test. https://github.com/gentoo/portage/pull/770 (In reply to Michał Górny from comment #15) > We could also make the whole thing serial and remove some ugly code in the > process. I'm not sure what kind of performance impact that would have. (In reply to Mike Gilbert from comment #17) > (In reply to Michał Górny from comment #15) > > We could also make the whole thing serial and remove some ugly code in the > > process. > > I'm not sure what kind of performance impact that would have. Compared to compiling glibc or other large packages? I think we're trying to microoptimize here, at the cost of correctness. I timed estrip when installing sys-libs/glibc-2.34-r2 with 3 enabled abis. 1 thread: 1m51.715s 6 threads: 0m21.342s Without ccache, the entire glibc build takes around 19-20 minutes on the same system. With ccache, it takes around 8-9 minutes. My thoughts: multi-threaded estrip is nice if you happen to be staring at the terminal while it is running. Otherwise, the speedup is small compared to the overall build time. I created a second PR to drop the multijob code. We should decide which of these we want to merge. https://github.com/gentoo/portage/pull/771 I should also point out that parallel estrip used to have very high memory consumption -- before my hardware upgrade, it used to cause intensive swapping when stripping e.g. LLVM (back when it installed split libraries). (In reply to Mike Gilbert from comment #19) it isn't a small speedup when you're on slower systems with many cores (like often happens with mips). or you have a package that installs hundreds of objects (like the kernel). the speedup in parallel builds is significant, especially when comparing with hot caches. (In reply to Michał Górny from comment #21) so does building C++ projects (especially with -pipe). the answer isn't to force -j1. (In reply to Michał Górny from comment #21) We could introduce a new environment variable to override the job count for estrip to address that issue if necessary. (In reply to SpanKY from comment #22) Can you share some benchmarks for the examples you have given? i don't have access to the mips box anymore, but it was a system where clock speeds were <1 GHz, but had it had like 64 cores. so any single threaded process was painful, but any parallel process was fine. coreutils that installs ~100 ELFs would see dramatic speedups. even current mips/arm/arm64 systems are being built in such configurations. if you look back in bug history, you can see a time where kumba@ was fixing a lot of `emake -j1` bugs in the tree because they were painful for him. this can be important even for developers working on fast machines with incremental builds. on a very expensive machine i have from work (Intel Xeon Gold 6154 @ 3GHz which is like 36 total cores w/SMT for 72 hyperthreads), these are the times i have building the kernel (actually building it, not just installing source). repeated runs shows deviation of <5 seconds on the wallclock, so it's not that noisy. # parallel build+strip w/72 jobs real 1m39.737s user 23m0.377s sys 9m52.825s # parallel build w/72 jobs + single threaded strip real 2m35.281s user 23m12.628s sys 8m55.144s even these times look high ... it seems we've regressed in some of our optimizations as we used to be <1min. these cycle times are important for devs who are doing compile+edit+test. the strip helper is really begging to be rewritten in Python. it should have been long ago before its shell code got out of hand. (In reply to SpanKY from comment #25) if you look back in bug > you can see a time where kumba@ was fixing a lot of `emake -j1` > bugs in the tree because they were painful for him. yeah, this was a while back IIRC, but I do remember it. This kind of thing is true for arm64 boxes we still have now. Until very recently, the Packet hardware we had had 96 cores but had pretty slow single-core perf. Anything parallel zipped along. I have an older arm64 box (one of the first server-grade CPUs) which fits this profile. Some sparc hardware is like this as well. The bug has been closed via the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=fe3e50025fee15d59c4956a39b675faf50555c46 commit fe3e50025fee15d59c4956a39b675faf50555c46 Author: Mike Gilbert <floppym@gentoo.org> AuthorDate: 2021-11-17 18:23:07 +0000 Commit: Mike Gilbert <floppym@gentoo.org> CommitDate: 2021-11-19 16:16:07 +0000 estrip: fix lockfile handling The previous code would wait for the first link to an inode to be processed, and would then allow multiple processes to process duplicate links to the same inode. This behavior leads to a race condition in save_elf_debug. The new code ensures that each inode is accessed by a single process at a time. Closes: https://bugs.gentoo.org/823798 Signed-off-by: Mike Gilbert <floppym@gentoo.org> bin/estrip | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) Additionally, it has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=2cd58ac5a96541d0b1ba0440666114d48dc2c889 commit 2cd58ac5a96541d0b1ba0440666114d48dc2c889 Author: Mike Gilbert <floppym@gentoo.org> AuthorDate: 2021-11-16 19:58:42 +0000 Commit: Mike Gilbert <floppym@gentoo.org> CommitDate: 2021-11-19 16:16:07 +0000 estrip: add death messages and debug prints in save_elf_debug Bug: https://bugs.gentoo.org/823798 Signed-off-by: Mike Gilbert <floppym@gentoo.org> bin/estrip | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) |