Summary: | www-client/firefox-123.0[pgo] fails to build during/after linking stage | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Johannes Penßel <johannesp> |
Component: | Current packages | Assignee: | Mozilla Gentoo Team <mozilla> |
Status: | CONFIRMED --- | ||
Severity: | normal | CC: | alex, anders.gentoo, andreas.thalhammer, b4b1, david.goudou, dememax, deyaa.saifeldin, Etrnls, johannesp, kero7kero, leohdz172, lizhuohua1994, llvm, marcoep, me, mmw, prometheanfire, redblade7, rodolfo.boer, sam, sir_tuam, th-gen, ulf.norberg, vovan, voyageur, w12101111 |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
See Also: |
https://github.com/llvm/llvm-project/issues/84062 https://github.com/canonical/firefox-snap/pull/44 https://github.com/llvm/llvm-project/issues/87894 |
||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: |
full build log (zstd compressed)
another emerge --info www-client/firefox
Build logs (xz compressed)
www-client:firefox-123.0:20240221-234927.log
emerge --info
www-client:firefox-123.0:20240222-103851.log
www-client:firefox-123.0:20240222-115537.log
emerge info on the system where the build works
emerge info on the system where the build breaks
emerge--info--firefox
compressed build log
Description
Johannes Penßel
2024-02-20 19:23:42 UTC
Builds fine with USE=-pgo.

[1003258.958377] llvm-worker-6[154725]: segfault at 60 ip 00007f2c2f99409b sp 00007f2bda9fa080 error 4 in libLLVM-17.so[7f2c2deb8000+3f31000] likely on CPU 15 (core 7, socket 0)
[1003258.958403] Code: c6 48 89 85 f8 f6 ff ff 48 8b 87 28 09 00 00 48 8d bd 58 f7 ff ff 48 89 85 50 f7 ff ff c5 f9 7f 85 40 f7 ff ff e8 65 16 b5 fe <49> 8b 47 60 48 89 85 00 f7 ff ff 48 85 c0 74 35 48 8d bd 40 f7 ff

Encountered this here as well, building with: X clang dbus gmp-autoupdate hwaccel jack jumbo-build libproxy lto pgo pulseaudio system-av1 system-harfbuzz system-icu system-jpeg system-libevent system-libvpx system-png system-webp wayland

(In reply to Johannes Penßel from comment #0)
> LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,-z,pack-relative-relocs
> -Wl,--build-id=sha1 -Wl,-z,nopack-relative-relocs
> -Wl,--compress-debug-sections=zlib
> -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags"

Try with sane LDFLAGS, I'm pretty sure "-z,nopack-relative-relocs" doesn't work.

(In reply to Joonas Niilola from comment #4)
> (In reply to Johannes Penßel from comment #0)
> > LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,-z,pack-relative-relocs
> > -Wl,--build-id=sha1 -Wl,-z,nopack-relative-relocs
> > -Wl,--compress-debug-sections=zlib
> > -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags"
>
> Try with sane LDFLAGS, I'm pretty sure "-z,nopack-relative-relocs" doesn't
> work.

While it looks kinda weird, -z,nopack-relative-relocs is a valid linker flag (see the ld man page). I've been using it for over a year with firefox only, because global -z,pack-relative-relocs causes its build to fail.

It is indeed a real flag, although nowadays at least, FF builds fine for me with DT_RELR.

(In reply to Johannes Penßel from comment #5)
>
> While it looks kinda weird, -z,nopack-relative-relocs is a valid linker
> flag. (see ld man page) I've been using it for over a year with firefox only
> because global -z,pack-relative-relocs causes its build to fail.

I'm aware, but it has broken firefox builds before. As I managed to build 123.0 fine with +pgo (both +clang and -clang), I suspect that could be the difference.

Created attachment 885587 [details]
another emerge --info www-client/firefox
Hello, I'm also getting the same linking errors with defaults(?):
LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,--compress-debug-sections=zlib -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags"
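For reference, the per-package linker flag discussed earlier in this thread (applying -z,nopack-relative-relocs to Firefox only) is usually done through Portage's package.env mechanism. The following is only a minimal sketch: the file name `firefox-ldflags.conf` is a made-up example, and whether this flag helps at all is exactly what is being debated here.

```shell
# Hypothetical file: /etc/portage/env/firefox-ldflags.conf
# (the file name is an assumption; any name can be used)
# Appends one linker flag to whatever LDFLAGS make.conf already set.
LDFLAGS="${LDFLAGS} -Wl,-z,nopack-relative-relocs"

# A matching line in /etc/portage/package.env would then be:
#   www-client/firefox firefox-ldflags.conf
```

Removing the package.env line again restores the global LDFLAGS for the package.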
(In reply to Anders Larsson from comment #8)
> Created attachment 885587 [details]
> another emerge --info www-client/firefox
>
> Hello, I'm also getting the same linking errors with defaults(?):
> LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,--compress-debug-sections=zlib
> -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags"

Full build.log please. Also what are your mesa use flags? Post
emerge -pv mesa --nodeps

Created attachment 885595 [details]
Build logs (xz compressed)
[ebuild R ] media-libs/mesa-24.0.1::gentoo USE="X gles2 (opengl) proprietary-codecs vdpau wayland zstd -d3d9 -debug -gles1 -llvm -lm-sensors -opencl -osmesa (-selinux) -test -unwind -vaapi -valgrind -vulkan -vulkan-overlay -xa (-zink)" ABI_X86="32 (64) (-x32)" CPU_FLAGS_X86="sse2" LLVM_SLOT="17 -15 -16" VIDEO_CARDS="-d3d12 (-freedreno) -intel -lavapipe (-lima) -nouveau (-panfrost) -r300 -r600 -radeon -radeonsi (-v3d) (-vc4) -virgl (-vivante) -vmware" 19,484 KiB
(In reply to Joonas Niilola from comment #7)
> (In reply to Johannes Penßel from comment #5)
> >
> > While it looks kinda weird, -z,nopack-relative-relocs is a valid linker
> > flag. (see ld man page) I've been using it for over a year with firefox only
> > because global -z,pack-relative-relocs causes its build to fail.
>
> I'm aware, but it has broken firefox builds before. As I managed to build
> 123.0 fine with +pgo (both +clang and -clang), I suspect that could be the
> difference.

With -z,nopack-relative-relocs removed, the PGO build still fails, unfortunately. (With USE=-pgo, it works just fine now. Thanks for the info, Sam!)

Looking through the complete list of closed tickets for FF123, this one looks like a potential culprit to me:
https://bugzilla.mozilla.org/show_bug.cgi?id=1839832

Apparently, Firefox now uses "temporal instrumentation" (-pgo-temporal-instrumentation flag) for PGO if supported by the compiler. Maybe that is causing issues somehow.

(In reply to Anders Larsson from comment #10)
> Created attachment 885595 [details]
> Build logs (xz compressed)
>

For you, always try without ccache/sccache if a build fails. Although since you're getting the same error, I doubt it's going to fix this. But just in general.

(In reply to Johannes Penßel from comment #11)
>
> Looking through the complete list of closed tickets for FF123, this one
> looks like a potential culprit to me:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1839832
>
> Apparently, Firefox now uses "temporal instrumentation"
> (-pgo-temporal-instrumentation flag) for PGO if supported by the compiler.
> Maybe that is causing issues somehow.

Hmm, there are some regressions listed which have patches, but since they were able to build 123.0 upstream and I was able to build it, there has to be something different "locally" that breaks it.

Created attachment 885674 [details]
www-client:firefox-123.0:20240221-234927.log
I got the same error with those LDFLAGS:
LDFLAGS="-Wl,-O2 -Wl,--as-needed -Wl,--sort-common -Wl,--hash-style=both -Wl,-z,relro -Wl,-z,now -fstack-protector-strong -fno-plt -fexceptions -fcf-protection"
Full build log attached.
I didn't have any problems with previous versions, with clang. I recompiled 123.0 with USE="-clang", which worked.
Created attachment 885678 [details]
emerge --info
The useflags are (not all in make.conf, some are by package):
[ebuild R ~] www-client/firefox-123.0:rapid::gentoo USE="X clang* dbus gmp-autoupdate hardened hwaccel jumbo-build libproxy lto openh264 pgo pulseaudio sndio system-harfbuzz system-icu system-jpeg system-libevent system-libvpx system-png system-webp wayland wifi -debug -eme-free -geckodriver -jack (-selinux) (-system-av1) (-system-python-libs) -telemetry (-valgrind)" L10N="de -ach -af -an -ar -ast -az -be -bg -bn -br -bs -ca -ca-valencia -cak -cs -cy -da -dsb -el -en-CA -en-GB -eo -es-AR -es-CL -es-ES -es-MX -et -eu -fa -ff -fi -fr -fur -fy -ga -gd -gl -gn -gu -he -hi -hr -hsb -hu -hy -ia -id -is -it -ja -ka -kab -kk -km -kn -ko -lij -lt -lv -mk -mr -ms -my -nb -ne -nl -nn -oc -pa -pl -pt-BR -pt-PT -rm -ro -ru -sc -sco -si -sk -sl -son -sq -sr -sv -szl -ta -te -th -tl -tr -trs -uk -ur -uz -vi -xh -zh-CN -zh-TW" LLVM_SLOT="17 -16" 0 KiB
(NOTE: clang is marked with * as a changed USE flag, because the package is installed with USE="-clang" for now...)
I wonder if it would work with llvm-16 instead when using +clang?

(In reply to Joonas Niilola from comment #15)
> I wonder if it worked with llvm-16 instead when using +clang?

... which would lead to also downgrading rust. I checked, and decided to go with -clang.

# LLVM_SLOT="16" emerge -pv www-client/firefox

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 6.24 s (backtrack: 0/20).

[ebuild UD ] dev-lang/rust-1.71.1:stable/1.71::gentoo [1.74.1:stable/1.74::gentoo] USE="lto verify-sig (-big-endian) -clippy -debug -dist -doc (-llvm-libunwind) (-miri) (-nightly) (-parallel-compiler) -profiler -rust-analyzer -rust-src -rustfmt (-system-bootstrap) (-system-llvm) -test -wasm" ABI_X86="(64) -32 (-x32)" CPU_FLAGS_X86="sse2" LLVM_TARGETS="AMDGPU BPF (X86) -AArch64 -ARM -AVR -Hexagon -Lanai -LoongArch -MSP430 -Mips -NVPTX -PowerPC -RISCV -Sparc -SystemZ -VE -WebAssembly -XCore (-ARC%) (-CSKY%) (-DirectX%) (-M68k%) (-SPIRV%) (-Xtensa%)" 308.049 KiB
[ebuild UD ] virtual/rust-1.71.1-r1:0/llvm-16::gentoo [1.74.1:0/llvm-17::gentoo] USE="-rustfmt" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild R ~] www-client/firefox-123.0:rapid::gentoo USE="X clang* dbus gmp-autoupdate hardened hwaccel jumbo-build libproxy lto openh264 pgo pulseaudio sndio system-harfbuzz system-icu system-jpeg system-libevent system-libvpx system-png system-webp wayland wifi -debug -eme-free -geckodriver -jack (-selinux) (-system-av1) (-system-python-libs) -telemetry (-valgrind)" L10N="de -ach -af -an -ar -ast -az -be -bg -bn -br -bs -ca -ca-valencia -cak -cs -cy -da -dsb -el -en-CA -en-GB -eo -es-AR -es-CL -es-ES -es-MX -et -eu -fa -ff -fi -fr -fur -fy -ga -gd -gl -gn -gu -he -hi -hr -hsb -hu -hy -ia -id -is -it -ja -ka -kab -kk -km -kn -ko -lij -lt -lv -mk -mr -ms -my -nb -ne -nl -nn -oc -pa -pl -pt-BR -pt-PT -rm -ro -ru -sc -sco -si -sk -sl -son -sq -sr -sv -szl -ta -te -th -tl -tr -trs -uk -ur -uz -vi -xh -zh-CN -zh-TW" LLVM_SLOT="16* -17*" 0 KiB
Total: 3 packages (2 downgrades, 1 reinstall), Size of downloads: 308.049 KiB

Do you want me to try it?

Looks like Fedora is also skipping PGO for this release because of build failures:
https://src.fedoraproject.org/rpms/firefox/c/6411e9e37788448993f6be62400bb6a27f9c94cc?branch=rawhide

(In reply to Joonas Niilola from comment #7)
> I'm aware, but it has broken firefox builds before. As I managed to build
> 123.0 fine with +pgo (both +clang and -clang), I suspect that could be the
> difference.

Would you mind posting your $ emerge --info www-client/firefox, please? I wonder if I can replicate your successful build if I try to match your settings.

(In reply to Johannes Penßel from comment #17)
> Looks like Fedora is also skipping PGO for this release because of build
> failures:
> https://src.fedoraproject.org/rpms/firefox/c/
> 6411e9e37788448993f6be62400bb6a27f9c94cc?branch=rawhide

I'm pretty sure the Fedora build failed during the profiling, not during linking. Canonical also had an issue with PGO and they forced software rendering to be used. They tracked this issue to an older mesa that they're currently shipping, so it could be related to Fedora's issues as well. We don't have that mesa even present in our repo anymore.

>
> Would you mind posting your $ emerge --info www-client/firefox, please? I
> wonder if I can replicate your successful build if I try to match your
> settings.

It's done with the most basic settings and rust-bin. Latest ~unstable packages, so llvm-17 and clang-17 were used.
https://github.com/juippis/incus-gentoo-github-pullrequest-tester/tree/master/container/etc/portage

So it could be mesa, it could be upstream pgo-updates (and the new flag), it could be unique system-related stuff, it could be the ffvpx-av1 changes since the linking dies immediately after libmozavcodec... and I wonder if it could be related to our wrong addpredict too. I'm just throwing ideas out because it worked for me and therefore I can't test a "fix".
However, it would be great to eliminate one thing:
https://gitweb.gentoo.org/repo/gentoo.git/tree/www-client/firefox/firefox-123.0.ebuild#n552

add

addpredict /dev

somewhere inside the "if use pgo ; then" block to test whether colon-delimited addpredict is breaking stuff.

(In reply to Joonas Niilola from comment #18)
>
> I'm pretty sure Fedora build failed during the profiling, not during
> linking. Canonical also had an issue with PGO and they forced software
> rendering to be used. They tracked this issue to an older mesa that they're
> currently shipping, so it could be related to Fedora's issues as well. We
> don't have that mesa even present in our repo anymore.
>
>

Right and here's the link
https://raw.githubusercontent.com/canonical/firefox-snap/stable/patches/pgo-with-software-webrender.patch

Oh hmm one more idea: could it be related to the new llvm-r1.eclass? If someone wants to diff 122.0.1 <-> 123.0 and revert llvm changes, using the old llvm.eclass instead, it could prove helpful too.

Created attachment 885708 [details]
www-client:firefox-123.0:20240222-103851.log

(In reply to Joonas Niilola from comment #18)
> add
>
> addpredict /dev
>
> somewhere inside the "if use pgo ; then" block to test whether colon-delimited
> addpredict is breaking stuff.

Did that (and ONLY that), didn't work. Build log attached, looks the same to me.

Created attachment 885713 [details]
www-client:firefox-123.0:20240222-115537.log

I added pgo-with-software-webrender.patch as a user patch (according to https://wiki.gentoo.org/wiki//etc/portage/patches) and recompiled.

Funny enough, this worked:
 * Applying user patches from /etc/portage/patches ...
 * Applying pgo-with-software-webrender.patch ... [ ok ]
 * User patches applied.

Build log of successful build attached.

What does this mean? Is it due to media-libs/mesa? I'm currently using stable version 23.3.5...

# emerge -pv media-libs/mesa

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 2.54 s (backtrack: 0/20).

[ebuild R ] media-libs/mesa-23.3.5::gentoo USE="X d3d9 gles2 llvm lm-sensors osmesa proprietary-codecs vaapi vdpau vulkan wayland xa zstd -debug -gles1 -opencl (-selinux) -test -unwind -valgrind -vulkan-overlay (-zink)" ABI_X86="(64) -32 (-x32)" CPU_FLAGS_X86="sse2" VIDEO_CARDS="d3d12 nouveau radeon radeonsi (-freedreno) -intel -lavapipe (-lima) (-panfrost) -r300 -r600 (-v3d) (-vc4) -virgl (-vivante) -vmware" 0 KiB

Total: 1 package (1 reinstall), Size of downloads: 0 KiB

Why does it work without the patch for some, and what exactly does the patch fix? My build -- with the patch -- is no longer using hardware rendering, yes? So wouldn't it be preferred to use -clang with hardware rendering instead?

(As an important side note: the build log says that I have nouveau installed -- but it is not in use since my laptop has hybrid graphics and amdgpu is in use. Nouveau is installed, yes, but it is also disabled: the dedicated Nvidia graphics card is dormant...)

Adding addpredict /dev didn't work for me either, I'm afraid. Same thing with reverting to llvm.eclass. Currently building with the software webrender patch.

Can confirm, the patch does the trick! My mesa:
media-libs/mesa-24.0.1 USE="X gles2 llvm lm-sensors opencl (opengl) proprietary-codecs unwind vulkan wayland zstd -d3d9 -debug -gles1 -osmesa (-selinux) -test -vaapi -valgrind -vdpau -vulkan-overlay -xa (-zink)" CPU_FLAGS_X86="sse2" LLVM_SLOT="17 -15 -16" VIDEO_CARDS="intel -d3d12 (-freedreno) -lavapipe (-lima) -nouveau (-panfrost) -r300 -r600 -radeon -radeonsi (-v3d) (-vc4) -virgl (-vivante) -vmware"

Using that software webrender patch also allows compilation with mesa[zink] (from bug #923054)

I compiled with the patch and firefox now segfaults.

To clarify, it segfaults on start, even on firefox --help.
ok that was unrelated to this patch

I have 2 machines,
- one on ~amd64 where the build breaks
- one on stable with selected packages on ~ where the build works

I could use them to do some testing.

Created attachment 887474 [details]
emerge info on the system where the build works
Created attachment 887475 [details]
emerge info on the system where the build breaks
I got this file building with USE=-pgo. If I leave the pgo use flag on the build breaks with the llvm error as others reported.
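The USE=-pgo workaround reported above can be made persistent for this one package through package.use. The snippet below is a minimal sketch under a few assumptions: package.use is a directory (on some systems it is a single file), the function name `disable_firefox_pgo` and the file name `firefox-pgo` are made up for illustration, and the optional argument is an alternate root used only so the function can be tried outside the live system.

```shell
# Sketch: persistently disable PGO for www-client/firefox only.
# The function name and the file name "firefox-pgo" are hypothetical.
disable_firefox_pgo() {
    root="${1:-}"    # empty = live system (requires root privileges)
    dir="${root}/etc/portage/package.use"
    mkdir -p "$dir" || return 1
    # One package.use entry: package atom followed by the USE change
    printf '%s\n' 'www-client/firefox -pgo' > "${dir}/firefox-pgo"
}
```

Calling `disable_firefox_pgo` with no argument writes to /etc/portage/package.use; a subsequent `emerge -pv www-client/firefox` should then list pgo as disabled.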
upstream issue: https://github.com/llvm/llvm-project/issues/84062

Again, 125.0.1 with +pgo worked fine for me with -clang and +clang. I guess people still hitting this can compile with gcc (-clang use flag) or use the patch:
https://raw.githubusercontent.com/canonical/firefox-snap/stable/patches/pgo-with-software-webrender.patch
until upstream llvm is fixed?

Note that llvm-18 can't be enabled on Firefox before a matching rust is bumped. rust-1.78 should be built with llvm-18.

FYI: there is a discussion in https://github.com/llvm/llvm-project/issues/87894
My citation from there:
> In fact, seeing multilib in this output, I've got two hosts with Gentoo, and on one host it compiles, but on an other - it fails.
> The difference is in gentoo profiles:
>
> * default/linux/amd64/23.0/no-multilib - it compiles and works
> * default/linux/amd64/17.1/desktop - it fails.
>
> So, maybe the key of the problem is this multilib/no- problem.

Created attachment 891666 [details]
emerge--info--firefox
Created attachment 891667 [details]
compressed build log
Hello ... The ubuntu patch didn't work for the 12.0.2 build. I've added some extra config options that I think shouldn't add to the problem, at least not reproduce the same exact result, namely "--enable-jemalloc --enable-replace-malloc --enable-tests --enable-rust-tests".

I started hitting this in the last few days but I've no idea why yet.

JFYI: On https://github.com/llvm/llvm-project/issues/87894 alexey-bataev confirmed the problem in the middle-end of clang, saying:
> Finally found the reproducer and was able to reproduce the crash with 17.0.4 and 18.1.5.
> But the crash does not reproduce with the trunk, looks like the issue is fixed already and it will be fixed in llvm 19.0

There was a question from torvic9 about backporting the fix, alexey-bataev replied:
> I tried to identify the fix, but there are too many potential candidates with too many dependencies.
> I can try to prepare a fix but it won't be a backport, it will be completely separate patch
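Several comments in this thread report success after adding pgo-with-software-webrender.patch as a Portage user patch, while one later report says it did not help. For anyone who wants to try it, here is a minimal sketch of the installation step. It assumes the patch has already been downloaded to a local file; the function name `install_firefox_user_patch` and the optional alternate-root argument are hypothetical and exist only for illustration.

```shell
# Sketch: install a local patch file as a Portage user patch for Firefox.
# The function name and the second (alternate root) argument are hypothetical.
install_firefox_user_patch() {
    patch_file="$1"
    root="${2:-}"    # empty = live system (requires root privileges)
    # User patches are looked up under /etc/portage/patches/<category>/<package>
    dest="${root}/etc/portage/patches/www-client/firefox"
    if [ ! -f "$patch_file" ]; then
        echo "no such patch file: $patch_file" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    cp "$patch_file" "$dest/" || return 1
    echo "installed $(basename "$patch_file") into $dest"
}
```

Usage would look like `install_firefox_user_patch ./pgo-with-software-webrender.patch`, followed by a normal rebuild; the build log should then show the "Applying user patches" lines quoted earlier in this thread.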