Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 935689 - llvm-r1.eclass; llvm-r1_pkg_setup doesn't fix lld version
Summary: llvm-r1.eclass; llvm-r1_pkg_setup doesn't fix lld version
Status: RESOLVED INVALID
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Eclasses (show other bugs)
Hardware: All Linux
: Normal normal (vote)
Assignee: Michał Górny
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-07-07 03:33 UTC by Matt Jolly
Modified: 2024-07-07 14:15 UTC (History)
2 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
foo-1.2.3.ebuild (file_935689.txt,740 bytes, text/plain)
2024-07-07 03:33 UTC, Matt Jolly
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Matt Jolly gentoo-dev 2024-07-07 03:33:00 UTC
Created attachment 897243 [details]
foo-1.2.3.ebuild

In `llvm-r1_pkg_setup` we run the various "fix variable" functions, `llvm_fix_clang_version` and `llvm_fix_tool_path`, (which do not accept / are not provided with a SLOT), then we prefix the selected LLVM impl in PATH.

https://gitweb.gentoo.org/repo/gentoo.git/tree/eclass/llvm-r1.eclass?id=d6a2ae5aba768d7f7adf69defb04fcaa644820e1#n225

These functions "fix" the variables by executing whatever they are currently set to and evaluating the version string to extract the version, then either use that to construct a path (tool_path) or turn `clang` into (e.g.) `clang-17`.

This behaviour is defective; on systems with later LLVM/Clang impls in PATH (i.e. current ~arch) this is evaluated _before_ the PATH is fixed, resulting in behaviour such as the following (from the current Chromium porting attempt):

```
* CC   = x86_64-pc-linux-gnu-clang-18
* CXX  = x86_64-pc-linux-gnu-clang++-18
* CPP  = x86_64-pc-linux-gnu-clang++-18 -E
* AR   = /usr/lib/llvm/18/bin/llvm-ar
* NM   = /usr/lib/llvm/18/bin/llvm-nm
* LLVM_SLOT = 17
* PATH = /var/tmp/portage/www-client/chromium-127.0.6533.26-r1/temp/python3.13/bin:/usr/lib/ccache/bin:/usr/lib/portage/python3.12/ebuild-helpers/xattr:/usr/lib/portage/python3.12/ebuild-helpers:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/bin:/usr/lib/llvm/17/bin:/usr/lib/llvm/18/bin:/usr/lib/llvm/16/bin:/etc/eselect/wine/bin
```

This flows through to the build with the selection of mixed impl tools causing fun-to-debug issues and much pulling of hair.

A quick band-aid might be to move the PATH fixup before fixing of paths and binaries, but we'll probably still see the same behaviour if a user has (e.g.) specified a particular Clang version in an env (or system-wide, I guess).

If the intent of llvm-r1 and this function is to _force_ the selection of the selected impl/slot/USE_EXPAND there may be another way to go about it:

- Drop other impls from PATH instead of moving the selected impl above the others
- Validate that the selected binaries major version matches the selected LLVM_SLOT and `die` if something goes doesn't match.

As an aside, a nice-to-have would be a function to easily _force_ CC (and friends) to the appropriate values for the selected impl - this would save me a bunch of boilerplate in Chromium.

Minimal reproducer attached.
Comment 1 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2024-07-07 11:23:33 UTC
This is the whole purpose of this function.  If your system compiler is clang-18, what the eclass does is that it ensures that it *stays that way* while the ebuild in question uses LLVM 17, so that the whole system is compiled consistently.  The eclass was never meant to *override* the compiler.
Comment 2 Matt Jolly gentoo-dev 2024-07-07 11:41:10 UTC
I'm not sure I follow.

The eclass selects the llvm 17 slot when 18 is also installed. I get this, it's in the eclass docs and stable vs testing keywords.

When pkg_setup is using llvm-r1_pkg_setup, `llvm_fix_clang_version` is called and it _amends CC and friends_ if they're set, but using whatever is first in PATH.

_Then_ we prefix the eclass selected impl higher in path than the other llvm impls, which means that if CC was set you get "clang-18", but end up linking using ld.lld from slot 17 and _builds break_.

I don't think I'm being a pedant here but I can't see why this situation is desirable at all. I can't see why if the eclass is managing the llvm/clang that we would want other impls available.

This is a recipe for hard to debug issues; ask me how I know.
Comment 3 Matt Jolly gentoo-dev 2024-07-07 11:59:08 UTC
Just as an example here's one of the issues I encountered when chromium was accidentally building with mixed impl bits and pieces, and what prompted me to switch to llvm-r1 in the first place:

```
ld.lld: error: obj/third_party/protobuf-javascript/protoc-gen-js/js_generator.o: Unknown attribute kind (91) (Producer: 'LLVM18.1.8' Reader: 'LLVM 17.0.6')
```
Comment 4 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2024-07-07 12:18:07 UTC
Well, then obviously the bug is that we're not fixing the LLD version to match.

Look at it the other way around.  You're dealing with a bad package that supports only LLVM 15.  All your system is built using clang-18.  Should that one package suddenly be built with 15?  That's just a recipe for failure.
Comment 5 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2024-07-07 12:37:44 UTC
Actually, I can't reproduce.  FWICS clang picks ld.lld from its own directory, irrespectively of PATH.

$ export PATH=/usr/lib/llvm/18/bin:$PATH
$ ld.lld --version
LLD 18.1.8 (compatible with GNU linkers)
$ clang -x c - -Wl,--version </dev/null
LLD 18.1.8 (compatible with GNU linkers)
$ clang-19 -x c - -Wl,--version </dev/null
LLD 19.0.0, compatible with GNU linkers
Comment 6 Matt Jolly gentoo-dev 2024-07-07 13:59:25 UTC
Chromium, being a google product, has a NIH way of doing this; GN invokes a wrapper script that invokes the linker. The example I provided may have come from when we were still using llvm.eclass as well - it's been a frantic few days of ebuild tinkering - but is an example of what can happen when things get out of sync.

I wouldn't stress too much about it; prefixing the PATH will catch this case and I have a workaround in place for chromium where we call llvm-r1_pkg_setup twice.

I still feel like changing the order of operations in `llvm-r1_pkg_setup` achieves the eclass goal: If a user is providing 'CC="clang"' it's probably safe to assume that they don't particularly care about the version and that whichever impl the eclass has selected should be used to make that more specific. This also applies if we need to override a GCC for some reason.

If they're providing 'clang-19' we shouldn't touch it, but I can't see that prefixing the path for a selected impl of 17 would make 'clang-19' evaluate differently.

I'd expect that if a user set 'clang-19' and the USE_EXPAND was something else that we'd be in an odd situation anyway. I'm not sure what to do about this.

As I see it, the point of the USE_EXPAND is that users can easily manage the specific implementation selection system-wide or per-package, or that they can just ignore it and let profiles and ebuilds select appropriate impls to build their software.

> Look at it the other way around.  You're dealing with a bad package that
> supports only LLVM 15.  All your system is built using clang-18.
> Should that one package suddenly be built with 15?

Why wouldn't that package continue being built against its supported LLVM implementation as long as it's in-tree? We provide users with slotted LLVM after all. As a user I would expect the maintainer of a package to set appropriate LLVM_COMPAT for what the package actually supports. If the package is unable to be updated to support an in-tree LLVM (and it can't build with GCC for some reason) we'd have to remove it from the tree, like we would with any stale package.

> That's just a recipe for failure.

Why? I'm sure there's some disconnect here.

If a user has specified a non-selected impl globally and we remove it from PATH that would cause issues, but I feel like that already conflicts with the concept of LLVM_COMPAT and eclasses making sure we present a sane and supported environment.

Maybe we can do better?
Comment 7 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2024-07-07 14:15:44 UTC
Well, if Google is doing stupid stuff, then Google needs to fix their stupid stuff.  Clang behaves correctly here, so it's up to build system and/or ebuild to select the right linker.

The eclass was never meant to select *the compiler*, nor ebuilds were ever really meant to do that.  Just like there's no magical eclass to force a different version of GCC, you are just supposed to respect CC/CXX and do not try to override it because Google thinks they know better.

While I'm not aware of any risk right now (and I think people are paying more attention to this these days), the reason is simple: different versions may produce ABI-incompatible code (think of the days back when GCC produced different ABI based on -std=).  Forcing a different compiler version would produce code that's incompatible with the linked libraries.

Or more realistically these days, static libraries with LLVM bytecode — if you force an older compiler than that used to produce the bytecode, things will break.