sys-devel/llvm-roc is a simple installation of the AMD-patched llvm-project, which does not fit into Gentoo's llvm slotting logic. That causes problems: when a program depends on hip and also (indirectly) links to the system llvm, it breaks at runtime (runtime error: Option xxx registered more than once!). We need to redesign sys-devel/llvm-roc, probably making it a slot of llvm, so that programs won't link to two different llvm libraries.

This bug blocks https://bugs.gentoo.org/693200

Reproducible: Always
Currently media-gfx/blender-3.2.0 is deeply affected (https://bugs.gentoo.org/693200#c9). In the future there may be more packages that use both llvm and ROCm.

As I see it, there are two ways:

1. Use the existing llvm-14, maybe with patches picked from ROCm's llvm. This may increase the maintenance burden for the hip packages that follow, since upstream llvm is not guaranteed to work.
2. Use ROCm's llvm, but make it another slot. Packages that depend on both hip and llvm must use only this rocm slot. This means that once a user decides to install rocm packages like blender, they have to rebuild mesa with the rocm slot.

I think Debian has provided useful information on using vanilla llvm instead of the ROCm-patched llvm:
- https://github.com/ROCm-Developer-Tools/HIP/issues/2449
- https://lists.debian.org/debian-ai/2022/05/msg00000.html
- https://lists.debian.org/debian-ai/2022/03/msg00035.html
- https://lists.debian.org/debian-ai/2022/03/msg00011.html

But I don't see a clear picture of whether they have decided to use upstream llvm (hip and comgr have not made it into experimental or sid yet).
> 1. Use existing llvm-14, maybe with patches picked from ROCm's llvm. This
> may cause maintenance overburden for following hip packages, since upstream
> llvm are not guaranteed to work.
> 2. Use ROCm's llvm, but make it another slot. Packages that both depend on
> hip and llvm must only use this rocm slot. This means that once user decide
> to install rocm packages like blender, they have to rebuild mesa with rocm
> slot.

Personally I prefer the second approach, because ROCm's llvm-project still has many changes that are not upstreamed, especially the OpenMP part (which may affect sci-libs/rocsparse, see https://github.com/gentoo/gentoo/pull/25318).
Fedora is also on its way to packaging ROCm with upstream llvm. They have already packaged [rocm-comgr](https://src.fedoraproject.org/rpms/rocm-compilersupport).
Dear Michał Górny,

I am exploring the possibility of dropping sys-devel/llvm-roc and using standard llvm and clang as the backend of ROCm.

The first issue I encountered is that Gentoo's llvm has `BUILD_SHARED_LIBS=OFF`, which causes components to be built into libLLVM.so rather than as standalone libLLVM<component>.so libraries. Without standalone components, dev-libs/rocm-device-libs fails to build:

```
ld: cannot find -lLLVMCore
ld: cannot find -lLLVMBitReader
ld: cannot find -lLLVMBitWriter
```

I tried setting `BUILD_SHARED_LIBS=ON` and `LLVM_LINK_LLVM_DYLIB=OFF`, but get_distribution_components fails, since there are now lots of standalone components.

Of course I can patch rocm-device-libs so that it links libLLVM.so rather than the non-existent components, but that increases the maintenance burden of the ROCm packages.

So I wonder: is there a reason Gentoo sets `BUILD_SHARED_LIBS=OFF`?

Thanks!

Best regards,
Yiyang Wu
Upstream strongly discourages using BUILD_SHARED_LIBS, and recommends using the dylib instead. The former is only meant to be used in specific development scenarios, mostly to reduce the cost of recompiling. Both llvm-config and standard LLVM cmake macros should be perfectly happy with the dylib, and able to supply the right libraries when used correctly. I don't know what ROCm does wrong but there's certainly a lot of other packages that get this right, so it must be fixable upstream.
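To illustrate the point about llvm-config: a minimal sketch of asking llvm-config for the link flags instead of hard-coding per-component library names. The helper name `llvm_libs_for` and the `LLVM_CONFIG` override are assumptions made up for this example; only the `llvm-config --libs <components>` call is the standard tool interface. On a dylib build this would be expected to print a single `-lLLVM-<version>` style flag rather than `-lLLVMCore -lLLVMBitReader -lLLVMBitWriter`.

```shell
#!/bin/sh
# Hypothetical helper: print linker flags for the given LLVM components.
# LLVM_CONFIG may point at a slotted llvm-config; it defaults to the one in PATH.
llvm_libs_for() {
    cfg=${LLVM_CONFIG:-llvm-config}
    if ! command -v "$cfg" >/dev/null 2>&1; then
        echo "llvm_libs_for: $cfg not found" >&2
        return 1
    fi
    # llvm-config decides by itself whether to answer with the dylib or with
    # individual component libraries, depending on how LLVM was built.
    "$cfg" --libs "$@"
}

# The three components that rocm-device-libs failed to find:
llvm_libs_for core bitreader bitwriter || echo "skipped: no llvm-config available"
```

A build script that consumes this output never needs to know whether libLLVMCore.so exists as a separate file.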
(In reply to Michał Górny from comment #5) > Upstream strongly discourages using BUILD_SHARED_LIBS, and recommends using > the dylib instead. The former is only meant to be used in specific > development scenarios, mostly to reduce the cost of recompiling. > > Both llvm-config and standard LLVM cmake macros should be perfectly happy > with the dylib, and able to supply the right libraries when used correctly. > I don't know what ROCm does wrong but there's certainly a lot of other > packages that get this right, so it must be fixable upstream. OK I'll patch ROCm and consult the ROCm upstream.
(In reply to perestoronin from comment https://bugs.gentoo.org/693200#c29)
> I have got new error while try to compile sci-libs/miopen v5.1.3:
>
> CMake Error at CMakeLists.txt:309 (find_library):
> Could not find LIBMLIRMIOPEN using the following names: MLIRMIOpen
>
> Can you fix this error ?

At first glance, I think just adding the cmake option `-DMIOPEN_USE_MLIR=OFF` solves the issue. MIOpen-5.0.2 turns this option off by default, while in MIOpen-5.1.3 there is complicated logic between the default values of the options: BUILD_SHARED_LIBS defaults to ON, so MIOPEN_USE_MLIR_DEFAULT=ON.

If we want to use LIBMLIRMIOPEN, we need to install the AMD-modified mlir project in llvm (https://github.com/ROCmSoftwarePlatform/llvm-project-mlir, branched from llvm-project/mlir in early 2021).
(In reply to Yiyang Wu from comment #7) > At first glance, I think just add a cmake configuration > `-DMIOPEN_USE_MLIR=OFF` solve the issue. MIOpen-5.0.2 default turns this > option off by default. While MIOpen-5.1.3, there is a complicated logic > between the default value of each options -- BUILD_SHARED_LIBS default is > ON, so MIOPEN_USE_MLIR_DEFAULT=ON. > > If we want to use the LIBMLIRMIOPEN, we need to install the AMD modified > mlir project in llvm > (https://github.com/ROCmSoftwarePlatform/llvm-project-mlir, branched from > llvm-project/mlir in early 2021). With `-DMIOPEN_USE_MLIR=OFF` got new error: CMake Error at CMakeLists.txt:300 (message): extractkernel not found
(In reply to perestoronin from comment #8)
> With `-DMIOPEN_USE_MLIR=OFF` got new error:
>
> CMake Error at CMakeLists.txt:300 (message):
> extractkernel not found

That's because cmake cannot find clang-offload-bundler. Line 45 has to be changed to the correct path, obtained by calling $(get_llvm_prefix ${LLVM_MAX_SLOT}) provided by llvm.eclass.

Also, you need to append two cxxflags, `--rocm-path="${EPREFIX}"/usr` and `--hip-device-lib-path="${EPREFIX}"/usr/lib/amdgcn/bitcode`, to compile.
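A minimal sketch of how these values could be assembled, written as stand-alone shell so it runs outside Portage. The function names `offload_bundler_path` and `hip_extra_cxxflags` are hypothetical; in a real ebuild the prefix would come from `get_llvm_prefix` (llvm.eclass) and the flags would be passed to `append-cxxflags` (flag-o-matic.eclass). The prefix /usr/lib/llvm/14 is only an example value.

```shell
#!/bin/sh
# Hypothetical stand-alone helpers; in an ebuild the LLVM prefix would come
# from $(get_llvm_prefix "${LLVM_MAX_SLOT}") instead of being passed in.

# Path of clang-offload-bundler below a given LLVM prefix.
offload_bundler_path() {
    printf '%s/bin/clang-offload-bundler\n' "$1"
}

# The two extra flags, relative to a given EPREFIX (empty on a normal system).
hip_extra_cxxflags() {
    printf '%s %s\n' \
        "--rocm-path=${1}/usr" \
        "--hip-device-lib-path=${1}/usr/lib/amdgcn/bitcode"
}

offload_bundler_path /usr/lib/llvm/14
hip_extra_cxxflags ""
```

Keeping the prefix as a parameter means the same logic works for a Gentoo Prefix installation, where EPREFIX is not empty.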
I was wondering how this effort was progressing and if it's been made any easier with 5.2.0?
(In reply to Mike Lothian from comment #10)
> I was wondering how this effort was progressing and if it's been made any
> easier with 5.2.0?

It is progressing. Actually, dev-util/hip-5.1.3 is done. The next step is the sci-libs packages.

5.2.0 actually makes it more difficult: llvm/clang-14 is not enough, because ROCm-5.2.0 supports new architectures and a new ABI (code object) version that clang-14 lacks (we may have to wait for clang-15).

Is there any killer feature in 5.2.0 compared to 5.1.3? If not, then I think 5.1.3 is a good enough version to build on llvm/clang-14.
I tried to compile dev-libs/rccl-5.2.0, and I got this error:

ninja -v -j12 -l24
[1/55] /usr/bin/hipcc -DENABLE_COLLTRACE -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -Drccl_EXPORTS -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0_build/include -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0_build/include/rccl -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src/include -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src/collectives -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src/collectives/device -I//hsa/include -I//rocm_smi/include -O2 -pipe -march=znver2 -fPIC -fvisibility=hidden -fgpu-rdc -parallel-jobs=8 -Wno-format-nonliteral -x hip --hip-device-lib-path=/usr/lib64/amdgcn/bitcode --offload-arch=gfx900 -std=c++14 -MD -MT CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o -MF CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o.d -o CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o -c /var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0_build/src/collectives/device/functions.cpp
FAILED: CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o
/usr/bin/hipcc -DENABLE_COLLTRACE -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -Drccl_EXPORTS -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0_build/include -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0_build/include/rccl -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src/include -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src/collectives -I/var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0/src/collectives/device -I//hsa/include -I//rocm_smi/include -O2 -pipe -march=znver2 -fPIC -fvisibility=hidden -fgpu-rdc -parallel-jobs=8 -Wno-format-nonliteral -x hip --hip-device-lib-path=/usr/lib64/amdgcn/bitcode --offload-arch=gfx900 -std=c++14 -MD -MT CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o -MF CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o.d -o CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o -c /var/tmp/portage/dev-libs/rccl-5.2.0/work/rccl-rocm-5.2.0_build/src/collectives/device/functions.cpp
/usr/lib/llvm/14/bin/clang-offload-bundler: error: '/var/tmp/portage/dev-libs/rccl-5.2.0/temp/functions-303183/functions-gfx900.bc': No such file or directory
clang-14: error: clang-offload-bundler command failed with exit code 1 (use -v to see invocation)

How can this be fixed?
(In reply to perestoronin from comment #12) > How to fix ? fixed with patches wgetpaste rccl-namespace.patch rccl-nccl.patch https://gist.github.com/raw/838b5b8f28614a3c2202f30fc58aec26
(In reply to perestoronin from comment #13)
> (In reply to perestoronin from comment #12)
> > How to fix ?
>
> fixed with patches wgetpaste rccl-namespace.patch rccl-nccl.patch
> https://gist.github.com/raw/838b5b8f28614a3c2202f30fc58aec26

That's interesting. Can you explain a bit about this patch? Please also post the build.log after applying it, so we can see why this error occurred and how it got mitigated. I can't reproduce it on rccl-5.1.3.
Also, while packaging dev-libs/rccl-5.1.3 against rocm-5.1.3 (clang-14 based), I found a compilation error:

lld: error: ld-temp.o <inline asm>:1:26: specified hardware register is not supported on this GPU

when compiling for the gfx1030 target.

After backporting https://reviews.llvm.org/D119939 to llvm, this is resolved.
(In reply to Yiyang Wu from comment #14)
> (In reply to perestoronin from comment #13)
> > (In reply to perestoronin from comment #12)
> > > How to fix ?
> >
> > fixed with patches wgetpaste rccl-namespace.patch rccl-nccl.patch
> > https://gist.github.com/raw/838b5b8f28614a3c2202f30fc58aec26
>
> That's interesting. Can you explain a bit about this patch? And please post
> the build.log after this patch, and let's see why this error occured and got
> mitigated. I can't reproduce it on rccl-5.1.3

rccl-5.1.3 builds without patches.

rccl-nccl.patch - replace the obsolete pthread_yield with sched_yield
rccl-namespace.patch - fix paths and the roc::rccl namespace, remove obsolete hcc, remove the constant parallel jobs = 8, remove unsupported hc-function-calls ...

wgetpaste build.log
https://gist.github.com/raw/834efe16d81f808fe0f61819a570ddf8
(In reply to Yiyang Wu from comment #15)
> Also, while I'm packaging dev-libs/rccl-5.1.3 against rocm-5.1.3 (clang-14
> based), I found a compilation error:
>
> lld: error: ld-temp.o <inline asm>:1:26: specified hardware register is not
> supported on this GPU
>
> when compiling for gfx1030 target.
>
> After backporting https://reviews.llvm.org/D119939 to llvm, whis is resolved.

Thanks. I only have a gfx900 AMD Radeon Vega Frontier 16GB, but I recompiled clang with this patch by putting the patches into /etc/portage/patches/sys-devel/clang. I also applied the other necessary patches from this list:

wgetpaste 00-D69582.patch 01-D118949.patch 02-D119939.patch 03-D120557.patch 04-clang-declbase.patch
https://gist.github.com/raw/326b80564355b686b965ff15331aca8c
(In reply to perestoronin from comment #17)
> (In reply to Yiyang Wu from comment #15)
> > Also, while I'm packaging dev-libs/rccl-5.1.3 against rocm-5.1.3 (clang-14
> > based), I found a compilation error:
> >
> > lld: error: ld-temp.o <inline asm>:1:26: specified hardware register is not
> > supported on this GPU
> >
> > when compiling for gfx1030 target.
> >
> > After backporting https://reviews.llvm.org/D119939 to llvm, whis is resolved.
>
> Thanks, I have got only gfx900 AMD Radion Vega Frontier 16Gb, but recompile
> clang with this patch put patches to /etc/portage/patches/sys-devel/clang,
> also I applied other nessary patches from list:
> wgetpaste 00-D69582.patch 01-D118949.patch 02-D119939.patch 03-D120557.patch
> 04-clang-declbase.patch
> https://gist.github.com/raw/326b80564355b686b965ff15331aca8c

Correction: 02-D119939.patch has to be relocated to /etc/portage/patches/sys-devel/llvm
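A minimal sketch of the resulting user-patch layout, assuming the split described in this thread (D119939 goes to sys-devel/llvm, the remaining patches go to sys-devel/clang). The helper `patch_dest` is hypothetical, and the loop only prints the copy commands; nothing is written to /etc.

```shell
#!/bin/sh
# Hypothetical helper: destination directory for each patch from the list,
# following the split described in this thread.
patch_dest() {
    case $1 in
        02-D119939.patch) echo "/etc/portage/patches/sys-devel/llvm" ;;
        *)                echo "/etc/portage/patches/sys-devel/clang" ;;
    esac
}

for p in 00-D69582.patch 01-D118949.patch 02-D119939.patch \
         03-D120557.patch 04-clang-declbase.patch; do
    # Only print the copy commands; run them manually after review.
    echo "cp $p $(patch_dest "$p")/"
done
```

After the patches are in place, the affected packages have to be re-emerged for the user patches to be applied.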
(In reply to perestoronin from comment #17)
> Thanks, I have got only gfx900 AMD Radion Vega Frontier 16Gb, but recompile
> clang with this patch put patches to /etc/portage/patches/sys-devel/clang,
> also I applied other nessary patches from list:
> wgetpaste 00-D69582.patch 01-D118949.patch 02-D119939.patch 03-D120557.patch
> 04-clang-declbase.patch
> https://gist.github.com/raw/326b80564355b686b965ff15331aca8c

That's very helpful; it solves the major obstacles to upgrading to rocm-5.2.0 against llvm/clang-14.0.6.

If I understand correctly, these patches are meant to:

00-D69582.patch: support parallel jobs when compiling. ROCm packages suffer from long compilation times on some extra-large source files. For example, Kernels.cpp for rocBLAS can take 10 minutes to compile for a single GPU architecture, so 6 architectures take 1 hour. But parallel jobs for clang are quite a controversial topic, because the build system already uses multiprocessing, and adding parallel jobs violates MAKEOPTS.

01-D118949.patch: this enables code object v5; rocm-5.2 depends on this feature.

02-D119939.patch: this is the one I mentioned previously; it fixes rccl compilation for RDNA2 cards.

03-D120557.patch: I think it's a fix. I have seen a rocm-device-libs-5.2.0 runtime failure, and I guess this patch fixes it (a compatibility issue).

As we can see, ROCm-5.2 uses a lot of features from the llvm/clang-15 main branch, so if we want to build the entire 5.2 stack on llvm/clang-14 we need to backport these patches, and maybe even more. For now I suggest staying on 5.1.3; to use rocm-5.2.0 you can emerge clang-15.0.0.9999 to avoid patching llvm/clang heavily, I suppose.
(In reply to perestoronin from comment #12)
> I try to complie dev-libs/rccl-5.2.0, and I have got error:
> [...]
> /usr/lib/llvm/14/bin/clang-offload-bundler: error:
> '/var/tmp/portage/dev-libs/rccl-5.2.0/temp/functions-303183/functions-gfx900.bc': No such file or directory
> clang-14: error: clang-offload-bundler command failed with exit code 1 (use
> -v to see invocation)
>
> How to fix ?

Would you mind opening a new bug for ROCm 5.2.0 packages? ROCm is a fast-moving target, and this issue is focused on vanilla clang.
(In reply to Benda Xu from comment #20)
> Would minding opening a new bug for ROCm 5.2.0 packages? ROCm is a fast
> moving target and we focus this issue on the vanilla clang.

It is not a separate bug: I use rocm 5.2.0 with the old clang/llvm 14.0.6, get errors, and need to patch clang, llvm, and rocm too. Gentoo also does not use the standard paths, and there are other reasons for trouble. If I use clang/llvm 15 and compile and install the artifacts to /opt/... , some patches become unnecessary. But when I want to compile tensorflow with rocm, I get an error about "-march" when compiling tensorflow with llvm-roc, so to resolve it I need to have only one llvm/clang on my computer to avoid a circle of hell.
(In reply to Yiyang Wu from comment #19)
> [...]
> As we can see the ROCm-5.2 uses a lot of features in llvm/clang-15 main
> branch, so if we want to build the entire 5.2 stack upon llvm/clang-14 we
> need to backport these patches, and maybe even more. Currently I suggest we
> can stay on 5.1.3; to use rocm-5.2.0 you can emerge the clang-15.0.0.9999 to
> avoid patching llvm/clang heavily, I suppose.

Yes, that is all correct.

PS: I will apply other new patches from llvm/clang 15+ to my system on top of clang/llvm 14.0.6 as necessary.
I noticed llvm was updated with patches to make this easier, are we any closer to having 5.1 in tree? Is there an overlay with the development work?
(In reply to Mike Lothian from comment #23)
> I noticed llvm was updated with patches to make this easier,

Oh, I hadn't noticed that. Can you give a reference?

> are we any closer to having 5.1 in tree? Is there an overlay with the development work?

We are close to having the 5.1 toolchain in tree. See https://github.com/gentoo/gentoo/pull/26441

Only one bug remains: hip and rocm-comgr may break when clang is upgraded (some paths are hard-coded, so hip and rocm-comgr need a rebuild after upgrading clang, but currently this is not triggered automatically; a method without hard-coded paths is in development).

As for the math libraries, I have plenty at https://github.com/littlewu2508/gentoo/tree/rocm-5.1.3. They will be ready after https://github.com/gentoo/gentoo/pull/26441 gets merged.
This was the LLVM update it mentions ROCm: https://gitweb.gentoo.org/repo/gentoo.git/commit/sys-devel/llvm?id=c13c98b40beb7d18155a6a25ddfaf3d3ce6d81da I'm just testing your ebuilds now, is there an updated rocm-opencl-runtime too?
(In reply to Mike Lothian from comment #25)
> This was the LLVM update it mentions ROCm:
> https://gitweb.gentoo.org/repo/gentoo.git/commit/sys-devel/llvm?id=c13c98b40beb7d18155a6a25ddfaf3d3ce6d81da

This patch is for dev-libs/rccl-5.1.3

(In reply to Yiyang Wu from comment #15)
> Also, while I'm packaging dev-libs/rccl-5.1.3 against rocm-5.1.3 (clang-14
> based), I found a compilation error:
>
> lld: error: ld-temp.o <inline asm>:1:26: specified hardware register is not
> supported on this GPU
>
> when compiling for gfx1030 target.
>
> After backporting https://reviews.llvm.org/D119939 to llvm, whis is resolved.

(In reply to Mike Lothian from comment #25)
> I'm just testing your ebuilds now, is there an updated rocm-opencl-runtime
> too?

I just committed it in https://github.com/littlewu2508/gentoo/commit/533ea5270ea9b8bbea88a38107314a93ab2fb755. src_test is still buggy (some tests need DISPLAY, but virtualx does not seem to work).
Thanks, luxmark 3 works as long as I disable -cl-fast-relaxed-math

luxmark 4 crashed with a llvm error:

mesa: CommandLine Error: Option 'h' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Thread 1 "luxmark.bin" received signal SIGABRT, Aborted.
0x00007ffff0e90aec in ?? () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff0e90aec in ?? () from /lib64/libc.so.6
#1 0x00007ffff0e3e772 in raise () from /lib64/libc.so.6
#2 0x00007ffff0e2846a in abort () from /lib64/libc.so.6
#3 0x00007fff93f36b8e in llvm::report_fatal_error(llvm::Twine const&, bool) () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#4 0x00007fff93f36a36 in llvm::report_fatal_error(char const*, bool) () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#5 0x00007fff93f165b2 in ?? () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#6 0x00007fff93f0390f in llvm::cl::Option::addArgument() () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#7 0x00007fffd4e5a087 in ?? () from /usr/lib64/libamd_comgr.so.2
#8 0x00007fffd4e13073 in ?? () from /usr/lib64/libamd_comgr.so.2
#9 0x00007ffff7fcbf6e in ?? () from /lib64/ld-linux-x86-64.so.2
#10 0x00007ffff7fcc05c in ?? () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff0f5a243 in _dl_catch_exception () from /lib64/libc.so.6
#12 0x00007ffff7fd344f in ?? () from /lib64/ld-linux-x86-64.so.2
#13 0x00007ffff0f5a1ee in _dl_catch_exception () from /lib64/libc.so.6
#14 0x00007ffff7fd37f9 in ?? () from /lib64/ld-linux-x86-64.so.2
#15 0x00007ffff0e8ab98 in ?? () from /lib64/libc.so.6
#16 0x00007ffff0f5a1ee in _dl_catch_exception () from /lib64/libc.so.6
#17 0x00007ffff0f5a2a8 in _dl_catch_error () from /lib64/libc.so.6
#18 0x00007ffff0e8a659 in ?? () from /lib64/libc.so.6
#19 0x00007ffff0e8ac50 in dlopen () from /lib64/libc.so.6
#20 0x00007fffd770bf6d in ?? () from /usr/lib64/libamdocl64.so
#21 0x00007fffd76f7659 in ?? () from /usr/lib64/libamdocl64.so
#22 0x00007ffff0e93d8a in ?? () from /lib64/libc.so.6
#23 0x00007fffd76ede58 in ?? () from /usr/lib64/libamdocl64.so
#24 0x00007fffd7796858 in ?? () from /usr/lib64/libamdocl64.so
#25 0x00007fffd7795c80 in ?? () from /usr/lib64/libamdocl64.so
#26 0x00007fffd76edafc in ?? () from /usr/lib64/libamdocl64.so
#27 0x00007fffd778c9df in ?? () from /usr/lib64/libamdocl64.so
#28 0x00007fffd7697caa in ?? () from /usr/lib64/libamdocl64.so
#29 0x00007ffff0e93d8a in ?? () from /lib64/libc.so.6
#30 0x00007fffd7697ba6 in clIcdGetPlatformIDsKHR () from /usr/lib64/libamdocl64.so
#31 0x00007ffff5756296 in ?? () from /usr/lib64/libOpenCL.so.1
#32 0x00007ffff0e93d8a in ?? () from /lib64/libc.so.6
#33 0x00007ffff575b312 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1
#34 0x0000555555d08fe0 in cl::Platform::get(std::vector<cl::Platform, std::allocator<cl::Platform> >*) ()
#35 0x0000555555d051c0 in luxrays::Context::Context(void (*)(char const*), luxrays::Properties const&) ()
#36 0x00005555557d3aa6 in luxcore::GetOpenCLDeviceDescs() ()
#37 0x0000555555794032 in HardwareTreeModel::HardwareTreeModel(MainWindow*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#38 0x000055555579cf7d in LuxMarkApp::Init(LuxMarkAppMode, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, bool, bool) ()
#39 0x000055555572c145 in main ()
(gdb)
(In reply to Mike Lothian from comment #27)
> Thanks, luxmark 3 works as long as I disable -cl-fast-relaxed-math luxmark 4
> crashed with a llvm error:
>
> mesa: CommandLine Error: Option 'h' registered more than once!
> LLVM ERROR: inconsistency in registered CommandLine options

This is commonly seen when multiple llvm versions are mixed together. Can you confirm that only llvm-14 is installed?
Yes, only one llvm 14.0.6 My guess would be llvm is already loaded with CommandLine options, then this comes along and either does the same ones, or incompatible ones
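The guess above can be illustrated with a toy model: one process-wide list of option names that rejects a name registered a second time. This is only a sketch of the failure mode, not LLVM's actual code; the names `registered_options` and `register_option` are made up for the example.

```shell
#!/bin/sh
# Toy model only: one process-wide list of option names, shared by every
# "library" loaded into the process. This is not LLVM's actual code.
registered_options=""

register_option() {
    case " $registered_options " in
        *" $1 "*)
            echo "CommandLine Error: Option '$1' registered more than once!" >&2
            return 1
            ;;
    esac
    registered_options="$registered_options $1"
}

# Two independent components each try to register an option called 'h'.
register_option h && echo "first registration of 'h' accepted"
register_option h 2>/dev/null || echo "second registration of 'h' rejected"
```

In this model it does not matter whether the second registration comes from a different llvm version or from another component using the same library; any repeated name is an error.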
Slightly better gdb output

mesa: CommandLine Error: Option 'h' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Thread 1 "luxmark.bin" received signal SIGABRT, Aborted.
0x00007ffff0e90aec in ?? () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff0e90aec in ?? () from /lib64/libc.so.6
#1 0x00007ffff0e3e772 in raise () from /lib64/libc.so.6
#2 0x00007ffff0e2846a in abort () from /lib64/libc.so.6
#3 0x00007fff97f36b8e in llvm::report_fatal_error(llvm::Twine const&, bool) () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#4 0x00007fff97f36a36 in llvm::report_fatal_error(char const*, bool) () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#5 0x00007fff97f165b2 in ?? () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#6 0x00007fff97f0390f in llvm::cl::Option::addArgument() () from /usr/lib/llvm/14/lib64/libLLVM-14.so
#7 0x00007fffb148533b in llvm::cl::alias::done (this=0x7fffb1e0be60 <SectionHeadersShorter>) at /usr/lib/llvm/14/include/llvm/Support/CommandLine.h:1910
#8 0x00007fffb14884bc in llvm::cl::alias::alias<char [2], llvm::cl::desc, llvm::cl::aliasopt> (this=0x7fffb1e0be60 <SectionHeadersShorter>) at /usr/lib/llvm/14/include/llvm/Support/CommandLine.h:1928
#9 0x00007fffb1481715 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at /var/tmp/portage/dev-libs/rocm-comgr-5.1.3/work/ROCm-CompilerSupport-rocm-5.1.3/lib/comgr/src/comgr-objdump.cpp:180
#10 0x00007fffb148259d in _GLOBAL__sub_I_comgr_objdump.cpp(void) () at /var/tmp/portage/dev-libs/rocm-comgr-5.1.3/work/ROCm-CompilerSupport-rocm-5.1.3/lib/comgr/src/comgr-objdump.cpp:2440
#11 0x00007ffff7fcbf6e in ?? () from /lib64/ld-linux-x86-64.so.2
#12 0x00007ffff7fcc05c in ?? () from /lib64/ld-linux-x86-64.so.2
#13 0x00007ffff0f5a243 in _dl_catch_exception () from /lib64/libc.so.6
#14 0x00007ffff7fd344f in ?? () from /lib64/ld-linux-x86-64.so.2
#15 0x00007ffff0f5a1ee in _dl_catch_exception () from /lib64/libc.so.6
#16 0x00007ffff7fd37f9 in ?? () from /lib64/ld-linux-x86-64.so.2
#17 0x00007ffff0e8ab98 in ?? () from /lib64/libc.so.6
#18 0x00007ffff0f5a1ee in _dl_catch_exception () from /lib64/libc.so.6
#19 0x00007ffff0f5a2a8 in _dl_catch_error () from /lib64/libc.so.6
#20 0x00007ffff0e8a659 in ?? () from /lib64/libc.so.6
#21 0x00007ffff0e8ac50 in dlopen () from /lib64/libc.so.6
#22 0x00007fffd450bf6d in ?? () from /usr/lib64/libamdocl64.so
#23 0x00007fffd44f7659 in ?? () from /usr/lib64/libamdocl64.so
#24 0x00007ffff0e93d8a in ?? () from /lib64/libc.so.6
#25 0x00007fffd44ede58 in ?? () from /usr/lib64/libamdocl64.so
#26 0x00007fffd4596858 in ?? () from /usr/lib64/libamdocl64.so
#27 0x00007fffd4595c80 in ?? () from /usr/lib64/libamdocl64.so
#28 0x00007fffd44edafc in ?? () from /usr/lib64/libamdocl64.so
#29 0x00007fffd458c9df in ?? () from /usr/lib64/libamdocl64.so
#30 0x00007fffd4497caa in ?? () from /usr/lib64/libamdocl64.so
#31 0x00007ffff0e93d8a in ?? () from /lib64/libc.so.6
#32 0x00007fffd4497ba6 in clIcdGetPlatformIDsKHR () from /usr/lib64/libamdocl64.so
#33 0x00007ffff5756296 in ?? () from /usr/lib64/libOpenCL.so.1
#34 0x00007ffff0e93d8a in ?? () from /lib64/libc.so.6
#35 0x00007ffff575b312 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1
#36 0x0000555555d08fe0 in cl::Platform::get(std::vector<cl::Platform, std::allocator<cl::Platform> >*) ()
#37 0x0000555555d051c0 in luxrays::Context::Context(void (*)(char const*), luxrays::Properties const&) ()
#38 0x00005555557d3aa6 in luxcore::GetOpenCLDeviceDescs() ()
#39 0x0000555555794032 in HardwareTreeModel::HardwareTreeModel(MainWindow*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#40 0x000055555579cf7d in LuxMarkApp::Init(LuxMarkAppMode, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, bool, bool) ()
#41 0x000055555572c145 in main ()
(gdb) Quit
So it is clashing with radeonsi's usage of llvm. Forcing softpipe allows the app to run just fine:

LIBGL_ALWAYS_SOFTWARE=1 GALLIUM_DRIVER=softpipe
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=82a6c2ca05ccf2dad8cbd75a813d6deafe4f105f

commit 82a6c2ca05ccf2dad8cbd75a813d6deafe4f105f
Author: Yiyang Wu <xgreenlandforwyy@gmail.com>
AuthorDate: 2022-06-15 12:42:07 +0000
Commit: Benda Xu <heroxbd@gentoo.org>
CommitDate: 2022-08-06 14:22:03 +0000

    dev-util/hip: add 5.1.3

    Switch from llvm-roc to vanilla clang
    --
    New variables about clang path in hipvars.pm
    hip-5.1.3-clang-include-path.patch to fix hipcc finding clang
    hip-5.1.3-rocm-path.patch: add compile flag to support unpatched clang
    Using sed cmd to fix clang header location in cmake

    Closes: https://bugs.gentoo.org/851702
    Reference: https://github.com/ROCm-Developer-Tools/hipamd/issues/18
    Reference: https://github.com/ROCm-Developer-Tools/hipamd/issues/27
    Signed-off-by: Yiyang Wu <xgreenlandforwyy@gmail.com>
    Signed-off-by: Benda Xu <heroxbd@gentoo.org>

 dev-util/hip/Manifest | 6 +
 ...0001-SWDEV-316128-HIP-surface-API-support.patch | 35 +++++
 .../hip/files/hip-5.1.3-clang-include-path.patch | 12 ++
 .../hip/files/hip-5.1.3-fix-hip_prof_gen.patch | 38 +++++
 dev-util/hip/files/hip-5.1.3-rocm-path.patch | 13 ++
 dev-util/hip/files/hipvars-5.1.3.pm | 21 +++
 dev-util/hip/hip-5.1.3.ebuild | 161 +++++++++++++++++++++
 7 files changed, 286 insertions(+)

Additionally, it has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=8a64d9b3fa74ab7ee3ec9b4d85f813d63648a130

commit 8a64d9b3fa74ab7ee3ec9b4d85f813d63648a130
Author: Benda Xu <heroxbd@gentoo.org>
AuthorDate: 2022-08-06 13:47:56 +0000
Commit: Benda Xu <heroxbd@gentoo.org>
CommitDate: 2022-08-06 14:22:32 +0000

    dev-util/rocm-clang-ocl: use system clang.

    Bug: https://bugs.gentoo.org/851702
    Package-Manager: Portage-3.0.30, Repoman-3.0.3
    Signed-off-by: Benda Xu <heroxbd@gentoo.org>

 .../files/rocm-clang-ocl-5.0.2-system-llvm.patch | 17 +++++++++++++++++
 ...-ocl-5.0.2.ebuild => rocm-clang-ocl-5.0.2-r1.ebuild} | 9 +++++----
 2 files changed, 22 insertions(+), 4 deletions(-)
(In reply to Mike Lothian from comment #31)
> So it is clashing with radeonsi's usage of llvm, forcing softpipe allows the
> app to run just fine LIBGL_ALWAYS_SOFTWARE=1 GALLIUM_DRIVER=softpipe

Thanks, Mike, for sharing your findings. I have merged Yiyang's version of ROCm, which uses the system's vanilla llvm, into the tree as 5.1.3. Please check whether the conflict between radeonsi's and ROCm's usage of llvm still exists, and open a new bug for it if so.

Yours,
Benda
In an idle moment, I tried running "clinfo". In return I got:

mesa: CommandLine Error: Option 'h' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
Aborted

and a search found this fixed bug. However, as far as I can tell, I have only one LLVM installed, and everything relevant looks newer than the levels herein, specifically:

equery list '*llvm*'
[IP-] [ ] sys-devel/llvm-15.0.7:15/15
[IP-] [ ] sys-devel/llvm-common-15.0.7:0
[IP-] [ ] sys-devel/llvm-toolchain-symlinks-15-r1:15
[IP-] [ ] sys-devel/llvmgold-15:0

equery list mesa rocm-opencl-runtime
[IP-] [ ] media-libs/mesa-22.2.5:0
[IP-] [ ] dev-libs/rocm-opencl-runtime-5.3.3-r1:0/5.3

and no hip or rocm-clang-ocl.

I found a fix on the internet, but it doesn't work:

LIBGL_ALWAYS_SOFTWARE=1 GALLIUM_DRIVER=softpipe clinfo

produces the same results as above. "rocminfo" works OK.

I've no idea if this is actually important; it just looks suspicious!
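Editor's note: one way to double-check the "only one LLVM" observation is to look at which libLLVM each library involved resolves to, for example by running ldd on /usr/lib64/libamdocl64.so and on the mesa driver. The sketch below only parses text in the usual ldd output format; the sample input is hypothetical, not output captured from this system, and a statically linked LLVM would not show up this way.

```python
import re
from typing import List

# Matches ldd lines such as:
#   libLLVM-15.so => /usr/lib/llvm/15/lib64/libLLVM-15.so (0x00007f...)
LLVM_LINE = re.compile(r"^\s*(libLLVM\S*)\s+=>\s+(\S+)")


def llvm_dependencies(ldd_output: str) -> List[str]:
    """Return the distinct resolved paths of libLLVM* entries in ldd output."""
    paths: List[str] = []
    for line in ldd_output.splitlines():
        match = LLVM_LINE.match(line)
        if not match:
            continue
        path = match.group(2)
        # Skip unresolved entries ("=> not found") and repeated paths.
        if path.startswith("/") and path not in paths:
            paths.append(path)
    return paths


# Hypothetical sample in ldd's format, for illustration only.
sample = (
    "\tlinux-vdso.so.1 (0x00007ffc00000000)\n"
    "\tlibLLVM-15.so => /usr/lib/llvm/15/lib64/libLLVM-15.so (0x00007f0000000000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0100000000)\n"
)
print(llvm_dependencies(sample))
```

If two libraries loaded into the same process report different libLLVM paths, that points at a mixed-LLVM setup; if they report the same path, the duplicate option is more likely registered by one of the libraries itself.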
(In reply to Paul Gover from comment #34)
> In an idle moment, I tried running "clinfo". In return I got:
> mesa: CommandLine Error: Option 'h' registered more than once!
> LLVM ERROR: inconsistency in registered CommandLine options
> Aborted
> and a search found this fixed bug. However, as far as I can tell, I have
> only one LLVM installed, and everything relevant looks newer than the levels
> herein, specifically:
> equery list '*llvm*'
> [IP-] [ ] sys-devel/llvm-15.0.7:15/15
> [IP-] [ ] sys-devel/llvm-common-15.0.7:0
> [IP-] [ ] sys-devel/llvm-toolchain-symlinks-15-r1:15
> [IP-] [ ] sys-devel/llvmgold-15:0
>
> equery list mesa rocm-opencl-runtime
> [IP-] [ ] media-libs/mesa-22.2.5:0
> [IP-] [ ] dev-libs/rocm-opencl-runtime-5.3.3-r1:0/5.3
>
> and no hip or rocm-clang-ocl.
>
> I found a fix on the internet, but it doesn't work:
> LIBGL_ALWAYS_SOFTWARE=1 GALLIUM_DRIVER=softpipe clinfo
> produces the same results as above. "rocminfo" works OK.
>
> I've no idea if this is actually important; it just looks suspicious!

That seems strange. Can you try re-emerging rocm-comgr and rocm-opencl-runtime and see whether that resolves it?
(In reply to Paul Gover from comment #34)
> In an idle moment, I tried running "clinfo". In return I got:
> mesa: CommandLine Error: Option 'h' registered more than once!
> LLVM ERROR: inconsistency in registered CommandLine options
> Aborted
> and a search found this fixed bug. However, as far as I can tell, I have
> only one LLVM installed, and everything relevant looks newer than the levels
> herein, specifically:
>
> equery list mesa rocm-opencl-runtime
> [IP-] [ ] media-libs/mesa-22.2.5:0
> [IP-] [ ] dev-libs/rocm-opencl-runtime-5.3.3-r1:0/5.3
>
> and no hip or rocm-clang-ocl.
>
> I've no idea if this is actually important; it just looks suspicious!

Strange, everything works fine for me with:

dev-util/hip-5.4.3 (/usr/bin/hipcc)
dev-libs/rocm-opencl-runtime-5.4.3 (/usr/bin/clinfo)
(In reply to Paul Gover from comment #34)
> In an idle moment, I tried running "clinfo". In return I got:
> mesa: CommandLine Error: Option 'h' registered more than once!
> LLVM ERROR: inconsistency in registered CommandLine options
> Aborted
> and a search found this fixed bug. However, as far as I can tell, I have
> only one LLVM installed, and everything relevant looks newer than the levels
> herein, specifically:

Blender has the same issue with ROCm-5.3.3 (using the same clang). It is solved by backporting https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/commit/2d05f9e480cbc591a6b888dfd49d9f7ef1bef25f

Maybe this can help, although we don't know why you are the only one who hits this with clinfo.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=82a2720349d070fa86090fd9434bcfae75260a68

commit 82a2720349d070fa86090fd9434bcfae75260a68
Author: Yiyang Wu <xgreenlandforwyy@gmail.com>
AuthorDate: 2023-03-01 02:54:09 +0000
Commit: Sam James <sam@gentoo.org>
CommitDate: 2023-03-07 07:56:59 +0000

    dev-libs/rocm-comgr: Fix comgr and mesa LLVM option collision

    >=dev-libs/rocm-comgr-5.3 and <=9999 needs backport a patch from upstream to
    avoid register -h command line option, which resolves conflicts with
    media-libs/mesa. Benefits media-gfx/blender.

    Bug: https://bugs.gentoo.org/851702
    Reference: https://github.com/gentoo/gentoo/pull/27552
    Signed-off-by: Yiyang Wu <xgreenlandforwyy@gmail.com>
    Closes: https://github.com/gentoo/gentoo/pull/29866
    Signed-off-by: Sam James <sam@gentoo.org>

 .../files/rocm-comgr-5.3.3-remove-h-option.patch | 43 ++++++++++++++++++++++
 ...-5.3.3-r1.ebuild => rocm-comgr-5.3.3-r2.ebuild} | 1 +
 ...mgr-5.4.3.ebuild => rocm-comgr-5.4.3-r1.ebuild} | 1 +
 3 files changed, 45 insertions(+)