New AMD GPUs are now available with some initial software support: https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/CMakeLists.txt already lists the new amdgpu_targets gfx1100, gfx1101 and gfx1102. rocm.eclass should prepare these new targets for proper ebuild support later on.

Reproducible: Always
I think it should be added to 5) unofficial_amdgpu_targets, since these GPUs are consumer-class GPUs and official support is unlikely.
Thank you Igor! Have you tested one of the RDNA 3 cards? I don't have one at hand, unfortunately.
There is a long way to go to really test it in compute tasks. We need working ROCm ebuilds for it, or at least a working docker image. Some work has been done for the previous generation of cards (the ebuilds are still at ROCm 5.3.x, while a 5.4.x release and a 5.5 git branch already exist). As for graphics tests, everything is all right: the initial mesa and kernel support is excellent.
To be clear, some reports exist of HIP working in 5.4.2 (this still needs to be verified). Full gfx1100 family support is expected for the 5.5 release.
I have started to implement/merge some 9999(git)/5.4.2 ebuilds. Currently I don't fully understand the Tensile dependency in the rocBLAS ebuild, with some sources used from littlewu2508's archives.
(In reply to Igor Ulyanov from comment #5)
> Currently i don't fully understand Tensile dependency in rocBLAS ebuild with
> some sources used from littlewu2508 archives.

Tensile is a code generator for GPU kernels. rocBLAS uses Tensile to generate several variants of the standard BLAS functions and profiles their performance before selecting one.
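The generate-and-profile idea can be pictured with a toy sketch (plain shell, purely illustrative; `pick_fastest` is a hypothetical helper, not Tensile's actual interface — Tensile does this for many generated GPU kernel variants per BLAS function):

```shell
# Time each candidate command and print the fastest one, the way a
# kernel autotuner benchmarks variants before selecting a winner.
pick_fastest() {
    local best="" best_ns=9999999999
    local cmd start ns
    for cmd in "$@"; do
        start=$(date +%s%N)
        eval "$cmd" > /dev/null
        ns=$(( $(date +%s%N) - start ))
        if [ "$ns" -lt "$best_ns" ]; then
            best_ns=$ns
            best=$cmd
        fi
    done
    echo "$best"
}

# Two "variants" of the same work with different runtimes:
pick_fastest "sleep 0.01" "sleep 0.2"   # → sleep 0.01
```

rocBLAS consumes the winning kernels via so-called logic files, which is why the Tensile dependency appears at build time.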
I'm hesitating on turning on RDNA3 cards because:

1. The ROCm-5.4 toolchain is quite incompatible with both llvm/clang-15 and llvm/clang-16.
2. Although the ROCm-5.3 toolchain already contains RDNA3 support and may build ROCm 5.4.x libraries (like rocBLAS-5.4.2), there can still be issues, and I need testing.
3. I don't have any RDNA3 cards for testing purposes.
4. The RDNA3 support matrix is not yet complete; MIOpen for rocm-5.4.2 lacks RDNA3 support.

So I'm actually waiting for the ROCm-5.5 release. Now that the ROCm-llvm development branch has shifted to an llvm-16 base, I hope compatibility improves. If I have time I'll try building rocBLAS-5.4.2 with hip-5.3.3, but I shall mask them until tested on hardware. It would be great if you have hardware for testing.
(In reply to Igor Ulyanov from comment #5) > I have started to implement/merge some 9999(git)/5.4.2 ebuilds. > > Currently i don't fully understand Tensile dependency in rocBLAS ebuild with > some sources used from littlewu2508 archives. Have you achieved any progress? You can find my Tensile-5.4.2 ebuilds at https://github.com/littlewu2508/gentoo/tree/rocm-scilibs-5.3.3/dev-util/Tensile I haven't tested it fully.
Changed "${FILESDIR}"/${PN}-5.1.3-gentoopath.patch to "${FILESDIR}"/${PN}-5.4.2-gentoopath.patch in the Tensile-5.4.2 ebuild to build Tensile.

The rocBLAS merge fails with:

-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100 - Success
/usr/bin/python3.11 -m venv /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv --system-site-packages --clear
virtualenv python version: /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3
Python 3.11.1
/home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3 -m pip install /usr/share/Tensile
ERROR: Directory '/usr/share/Tensile' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
WARNING: There was an error checking the latest version of pip.

It seems the ebuild utility tries to use python3.11 while eselect python is set to 3.10
Created attachment 849195 [details] rocBLAS-5.4.2 build log
The same result with python3.10:

/usr/bin/python3.10 -m venv /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv --system-site-packages --clear
virtualenv python version: /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3
Python 3.10.9
/home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3 -m pip install /usr/share/Tensile
ERROR: Directory '/usr/share/Tensile' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
WARNING: There was an error checking the latest version of pip.
Sorry, it seems I had forgotten to prepare the new patches. 5.4.2-unbundle-Tensile.patch lets the build start.
I can confirm rocBLAS indeed successfully merges with dev-util/hip-5.3.3 and fails to merge with 5.4.2.
(In reply to Igor Ulyanov from comment #13) > i can confirm rocBLAS indeed successfully merges with dev-util/hip-5.3.3 and > fails to merge with 5.4.2 Sounds great! Do you mean sci-libs/rocBLAS-5.4.2 can be built with dev-util/hip-5.3.3 but not dev-util/hip-5.4.2 ?
(In reply to Igor Ulyanov from comment #1) > I think it should be added to 5) unofficial_amdgpu_targets since this GPUs > are cunsumer class GPUs and official support is unlikely. Can you access RX 7900XT/XTX? It would be great if you can help testing math libraries.
(In reply to Yiyang Wu from comment #14)
> (In reply to Igor Ulyanov from comment #13)
> > i can confirm rocBLAS indeed successfully merges with dev-util/hip-5.3.3 and
> > fails to merge with 5.4.2
>
> Sounds great! Do you mean sci-libs/rocBLAS-5.4.2 can be built with
> dev-util/hip-5.3.3 but not dev-util/hip-5.4.2 ?

Exactly. There are some failures with clang-15 with dev-util/hip-5.4.2.

I have built sci-libs/rocSOLVER-5.4.2 and sci-libs/hipBLAS-5.4.2.

The first test of the libraries with AMD HIP fails with:

Creating backend [hip]...
lc0: /var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&, size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
(In reply to Igor Ulyanov from comment #16)
> Exactly. Some failure with clang-15 with dev-util/hip-5.4.2
>
> I have built sci-libs/rocSOLVER-5.4.2 and sci-libs/hipBLAS-5.4.2
>
> first test libraries with amd hip fails with:
>
> Creating backend [hip]...
> lc0: /var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void
> rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&,
> size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation
> failed"' failed.

Seems like a roct-thunk-interface/rocr-runtime version mismatch. Or maybe it's because you are using hip-5.4 on top of rocr-5.3?
(In reply to Yiyang Wu from comment #17)
> (In reply to Igor Ulyanov from comment #16)
> > Exactly. Some failure with clang-15 with dev-util/hip-5.4.2
> >
> > I have built sci-libs/rocSOLVER-5.4.2 and sci-libs/hipBLAS-5.4.2
> >
> > first test libraries with amd hip fails with:
> >
> > Creating backend [hip]...
> > lc0: /var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void
> > rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&,
> > size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation
> > failed"' failed.
>
> Seems like roct-thunk-interface/rocr-runtime version mismatch. Or maybe
> because you are using hip-5.4 upon rocr-5.3?

I have also built dev-libs/roct-thunk-interface-5.4.2 and dev-libs/rocr-runtime-5.4.2 with the following patch.
Created attachment 849293 [details, diff] rocm device libs 5.4.2 patch
dev-libs/rocm-opencl-runtime-5.3.3 with clinfo reports the same error:

clinfo
clinfo: /var/tmp/portage/dev-libs/rocr-runtime-5.4.2/work/ROCR-Runtime-rocm-5.4.2/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&, size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
The same with dev-libs/rocm-opencl-runtime-5.4.2.

It seems there is a problem with rocr-runtime itself or its dependencies (both 5.3.3 and 5.4.2).

There was some hope for AMD HIP working without rocr-runtime, but it seems not to be true (or at least not for lc0; I will probably try another AMD HIP test if I find one).

I could also try rocr-runtime-9999 with the 5.5 branch, but there is no 9999 ebuild available currently.
(In reply to Igor Ulyanov from comment #21)
> The same with dev-libs/rocm-opencl-runtime-5.4.2
>
> It seems there is a problem with rocr-runtime itself or its dependencies
> (both 5.3.3 and 5.4.2).
>
> There were some hope for amd hip working without rocr-runtime, but it seems
> not to be true (or not for lc0, i will probably try some another amd hip
> test if i find one).
>
> I could also try rocr-runtime-9999 with 5.5 branch but there is no 9999
> ebuild available currently.

I already got roct-thunk-interface-9999, rocr-runtime-9999, rocm-comgr-9999 and hip-9999 working well on my gfx1031 GPU (using llvm-16), nearly passing all tests.

rocr-runtime is a core component; I do not believe any ROCm-related computation workload can bypass libhsa-runtime64.so.

BTW, roct-thunk-interface-9999 and rocr-runtime-9999 are actually no different from the 5.4.2 versions.
(In reply to Igor Ulyanov from comment #19)
> Created attachment 849293 [details, diff]
> rocm device libs 5.4.2 patch

Fedora has a similar patch for rocm-device-libs-5.4, just for llvm-15 compatibility, but it surely sacrifices some RDNA3 features. I prefer using llvm-16 with a different set of patches. My 9999 ebuilds are located at https://github.com/littlewu2508/gentoo/tree/rocm-9999
Still no further progress with llvm-15. Need some more time to test with llvm-16.
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=88729b304dbfe1e9bf1d5384bf125704e28ff956

commit 88729b304dbfe1e9bf1d5384bf125704e28ff956
Author:     Yiyang Wu <xgreenlandforwyy@gmail.com>
AuthorDate: 2023-01-28 10:40:53 +0000
Commit:     Benda Xu <heroxbd@gentoo.org>
CommitDate: 2023-01-31 14:28:52 +0000

    rocm.eclass: support RDNA3 GPU for >=5.4, remove <5

    ROCm libraries with version <5 are cleaned up, remove version 4
    support for rocm.eclass.

    RDNA3 has initial support in ROCm libraries starting from 5.4
    releases. Enable gfx110* amdgpu_targets in rocm.eclass and add
    corresponding description.

    Closes: https://bugs.gentoo.org/891499
    Closes: https://github.com/gentoo/gentoo/pull/29320
    Signed-off-by: Yiyang Wu <xgreenlandforwyy@gmail.com>
    Signed-off-by: Benda Xu <heroxbd@gentoo.org>

 eclass/rocm.eclass                | 13 +++++++------
 profiles/desc/amdgpu_targets.desc |  5 ++++-
 2 files changed, 11 insertions(+), 7 deletions(-)
Some compute tests work with the HSA_OVERRIDE_GFX_VERSION=10.3.0 environment variable set, using llvm-16 and the 9999 ebuilds.
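For anyone reproducing this: the override is just an environment variable read by the ROCr runtime, so a minimal session might look like the sketch below (the rocminfo check is left commented out since it needs a working ROCm stack on an AMD GPU; whether the gfx1030 code objects are actually safe to run on a gfx1100 chip is not guaranteed):

```shell
# Present the GPU to the HSA runtime as gfx1030 (RDNA2) so libraries
# that ship no gfx1100 code objects can still find kernels to dispatch.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
echo "override: ${HSA_OVERRIDE_GFX_VERSION}"
# rocminfo | grep -m1 gfx    # verify what the runtime now reports
```

This is a per-process workaround, not a fix; native gfx1100 support in the libraries is still the goal.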
Please also note the discussion in https://github.com/RadeonOpenCompute/ROCm/issues/1880

-DCMAKE_CXX_FLAGS_RELEASE="${CXXFLAGS} -DNDEBUG" should be set in the rocr-runtime ebuild.
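A sketch of how that flag might be injected, assuming the ebuild uses cmake.eclass (hypothetical src_configure fragment, not the actual ebuild; -DNDEBUG compiles out the assert() that fires in amd_gpu_agent.cpp):

```shell
# Hypothetical src_configure fragment for the rocr-runtime ebuild.
# Without -DNDEBUG the Release build keeps assert() enabled, which is
# what aborts in AssembleShader on unsupported GPUs.
src_configure() {
    local mycmakeargs=(
        -DCMAKE_CXX_FLAGS_RELEASE="${CXXFLAGS} -DNDEBUG"
        -DCMAKE_BUILD_TYPE=Release
    )
    cmake_src_configure
}
```

Masking asserts only hides the symptom, of course; the underlying shader allocation failure would still need an upstream fix.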
(In reply to Igor Ulyanov from comment #26)
> Some compute tests works with HSA_OVERRIDE_GFX_VERSION=10.3.0 environment
> variable set. Used llvm-16 and 9999 ebuilds.

Which GPU are you using? If it's a 7900XT/7900XTX, I think you won't need to set HSA_OVERRIDE_GFX_VERSION=10.3.0 for the 9999 ebuilds.

Also, I'm quite curious about the rocBLAS performance and correctness on RDNA3. You can turn on the "benchmark" USE flag to install rocblas-bench.
(In reply to Igor Ulyanov from comment #26) > Some compute tests works with HSA_OVERRIDE_GFX_VERSION=10.3.0 environment > variable set. Used llvm-16 and 9999 ebuilds. I pushed rocBLAS and miopen 9999 ebuilds to https://github.com/littlewu2508/gentoo/tree/rocm-9999
src_unpack in rocBLAS-9999 uses a fixed tar.gz name; it should use 5.5.0 for 9999.
Run Build Command(s): /usr/bin/ninja cmTC_1d97f &&
[1/2] Building CXX object CMakeFiles/cmTC_1d97f.dir/testCXXCompiler.cxx.o
FAILED: CMakeFiles/cmTC_1d97f.dir/testCXXCompiler.cxx.o
/usr/bin/hipcc -O2 -pipe -march=native -o CMakeFiles/cmTC_1d97f.dir/testCXXCompiler.cxx.o -c /home/notmpfs/portage/sci-libs/rocBLAS-9999/work/rocBLAS-9999_build/CMakeFiles/CMakeScratch/TryCompile-GcJ0wb/testCXXCompiler.cxx
fatal error: cannot open file '/usr/lib/amdgcn/bitcode/ocml.bc': Unknown attribute kind (86) (Producer: 'LLVM16.0.0git6dc85bd3' Reader: 'LLVM 15.0.7')
1 error generated when compiling for gfx1100.
ninja: build stopped: subcommand failed.
equery b /usr/lib/amdgcn/bitcode/ocml.bc
dev-libs/rocm-device-libs-9999 (/usr/lib/amdgcn/bitcode/ocml.bc)

I have dev-libs/rocm-device-libs merged with llvm-16
To be clear, the tests that worked were clinfo and rocminfo. Other, more complicated software fails with segfaults, incorrect results or exceptions, for example:

Unhandled exception: clCreateCommandQueue

https://bpa.st/I3ZLM
(In reply to Igor Ulyanov from comment #33)
> To be clear, some tests worked was clinfo and rocminfo. Other more
> complicated software fails with segfaults, improper results or exceptions,
> for example:
>
> Unhandled exception: clCreateCommandQueue
>
> https://bpa.st/I3ZLM

I guess there's still immature stuff for RDNA3. Have you tried emerging rocm-opencl-runtime with src_test turned on? Maybe you can detect more bugs there and report them to upstream.
After installing some new packages I get this failure:

rocblas-bench: CommandLine Error: Option 'use-dbg-addr' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Currently I have some problems emerging clang:16 rc2, so I need some more time to fix and build packages. I have even tried to recover clang:16 from your docker image with quickpkg, but it generates an xpak file instead of a tbz2, so no progress here either.
The problem with clang:16 is that the amdgpu-arch tool in clang links statically against rocr-runtime. clang:16 rc2 compiles fine if rocr-runtime is not installed.
(In reply to Igor Ulyanov from comment #36)
> The problem with clang:16 is a static linking amdgpu-arch tool in clang with
> rocr-runtime. clang:16 rc2 compiles fine if rocr-runtime is not installed.

You're right. The llvm team is already discussing this: https://github.com/llvm/llvm-project/issues/60660#issuecomment-1425780289
After two days of hacking I have completed most of the rocm-9999 ebuilds, as well as the ROCm enablement of pytorch (caffe2). The rocm-9999 ebuilds are currently placed at https://github.com/littlewu2508/gentoo/tree/rocm-9999, together with rocm-enabled caffe2 and torchvision (copied from ::science). I also pushed my docker image for testing to https://hub.docker.com/r/littlewu2508/clang-16_rc2-pytorch-1.12-rocm-9999-5.5-gfx1100; anyone can try it with `docker run -d --network=host --device=/dev/kfd --device=/dev/dri --group-add video littlewu2508/clang-16_rc2-pytorch-1.12-rocm-9999-5.5-gfx1100`
lc0 OpenCL working test: http://0x0.st/Hr_W.txt
lc0 hip working test: http://0x0.st/Hr_v.txt
rocblas-test also works as expected
(In reply to Igor Ulyanov from comment #40)
> rocblas-test also works as expected

That sounds exciting. Have you tried running some rocblas-bench cases, such as

rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --ldb 4096 --beta 0 --ldc 4096

to see the FP32 performance, and

rocblas-bench -f gemm_ex --a_type f16_r --b_type f16_r --c_type f32_r --d_type f32_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --ldb 4096 --beta 0 --ldc 4096 --ldd 4096

to see the performance of the AI instructions in RDNA3? More benchmarks can be found at https://rocblas.readthedocs.io/en/rocm-5.4.3/Programmers_Guide.html#rocblas-bench
http://0x0.st/Hr46.txt

What additional tests are you interested in?
(In reply to Igor Ulyanov from comment #42)
> http://0x0.st/Hr46.txt
>
> what additional tests are you interested?

Thanks! It would be great if you could run more of the benchmarks in the two code blocks "Non-HPA cases (gemm)" and "HPA cases (gemm_ex)" below the table at https://rocblas.readthedocs.io/en/rocm-5.4.3/Programmers_Guide.html#id10

The result is somewhat unexpected to me. The gemm performance looks bad: a gemm of this size should usually hit the theoretical FP32 performance, which ought to be 60 TFLOPS rather than 6 TFLOPS; my 6700xt reaches 12 TFLOPS here. The gemm_ex result is also not satisfactory, although it is better than gemm, which seems to be a good sign (the AI instructions are in use); my 6700xt gets about 4.4 TFLOPS here. I believe the optimization is incomplete.
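The expectation above is simple arithmetic: a GEMM performs 2·m·n·k floating-point operations (one multiply and one add per inner-product step), so the achieved rate follows directly from a rocblas-bench timing. A small helper illustrates this (plain shell with awk, not part of rocBLAS; the 22906 µs figure is an illustrative timing, not taken from the linked logs):

```shell
# Achieved TFLOPS for an m x n x k GEMM, given wall-clock time in microseconds.
gemm_tflops() {
    awk -v m="$1" -v n="$2" -v k="$3" -v us="$4" \
        'BEGIN { printf "%.1f\n", 2*m*n*k / (us * 1e-6) / 1e12 }'
}

# A 4096^3 GEMM is ~137.4 GFLOP of work; finishing it in ~22.9 ms
# corresponds to ~6 TFLOPS, far below a ~60 TFLOPS FP32 peak.
gemm_tflops 4096 4096 4096 22906   # → 6.0
```

rocblas-bench reports this figure itself in its output columns; the helper is just for sanity-checking numbers by hand.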
http://0x0.st/HrGS.txt
(In reply to Igor Ulyanov from comment #44)
> http://0x0.st/HrGS.txt

Some tests perform pretty well, while others are pretty bad. It seems that the 7900XTX is excellent at benchmarks with half-precision float (BF16, FP16) and int (I8), for both gemm and mixed-precision gemm. However, FP32 performance is poor. Maybe you can open an issue with rocBLAS upstream to report this.
(In reply to Yiyang Wu from comment #45)
> (In reply to Igor Ulyanov from comment #44)
> > http://0x0.st/HrGS.txt
>
> There are tests that perform are pretty good, while some are pretty bad. It
> seems that 7900XTX is excellent at benchmarks with half precision float
> (BF16, FP16) and int (I8), for both gemm and mixed-precision gemm. However
> FP32 performance is not well. Maybe you can open an issue to rocBLAS
> upstream to report this

The git log for rocBLAS navi31 GEMM:

```
git log --oneline -- library/src/blas3/Tensile/Logic/asm_full/navi31 "library/src/blas3/Tensile/Logic/asm_full/navi31_*.yaml"

23baa554 Updating Code Object versions in lib logics (#1531)
b6b92074 add gfx1100 logic yaml for wmma type
7972a13c asm_full directory reorganization (#1369)
df665dec support Tensile GEMM functionality on gfx11 platforms
```

So GEMM on navi3* is barely optimized: there is only basic support plus wmma support (gemm_ex, where the AI instructions accelerate things). Let's see when full optimization lands.

You can also optimize GEMM for rocBLAS yourself using Tensile (https://wiki.gentoo.org/wiki/Tensile). However, you need a benchmark config file for the 7900XTX. I guess you can reuse the navi21 config files (https://github.com/ROCmSoftwarePlatform/Tensile/tree/develop/Tensile/Configs/navi21) as a starting point.
(In reply to Yiyang Wu from comment #46)
> (In reply to Yiyang Wu from comment #45)
> > (In reply to Igor Ulyanov from comment #44)
> > > http://0x0.st/HrGS.txt
> >
> > There are tests that perform are pretty good, while some are pretty bad. It
> > seems that 7900XTX is excellent at benchmarks with half precision float
> > (BF16, FP16) and int (I8), for both gemm and mixed-precision gemm. However
> > FP32 performance is not well. Maybe you can open an issue to rocBLAS
> > upstream to report this
>
> The git log for rocBLAS navi31 GEMM:
>
> ```
> git log --oneline -- library/src/blas3/Tensile/Logic/asm_full/navi31
> "library/src/blas3/Tensile/Logic/asm_full/navi31_*.yaml"
>
> 23baa554 Updating Code Object versions in lib logics (#1531)
> b6b92074 add gfx1100 logic yaml for wmma type
> 7972a13c asm_full directory reorganization (#1369)
> df665dec support Tensile GEMM functionality on gfx11 platforms
> ```
>
> So the GEMM of navi3* is not optimized at all. There's basic support, and
> wmma support (gemm_ex, where AI instructions accelerates). Let's see when
> there's full optimization.
>
> You can also optimize GEMM for rocBLAS yourself using Tensile
> (https://wiki.gentoo.org/wiki/Tensile). However you need a benchmark config
> file for 7900XTX. I guess you can reuse the navi21 config files
> (https://github.com/ROCmSoftwarePlatform/Tensile/tree/develop/Tensile/Configs/navi21)
> as a basic setup.

Suppose AMD should hire us as developers and testers in this case :)

Are you sure it is worth the effort, since we could just wait some time for at least the 5.5 release (maybe even for the 6.0 release)? For now, I am trying to compile onnxruntime for the new transformer LC0 nets.

Also, thanks a lot for your great pytorch work, although I personally work with tensorflow. Tensorflow and pytorch support would be nice, since 24 GB of memory allows fitting some advanced models while using AMD's free open-source drivers.