New AMD GPUs are now available with some initial software support: https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/CMakeLists.txt already lists the new amdgpu_targets gfx1100, gfx1101 and gfx1102. rocm.eclass should prepare these new targets for proper ebuild support later on.

Reproducible: Always
I think it should be added to 5) unofficial_amdgpu_targets, since these GPUs are consumer-class GPUs and official support is unlikely.
Thank you Igor! Have you tested one of the RDNA 3 cards? I don't have one at hand, unfortunately.
There is a long way to go to really test it in compute tasks. We need working ROCm ebuilds for it, or at least a working docker image. Some work has been done for the previous generation of cards (the ebuilds are still at ROCm 5.3.x, while a 5.4.x release and a 5.5 git branch already exist). As for graphics tests, everything is all right: the initial mesa and kernel support is excellent.
To be clear, some reports exist of HIP working in 5.4.2 (this still needs to be verified). Full gfx1100 family support is expected for the 5.5 release.
I have started to implement/merge some 9999(git)/5.4.2 ebuilds. Currently I don't fully understand the Tensile dependency in the rocBLAS ebuild, with some sources used from littlewu2508's archives.
(In reply to Igor Ulyanov from comment #5)
> Currently i don't fully understand Tensile dependency in rocBLAS ebuild with
> some sources used from littlewu2508 archives.

Tensile is a code generator for GPU kernels. rocBLAS uses Tensile to generate several variants of the standard BLAS functions and profiles their performance before selecting one.
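The generate-and-profile idea can be pictured with a toy sketch (plain shell, purely illustrative; `pick_fastest` is a hypothetical helper, not Tensile's actual interface — Tensile does this for many generated GPU kernel variants per BLAS function):

```shell
# Time each candidate command and print the fastest one, the way a
# kernel autotuner benchmarks variants before selecting a winner.
pick_fastest() {
    local best="" best_ns=9999999999
    local cmd start ns
    for cmd in "$@"; do
        start=$(date +%s%N)
        eval "$cmd" > /dev/null
        ns=$(( $(date +%s%N) - start ))
        if [ "$ns" -lt "$best_ns" ]; then
            best_ns=$ns
            best=$cmd
        fi
    done
    echo "$best"
}

# Two "variants" of the same work with different runtimes:
pick_fastest "sleep 0.01" "sleep 0.2"   # → sleep 0.01
```

rocBLAS consumes the winning kernels via so-called logic files, which is why the Tensile dependency appears at build time.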
I'm hesitating on turning on RDNA3 cards because:

1. The ROCm-5.4 toolchain is quite incompatible with both llvm/clang-15 and llvm/clang-16.
2. Although the ROCm-5.3 toolchain already contains RDNA3 support and may build ROCm 5.4.x libraries (like rocBLAS-5.4.2), there can still be issues, and I need testing.
3. I don't have any RDNA3 cards for testing purposes.
4. The RDNA3 support matrix is not yet complete; MIOpen for rocm-5.4.2 lacks RDNA3 support.

So I'm actually waiting for the ROCm-5.5 release. Now that the ROCm-llvm development branch has shifted to an llvm-16 base, I hope compatibility improves. If I have time I'll try building rocBLAS-5.4.2 with hip-5.3.3, but I shall mask them until tested on hardware. It would be great if you have hardware for testing.
(In reply to Igor Ulyanov from comment #5) > I have started to implement/merge some 9999(git)/5.4.2 ebuilds. > > Currently i don't fully understand Tensile dependency in rocBLAS ebuild with > some sources used from littlewu2508 archives. Have you achieved any progress? You can find my Tensile-5.4.2 ebuilds at https://github.com/littlewu2508/gentoo/tree/rocm-scilibs-5.3.3/dev-util/Tensile I haven't tested it fully.
Changed "${FILESDIR}"/${PN}-5.1.3-gentoopath.patch to "${FILESDIR}"/${PN}-5.4.2-gentoopath.patch in the Tensile-5.4.2 ebuild to build Tensile.

The rocBLAS merge fails with:

-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100 - Success
/usr/bin/python3.11 -m venv /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv --system-site-packages --clear
virtualenv python version: /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3
Python 3.11.1
/home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3 -m pip install /usr/share/Tensile
ERROR: Directory '/usr/share/Tensile' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
WARNING: There was an error checking the latest version of pip.

It seems the ebuild utility tries to use python3.11 while eselect python is set to 3.10
Created attachment 849195 [details] rocBLAS-5.4.2 build log
The same result with python3.10:

/usr/bin/python3.10 -m venv /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv --system-site-packages --clear
virtualenv python version: /home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3
Python 3.10.9
/home/notmpfs/portage/sci-libs/rocBLAS-5.4.2/work/rocBLAS-rocm-5.4.2_build/virtualenv/bin/python3 -m pip install /usr/share/Tensile
ERROR: Directory '/usr/share/Tensile' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
WARNING: There was an error checking the latest version of pip.
Sorry, it seems I had forgotten to prepare the new patches. 5.4.2-unbundle-Tensile.patch lets the build start.
I can confirm rocBLAS indeed successfully merges with dev-util/hip-5.3.3 and fails to merge with 5.4.2.
(In reply to Igor Ulyanov from comment #13) > i can confirm rocBLAS indeed successfully merges with dev-util/hip-5.3.3 and > fails to merge with 5.4.2 Sounds great! Do you mean sci-libs/rocBLAS-5.4.2 can be built with dev-util/hip-5.3.3 but not dev-util/hip-5.4.2 ?
(In reply to Igor Ulyanov from comment #1) > I think it should be added to 5) unofficial_amdgpu_targets since this GPUs > are cunsumer class GPUs and official support is unlikely. Can you access RX 7900XT/XTX? It would be great if you can help testing math libraries.
(In reply to Yiyang Wu from comment #14)
> (In reply to Igor Ulyanov from comment #13)
> > i can confirm rocBLAS indeed successfully merges with dev-util/hip-5.3.3 and
> > fails to merge with 5.4.2
>
> Sounds great! Do you mean sci-libs/rocBLAS-5.4.2 can be built with
> dev-util/hip-5.3.3 but not dev-util/hip-5.4.2 ?

Exactly. There are some failures with clang-15 with dev-util/hip-5.4.2.

I have built sci-libs/rocSOLVER-5.4.2 and sci-libs/hipBLAS-5.4.2.

The first test of the libraries with AMD HIP fails with:

Creating backend [hip]...
lc0: /var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&, size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
(In reply to Igor Ulyanov from comment #16)
> Exactly. Some failure with clang-15 with dev-util/hip-5.4.2
>
> I have built sci-libs/rocSOLVER-5.4.2 and sci-libs/hipBLAS-5.4.2
>
> first test libraries with amd hip fails with:
>
> Creating backend [hip]...
> lc0: /var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void
> rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&,
> size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation
> failed"' failed.

Seems like a roct-thunk-interface/rocr-runtime version mismatch. Or maybe it's because you are using hip-5.4 on top of rocr-5.3?
(In reply to Yiyang Wu from comment #17)
> (In reply to Igor Ulyanov from comment #16)
> > Exactly. Some failure with clang-15 with dev-util/hip-5.4.2
> >
> > I have built sci-libs/rocSOLVER-5.4.2 and sci-libs/hipBLAS-5.4.2
> >
> > first test libraries with amd hip fails with:
> >
> > Creating backend [hip]...
> > lc0: /var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void
> > rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&,
> > size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation
> > failed"' failed.
>
> Seems like roct-thunk-interface/rocr-runtime version mismatch. Or maybe
> because you are using hip-5.4 upon rocr-5.3?

I have also built dev-libs/roct-thunk-interface-5.4.2 and dev-libs/rocr-runtime-5.4.2 with the following patch.
Created attachment 849293 [details, diff] rocm device libs 5.4.2 patch
dev-libs/rocm-opencl-runtime-5.3.3 with clinfo reports the same error:

clinfo
clinfo: /var/tmp/portage/dev-libs/rocr-runtime-5.4.2/work/ROCR-Runtime-rocm-5.4.2/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&, size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
The same with dev-libs/rocm-opencl-runtime-5.4.2.

It seems there is a problem with rocr-runtime itself or its dependencies (both 5.3.3 and 5.4.2).

There was some hope for AMD HIP working without rocr-runtime, but it seems not to be true (or at least not for lc0; I will probably try another AMD HIP test if I find one).

I could also try rocr-runtime-9999 with the 5.5 branch, but there is no 9999 ebuild available currently.
(In reply to Igor Ulyanov from comment #21)
> The same with dev-libs/rocm-opencl-runtime-5.4.2
>
> It seems there is a problem with rocr-runtime itself or its dependencies
> (both 5.3.3 and 5.4.2).
>
> There were some hope for amd hip working without rocr-runtime, but it seems
> not to be true (or not for lc0, i will probably try some another amd hip
> test if i find one).
>
> I could also try rocr-runtime-9999 with 5.5 branch but there is no 9999
> ebuild available currently.

I already got roct-thunk-interface-9999, rocr-runtime-9999, rocm-comgr-9999 and hip-9999 working well on my gfx1031 GPU (using llvm-16), nearly passing all tests.

rocr-runtime is a core component; I do not believe any ROCm-related computation workload can bypass libhsa-runtime64.so.

BTW, roct-thunk-interface-9999 and rocr-runtime-9999 are actually no different from the 5.4.2 versions.
(In reply to Igor Ulyanov from comment #19)
> Created attachment 849293 [details, diff]
> rocm device libs 5.4.2 patch

Fedora has a similar patch for rocm-device-libs-5.4, just for llvm-15 compatibility, but it surely sacrifices some RDNA3 features. I prefer using llvm-16 with a different set of patches. My 9999 ebuilds are located at https://github.com/littlewu2508/gentoo/tree/rocm-9999
Still no further progress with llvm-15. Need some more time to test with llvm-16.
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=88729b304dbfe1e9bf1d5384bf125704e28ff956

commit 88729b304dbfe1e9bf1d5384bf125704e28ff956
Author:     Yiyang Wu <xgreenlandforwyy@gmail.com>
AuthorDate: 2023-01-28 10:40:53 +0000
Commit:     Benda Xu <heroxbd@gentoo.org>
CommitDate: 2023-01-31 14:28:52 +0000

    rocm.eclass: support RDNA3 GPU for >=5.4, remove <5

    ROCm libraries with version <5 are cleaned up, remove version 4
    support for rocm.eclass.

    RDNA3 has initial support in ROCm libraries starting from 5.4
    releases. Enable gfx110* amdgpu_targets in rocm.eclass and add
    corresponding description.

    Closes: https://bugs.gentoo.org/891499
    Closes: https://github.com/gentoo/gentoo/pull/29320
    Signed-off-by: Yiyang Wu <xgreenlandforwyy@gmail.com>
    Signed-off-by: Benda Xu <heroxbd@gentoo.org>

 eclass/rocm.eclass                | 13 +++++++------
 profiles/desc/amdgpu_targets.desc |  5 ++++-
 2 files changed, 11 insertions(+), 7 deletions(-)
Some compute tests work with the HSA_OVERRIDE_GFX_VERSION=10.3.0 environment variable set, using llvm-16 and the 9999 ebuilds.
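For anyone reproducing this: the override is just an environment variable read by the ROCr runtime, so a minimal session might look like the sketch below (the rocminfo check is left commented out since it needs a working ROCm stack on an AMD GPU; whether the gfx1030 code objects are actually safe to run on a gfx1100 chip is not guaranteed):

```shell
# Present the GPU to the HSA runtime as gfx1030 (RDNA2) so libraries
# that ship no gfx1100 code objects can still find kernels to dispatch.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
echo "override: ${HSA_OVERRIDE_GFX_VERSION}"
# rocminfo | grep -m1 gfx    # verify what the runtime now reports
```

This is a per-process workaround, not a fix; native gfx1100 support in the libraries is still the goal.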
Please also note the discussion in https://github.com/RadeonOpenCompute/ROCm/issues/1880

-DCMAKE_CXX_FLAGS_RELEASE="${CXXFLAGS} -DNDEBUG" should be set in the rocr-runtime ebuild.
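A sketch of how that flag might be injected, assuming the ebuild uses cmake.eclass (hypothetical src_configure fragment, not the actual ebuild; -DNDEBUG compiles out the assert() that fires in amd_gpu_agent.cpp):

```shell
# Hypothetical src_configure fragment for the rocr-runtime ebuild.
# Without -DNDEBUG the Release build keeps assert() enabled, which is
# what aborts in AssembleShader on unsupported GPUs.
src_configure() {
    local mycmakeargs=(
        -DCMAKE_CXX_FLAGS_RELEASE="${CXXFLAGS} -DNDEBUG"
        -DCMAKE_BUILD_TYPE=Release
    )
    cmake_src_configure
}
```

Masking asserts only hides the symptom, of course; the underlying shader allocation failure would still need an upstream fix.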
(In reply to Igor Ulyanov from comment #26)
> Some compute tests works with HSA_OVERRIDE_GFX_VERSION=10.3.0 environment
> variable set. Used llvm-16 and 9999 ebuilds.

Which GPU are you using? If it's a 7900XT/7900XTX, I think you won't need to set HSA_OVERRIDE_GFX_VERSION=10.3.0 for the 9999 ebuilds.

Also, I'm quite curious about the rocBLAS performance and correctness on RDNA3. You can turn on the "benchmark" USE flag to install rocblas-bench.
(In reply to Igor Ulyanov from comment #26) > Some compute tests works with HSA_OVERRIDE_GFX_VERSION=10.3.0 environment > variable set. Used llvm-16 and 9999 ebuilds. I pushed rocBLAS and miopen 9999 ebuilds to https://github.com/littlewu2508/gentoo/tree/rocm-9999
src_unpack in rocBLAS-9999 uses a fixed tar.gz name; it should use 5.5.0 for 9999.
Run Build Command(s): /usr/bin/ninja cmTC_1d97f &&
[1/2] Building CXX object CMakeFiles/cmTC_1d97f.dir/testCXXCompiler.cxx.o
FAILED: CMakeFiles/cmTC_1d97f.dir/testCXXCompiler.cxx.o
/usr/bin/hipcc -O2 -pipe -march=native -o CMakeFiles/cmTC_1d97f.dir/testCXXCompiler.cxx.o -c /home/notmpfs/portage/sci-libs/rocBLAS-9999/work/rocBLAS-9999_build/CMakeFiles/CMakeScratch/TryCompile-GcJ0wb/testCXXCompiler.cxx
fatal error: cannot open file '/usr/lib/amdgcn/bitcode/ocml.bc': Unknown attribute kind (86) (Producer: 'LLVM16.0.0git6dc85bd3' Reader: 'LLVM 15.0.7')
1 error generated when compiling for gfx1100.
ninja: build stopped: subcommand failed.
equery b /usr/lib/amdgcn/bitcode/ocml.bc
dev-libs/rocm-device-libs-9999 (/usr/lib/amdgcn/bitcode/ocml.bc)

I have dev-libs/rocm-device-libs merged with llvm-16
To be clear, the tests that worked were clinfo and rocminfo. Other, more complicated software fails with segfaults, incorrect results or exceptions, for example:

Unhandled exception: clCreateCommandQueue

https://bpa.st/I3ZLM
(In reply to Igor Ulyanov from comment #33)
> To be clear, some tests worked was clinfo and rocminfo. Other more
> complicated software fails with segfaults, improper results or exceptions,
> for example:
>
> Unhandled exception: clCreateCommandQueue
>
> https://bpa.st/I3ZLM

I guess there's still immature stuff for RDNA3. Have you tried emerging rocm-opencl-runtime with src_test turned on? Maybe you can detect more bugs there and report them to upstream.
After installing some new packages I get this failure:

rocblas-bench: CommandLine Error: Option 'use-dbg-addr' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Currently I have some problems emerging clang:16 rc2, so I need some more time to fix and build packages. I have even tried to recover clang:16 from your docker image with quickpkg, but it generates an xpak file instead of a tbz2, so no progress here either.
The problem with clang:16 is that the amdgpu-arch tool in clang links statically against rocr-runtime. clang:16 rc2 compiles fine if rocr-runtime is not installed.
(In reply to Igor Ulyanov from comment #36)
> The problem with clang:16 is a static linking amdgpu-arch tool in clang with
> rocr-runtime. clang:16 rc2 compiles fine if rocr-runtime is not installed.

You're right. The llvm team is already discussing this: https://github.com/llvm/llvm-project/issues/60660#issuecomment-1425780289
After two days of hacking I have completed most of the rocm-9999 ebuilds, as well as the ROCm enablement of pytorch (caffe2). The rocm-9999 ebuilds are currently placed at https://github.com/littlewu2508/gentoo/tree/rocm-9999, together with rocm-enabled caffe2 and torchvision (copied from ::science). I also pushed my docker image for testing to https://hub.docker.com/r/littlewu2508/clang-16_rc2-pytorch-1.12-rocm-9999-5.5-gfx1100; anyone can try it with `docker run -d --network=host --device=/dev/kfd --device=/dev/dri --group-add video littlewu2508/clang-16_rc2-pytorch-1.12-rocm-9999-5.5-gfx1100`
lc0 OpenCL working test: http://0x0.st/Hr_W.txt
lc0 hip working test: http://0x0.st/Hr_v.txt
rocblas-test also works as expected
(In reply to Igor Ulyanov from comment #40)
> rocblas-test also works as expected

That sounds exciting. Have you tried running some rocblas-bench cases, such as

rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --ldb 4096 --beta 0 --ldc 4096

to see the FP32 performance, and

rocblas-bench -f gemm_ex --a_type f16_r --b_type f16_r --c_type f32_r --d_type f32_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --ldb 4096 --beta 0 --ldc 4096 --ldd 4096

to see the performance of the AI instructions in RDNA3? More benchmarks can be found at https://rocblas.readthedocs.io/en/rocm-5.4.3/Programmers_Guide.html#rocblas-bench
http://0x0.st/Hr46.txt

What additional tests are you interested in?
(In reply to Igor Ulyanov from comment #42)
> http://0x0.st/Hr46.txt
>
> what additional tests are you interested?

Thanks! It would be great if you could run more of the benchmarks in the two code blocks "Non-HPA cases (gemm)" and "HPA cases (gemm_ex)" below the table at https://rocblas.readthedocs.io/en/rocm-5.4.3/Programmers_Guide.html#id10

The result is somewhat unexpected to me. The gemm performance looks bad: a gemm of this size should usually hit the theoretical FP32 performance, which ought to be 60 TFLOPS rather than 6 TFLOPS; my 6700xt reaches 12 TFLOPS here. The gemm_ex result is also not satisfactory, although it is better than gemm, which seems to be a good sign (the AI instructions are in use); my 6700xt gets about 4.4 TFLOPS here. I believe the optimization is incomplete.
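The expectation above is simple arithmetic: a GEMM performs 2·m·n·k floating-point operations (one multiply and one add per inner-product step), so the achieved rate follows directly from a rocblas-bench timing. A small helper illustrates this (plain shell with awk, not part of rocBLAS; the 22906 µs figure is an illustrative timing, not taken from the linked logs):

```shell
# Achieved TFLOPS for an m x n x k GEMM, given wall-clock time in microseconds.
gemm_tflops() {
    awk -v m="$1" -v n="$2" -v k="$3" -v us="$4" \
        'BEGIN { printf "%.1f\n", 2*m*n*k / (us * 1e-6) / 1e12 }'
}

# A 4096^3 GEMM is ~137.4 GFLOP of work; finishing it in ~22.9 ms
# corresponds to ~6 TFLOPS, far below a ~60 TFLOPS FP32 peak.
gemm_tflops 4096 4096 4096 22906   # → 6.0
```

rocblas-bench reports this figure itself in its output columns; the helper is just for sanity-checking numbers by hand.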
http://0x0.st/HrGS.txt
(In reply to Igor Ulyanov from comment #44)
> http://0x0.st/HrGS.txt

Some tests perform pretty well, while others are pretty bad. It seems that the 7900XTX is excellent at benchmarks with half-precision float (BF16, FP16) and int (I8), for both gemm and mixed-precision gemm. However, FP32 performance is poor. Maybe you can open an issue with rocBLAS upstream to report this.
(In reply to Yiyang Wu from comment #45)
> (In reply to Igor Ulyanov from comment #44)
> > http://0x0.st/HrGS.txt
>
> There are tests that perform are pretty good, while some are pretty bad. It
> seems that 7900XTX is excellent at benchmarks with half precision float
> (BF16, FP16) and int (I8), for both gemm and mixed-precision gemm. However
> FP32 performance is not well. Maybe you can open an issue to rocBLAS
> upstream to report this

The git log for rocBLAS navi31 GEMM:

```
git log --oneline -- library/src/blas3/Tensile/Logic/asm_full/navi31 "library/src/blas3/Tensile/Logic/asm_full/navi31_*.yaml"

23baa554 Updating Code Object versions in lib logics (#1531)
b6b92074 add gfx1100 logic yaml for wmma type
7972a13c asm_full directory reorganization (#1369)
df665dec support Tensile GEMM functionality on gfx11 platforms
```

So GEMM on navi3* is barely optimized: there is only basic support plus wmma support (gemm_ex, where the AI instructions accelerate things). Let's see when full optimization lands.

You can also optimize GEMM for rocBLAS yourself using Tensile (https://wiki.gentoo.org/wiki/Tensile). However, you need a benchmark config file for the 7900XTX. I guess you can reuse the navi21 config files (https://github.com/ROCmSoftwarePlatform/Tensile/tree/develop/Tensile/Configs/navi21) as a starting point.
(In reply to Yiyang Wu from comment #46)
> (In reply to Yiyang Wu from comment #45)
> > (In reply to Igor Ulyanov from comment #44)
> > > http://0x0.st/HrGS.txt
> >
> > There are tests that perform are pretty good, while some are pretty bad. It
> > seems that 7900XTX is excellent at benchmarks with half precision float
> > (BF16, FP16) and int (I8), for both gemm and mixed-precision gemm. However
> > FP32 performance is not well. Maybe you can open an issue to rocBLAS
> > upstream to report this
>
> The git log for rocBLAS navi31 GEMM:
>
> ```
> git log --oneline -- library/src/blas3/Tensile/Logic/asm_full/navi31
> "library/src/blas3/Tensile/Logic/asm_full/navi31_*.yaml"
>
> 23baa554 Updating Code Object versions in lib logics (#1531)
> b6b92074 add gfx1100 logic yaml for wmma type
> 7972a13c asm_full directory reorganization (#1369)
> df665dec support Tensile GEMM functionality on gfx11 platforms
> ```
>
> So the GEMM of navi3* is not optimized at all. There's basic support, and
> wmma support (gemm_ex, where AI instructions accelerates). Let's see when
> there's full optimization.
>
> You can also optimize GEMM for rocBLAS yourself using Tensile
> (https://wiki.gentoo.org/wiki/Tensile). However you need a benchmark config
> file for 7900XTX. I guess you can reuse the navi21 config files
> (https://github.com/ROCmSoftwarePlatform/Tensile/tree/develop/Tensile/Configs/navi21)
> as a basic setup.

Suppose AMD should hire us as developers and testers in this case :)

Are you sure it is worth the effort, since we could just wait some time for at least the 5.5 release (maybe even for the 6.0 release)? For now, I am trying to compile onnxruntime for the new transformer LC0 nets.

Also, thanks a lot for your great pytorch work, although I personally work with tensorflow. Tensorflow and pytorch support would be nice, since 24 GB of memory allows fitting some advanced models while using AMD's free open-source drivers.