693200 – media-gfx/blender should use llvm.eclass / add rocm use flag

Status:	UNCONFIRMED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal enhancement
Assignee:	Paul Zander

URL:
Whiteboard:
Keywords:	PullRequest

Depends on:	851702

Reported:	2019-08-31 13:20 UTC by Luke A. Guest
Modified:	2024-04-21 12:51 UTC
CC List:	15 users

See Also:	https://github.com/gentoo/gentoo/pull/35973
Package list:
Runtime testing required:	---

Attachments
Patch (diff between blender-3.1.2.ebuild and 3.2.0.ebuild) enabling rocm on blender (blender-rocm.patch, 2.49 KB, patch) 2022-06-15 12:28 UTC, Yiyang Wu

Description Luke A. Guest 2019-08-31 13:20:37 UTC

Currently the blender ebuilds just pick up the latest installed LLVM, when it should be possible to select a specific LLVM.

Add a rocm USE flag so that blender can be built against llvm-roc / llvm:roc.

Comment 1 Adrian 2019-09-14 12:52:02 UTC

I do not have access to an AMD GPU and would be unable to develop this myself. However I would be happy to review a patch.

Comment 2 Luke A. Guest 2019-09-14 15:08:33 UTC

You don't need access to an AMD GPU to test it, you need access to a USE flag! This is an ebuild problem.

Comment 3 Adrian 2019-09-15 03:37:28 UTC

If a straight swap of sys-devel/llvm for sys-devel/llvm-roc is all that is required, it might be possible to add this feature by putting
llvm? ( rocm? ( sys-devel/llvm-roc:= ) !rocm? ( sys-devel/llvm:= ) ) in RDEPEND and rocm? ( llvm opencl ) in REQUIRED_USE
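
Spelled out as an ebuild fragment, the suggestion above might look like the following sketch. It only restates the dependency strings from this comment; the `IUSE` line is an assumption about which flags the blender ebuild already declares, and nothing here has been tested at runtime.

```shell
# Hypothetical fragment of a blender ebuild (not the in-tree ebuild).
# Assumption: "llvm" and "opencl" are existing USE flags; "rocm" is new.
IUSE="llvm opencl rocm"

# rocm is only meaningful when both llvm and opencl are enabled.
REQUIRED_USE="rocm? ( llvm opencl )"

# Swap the LLVM provider depending on the rocm flag.
RDEPEND="
	llvm? (
		rocm? ( sys-devel/llvm-roc:= )
		!rocm? ( sys-devel/llvm:= )
	)
"
```

This only changes which package is pulled in at build time; whether the resulting binary behaves at render time is a separate question.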

My concern is whether llvm-roc might be used during rendering to create the OpenCL kernel or to compile the OSL shaders. Looking at the GitHub page, there are also a lot of ROCm libraries, and I don't know whether some of these might be required as well.

Without hardware I can only test whether emerge blender is successful, not whether blender crashes during rendering. This should be developed and tested by someone with hardware so I can be sure it works prior to integration.

The main advantage of the llvm eclass seems to be the ability to limit the maximum version of llvm to use when several are installed, however blender can use all versions of llvm from the tree. It seems that it is not possible to specify use of llvm-roc using it yet, pending bug #693198.

Comment 4 Johannes Hirte 2019-12-23 18:48:37 UTC

(In reply to Adrian from comment #3)
> If a straight swap of sys-devel/llvm for sys-devel/llvm-roc is all that is
> required, it might be possible to add this feature by putting
> llvm? ( rocm? ( sys-devel/llvm-roc:= ) !rocm? ( sys-devel/llvm:= ) ) in
> RDEPEND and rocm? ( llvm opencl ) in REQUIRED_USE
>

Sadly no, blender doesn't work with rocm and amdgpu. When the cycles USE flag is enabled together with rocm-opencl, the system llvm and the rocm llvm get mixed. The system llvm is pulled in by mesa and interferes with the rocm llvm used for opencl. I've tried to get it to work, but I always get this error:

mesa: CommandLine Error: Option 'help-list' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

I'm afraid that as long as rocm is independent from upstream llvm, this won't really work.

Comment 5 Johannes Hirte 2019-12-27 10:26:55 UTC

llvm upstream bugreports:

https://bugs.llvm.org/show_bug.cgi?id=30587
https://bugs.llvm.org/show_bug.cgi?id=22952

Comment 6 Martin Rott 2021-09-01 18:33:58 UTC

I'd suggest waiting a bit until OpenCL and AMD somehow settle on something.
From my findings, I have one machine working with the OpenCL libs from the amdgpu-pro drivers, but sadly no luck on another machine. What I know for sure is that it does not work with rocm or amdgpu (or mesa?) OpenCL. So even if you had a usable USE flag, you would probably end up with non-working OpenCL.
I'm open to doing some testing or providing more details (I have both AMD and NVIDIA hardware and use Blender, from the cg overlay, on a daily basis).

Comment 7 Luke A. Guest 2021-09-01 20:02:53 UTC

(In reply to Martin Rott from comment #6)
> I'd suggest to wait a bit until opencl and AMD somehow settles on anything
> anywhere. 

I think it was pretty much decided not to add this flag. TBH, it should be possible to use any installed OpenCL, but I'm not sure how the ICD is supposed to work with CL.

> From my findings I have one machine working with opencl libs from amdgpu-pro
> drivers, sadly no luck on other machine - what I know for sure, it does not
> work with rocm or amdgpu(or mesa?) opencl. So even if you had a usable use
> flag, you'll probably end with not working opencl. 

Yeah, it's an absolute joke tbf. Once I'm in a better position, I intend to learn AMD's assembly language and hope to help with the Clover stuff to get a full OpenCL implementation, if it's not done by then :/ They've dropped my hw completely from ROCm, and since there is zero information, I've no idea whether it would be possible to retrofit the work onto it as an external developer.

> I'm open to some testing or providing more details.. (having both AMD and
> nvidia and using Blender(from cg overlay) on daily basis.

Comment 9 Yiyang Wu 2022-06-12 13:29:19 UTC

Hello, I have made some efforts on rocm support for blender-3.2.0, which is officially supported by ROCm on Linux (although only RDNA cards are supported).

My work is at https://github.com/littlewu2508/gentoo/tree/blender-rocm, which currently contains 3 commits: a simple version bump of media-gfx/openvdb, blender-3.2.0.ebuild with rocm enabled, and some nasty hacks to work around the multiple llvm instances caused by sys-devel/llvm-roc.

The compilation is smooth (calling hipcc to compile the Cycles kernels to fatbin binaries succeeds), but blender simply breaks at runtime when it tries to use HIP in Cycles.

Currently I'm blocked by

: CommandLine Error: Option 'use-dbg-addr' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

with SIGABRT. This is a common issue when multiple versions of llvm are mixed. I unmerged all llvm and clang packages so that only sys-devel/llvm-roc remained for the HIP compiler. The common registered-more-than-once errors then went away, but one remained:

CommandLine Error: Option 'limited-coverage-experimental' registered more than once!

By searching, I found that this string is contained in both /usr/lib/llvm/roc/lib/libclang-cpp.so.14roc and /usr/lib/llvm/roc/lib/libclangCodeGen.so.14roc, so maybe that is why the conflict exists. I brutally removed libclangCodeGen.so.14roc, and the error went away but was replaced by an invalid pointer bug.
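
The search described above can be reproduced with plain grep. This is only a minimal sketch: the helper name `libs_containing` is made up for illustration, and the directory in the usage note is the one mentioned in this comment.

```shell
# Hypothetical helper: list the shared objects in a directory that contain
# a given fixed string (for example an LLVM command-line option name).
#   $1 = string to look for, $2 = directory holding the *.so* files
libs_containing() {
    # -l prints only the names of matching files; -F disables regex matching
    grep -l -F -- "$1" "$2"/*.so* 2>/dev/null
}
```

For example, `libs_containing limited-coverage-experimental /usr/lib/llvm/roc/lib` would list every library in that directory that embeds the option name. More than one hit suggests the option may be registered twice if both libraries are loaded into the same process.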

So the situation is: compilation of blender with rocm seems OK, but sys-devel/llvm-roc brings in another llvm that does not follow the standard Gentoo llvm slotting rules, which breaks a lot.

**Conclusion: a little progress on packaging blender-3.2 with rocm, compilation seems OK; sys-devel/llvm-roc needs fixes to get things working. Stay tuned.**

Comment 10 Sebastian Parborg 2022-06-13 10:45:13 UTC

You probably need to statically link the special rocm llvm version into the HIP runtime.

Otherwise it will crash when any program uses the system wide llvm version.

Comment 11 Yiyang Wu 2022-06-13 12:04:46 UTC

(In reply to Sebastian Parborg from comment #10)
> You probably need to statically link in the special rocm llvm version to the
> HIP runtime.
> 
> Otherwise it will crash when any program uses the system wide llvm version.

Does that mean that if a program is linked to llvm:n, then all its dependencies have to link to llvm:n?

Comment 12 Sebastian Parborg 2022-06-13 12:57:04 UTC

(In reply to Wu Yiyang from comment #11)
> 
> Does that mean, if one is linked to llvm:n, then all its dependencies has to
> link to llvm:n?

Yes.

If a program is dynamically linked to llvm version X and a library that the program uses is dynamically linked to llvm version Y, it will crash because of a namespace collision. (The functions and namespaces are the same between llvm versions, so the program will not know which dynamic library it should call.)

I've run into this issue in the past when the Mesa drivers and some of Blender's dependencies were built with different llvm versions.
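
One way to check for this situation is to look at which libLLVM shared objects end up mapped into a running process. The following is a rough sketch, assuming a Linux /proc filesystem; the helper name `llvm_libs_from_text` is hypothetical.

```shell
# Hypothetical helper: read /proc/<pid>/maps-style (or ldd-style) text on
# stdin and print each distinct libLLVM shared object path once.
llvm_libs_from_text() {
    # -o prints only the matched path; sort -u removes repeated mappings
    grep -oE '/[^ ]*libLLVM[^ /]*\.so[^ ]*' | sort -u
}
```

Feeding it `/proc/<pid>/maps` of the running blender process (with `<pid>` filled in by hand) and getting more than one line back would indicate that two different LLVM builds are loaded side by side.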

Comment 13 perestoronin 2022-06-13 19:13:13 UTC

I get the same error when trying to compile tensorflow with rocm support, as described in https://stackoverflow.com/questions/72510724/tensorflow-build-from-sources-with-frag-rocm-failed-with-error-tf-to-kernel-f

valgrind ./tf_to_kernel ...
==3134537== Memcheck, a memory error detector
==3134537== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3134537== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3134537== Command: ./tf_to_kernel ...
==3134537== 
: CommandLine Error: Option 'march' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
==3134537== 
==3134537== Process terminating with default action of signal 6 (SIGABRT): dumping core
==3134537==    at 0x6E2A9EC: __pthread_kill_implementation (in /lib64/libc.so.6)
==3134537==    by 0x6DDD7A1: raise (in /lib64/libc.so.6)
==3134537==    by 0x6DC81E8: abort (in /lib64/libc.so.6)
==3134537==    by 0x5CBFD05: llvm::report_fatal_error(llvm::Twine const&, bool) (in /var/tmp/portage/sci-libs/tensorflow-2.9.1-r3/work/tensorflow-2.9.1-bazel-base/execroot/org_tensorflow/bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/libtensorflow_framework.so.2.9.1)
==3134537==    by 0x5CBFE5A: llvm::report_fatal_error(char const*, bool) (in /var/tmp/portage/sci-libs/tensorflow-2.9.1-r3/work/tensorflow-2.9.1-bazel-base/execroot/org_tensorflow/bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/libtensorflow_framework.so.2.9.1)
==3134537==    by 0x5CA4A83: (anonymous namespace)::CommandLineParser::addOption(llvm::cl::Option*, llvm::cl::SubCommand*) (in /var/tmp/portage/sci-libs/tensorflow-2.9.1-r3/work/tensorflow-2.9.1-bazel-base/execroot/org_tensorflow/bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/libtensorflow_framework.so.2.9.1)
==3134537==    by 0x5CA4DA1: llvm::cl::Option::addArgument() (in /var/tmp/portage/sci-libs/tensorflow-2.9.1-r3/work/tensorflow-2.9.1-bazel-base/execroot/org_tensorflow/bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/libtensorflow_framework.so.2.9.1)
==3134537==    by 0x250ED2: llvm::codegen::RegisterCodeGenFlags::RegisterCodeGenFlags() (in /var/tmp/portage/sci-libs/tensorflow-2.9.1-r3/work/tensorflow-2.9.1-bazel-base/execroot/org_tensorflow/bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel)
==3134537==    by 0x6DC91D2: __libc_start_main@@GLIBC_2.34 (in /lib64/libc.so.6)
==3134537==    by 0x19A310: (below main) (in /var/tmp/portage/sci-libs/tensorflow-2.9.1-r3/work/tensorflow-2.9.1-bazel-base/execroot/org_tensorflow/bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel)
==3134537==

Comment 14 Yiyang Wu 2022-06-15 10:17:20 UTC

I packaged rocm-device-libs, rocm-comgr and hip version 5.1.3 against Gentoo's default llvm-14. Now blender can detect RDNA2 cards and render using HIP in Cycles, although the [blender-3.2 demo](https://cloud.blender.org/p/gallery/629f23f908e12d4ff15241d3) is not demanding to render, so I don't see much GPU usage.

Running `rocm-smi --showpids` gives:

```
======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
2214542 blender-3.2     0       0               0               0
================================================================================
============================= End of ROCm SMI Log ==============================
```

I'll try to clean up my changes and push the ROCm-5.1.3 components to Gentoo in the coming weeks. Switching to upstream llvm may introduce breaking changes for existing ROCm packages, so there is still a lot to do.

Comment 15 Yiyang Wu 2022-06-15 10:30:44 UTC

> Running `rocm-smi --showpids` gives:
> 
> ```
> ======================= ROCm System Management Interface =======================
> ================================ KFD Processes =================================
> KFD process information:
> PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
> 2214542 blender-3.2     0       0               0               0
> ================================================================================
> ============================= End of ROCm SMI Log ==============================
> ```

OK, that means it is not occupying GPU memory or running the HIP kernel. Maybe I was not rendering anything. When does blender use Cycles to render?

Comment 16 Yiyang Wu 2022-06-15 11:55:05 UTC

> OK that means It is not occupying GPU memories and run the hip kernel. Maybe
> I was not rendering anything -- When will blender use cycles to render?

Oh, I need to press F12 to render. Now the RX 6700 XT is running at full speed:

```
======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
2525317 blender-3.2     1       2141466624      0               0
================================================================================
============================= End of ROCm SMI Log ==============================
```
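
To watch this without reading the whole table each time, the process rows can be filtered out of the `rocm-smi --showpids` output. A minimal sketch, assuming the six-column layout shown above and a process name without spaces; `kfd_vram_by_process` is a hypothetical name.

```shell
# Hypothetical helper: read `rocm-smi --showpids` output on stdin and print
# "<process name> <VRAM used>" for every KFD process row.
# Assumes the column layout shown above; banner and header lines are skipped
# because their first field is not a numeric PID.
kfd_vram_by_process() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 6 { print $2, $4 }'
}
```

Used as `rocm-smi --showpids | kfd_vram_by_process`, a non-zero second column indicates that the process has allocated VRAM, as in the second listing above.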

Comment 17 Yiyang Wu 2022-06-15 12:26:14 UTC

On a Ryzen 5950X + Radeon RX 6700XT, I bumped blender to 3.2.0 and enabled HIP for Cycles. It built and successfully rendered the blender 3.2 demo and the 3.1 demo using HIP on the 6700XT.

Some benchmarks:

blender demo 3.1 (https://cloud.blender.org/p/gallery/6220ae43b4a486f53171c89e), rendering using Cycles:

| pure CPU | HIP 6700XT   | HIP 6700XT+5950X |
|----------|--------------|------------------|
| 3m16s    | 1m54s, 1m40s | 1m24s            |

The uncertainty may be large, but this clearly shows that blender-3.2 on Gentoo is capable of rendering with HIP in Cycles on RDNA2 cards.

I have uploaded the patch for the blender-3.2 ebuild, which enables HIP for Cycles.

Comment 18 Yiyang Wu 2022-06-15 12:28:44 UTC

Created attachment 785432
Patch (diff between blender-3.1.2.ebuild and 3.2.0.ebuild) enabling rocm on blender

Comment 19 Sebastian Parborg 2022-06-15 16:59:49 UTC

Awesome!

Did you have to change much to make rocm compile with the vanilla llvm release?
I thought that AMD had changed quite a bit in their llvm version, and last time I checked they didn't provide any "disable special functionality so upstream llvm can be used" flag.

Comment 20 Yiyang Wu 2022-06-16 00:32:34 UTC

(In reply to Sebastian Parborg from comment #19)
> Awesome!
> 
> Did you have to change much to make rocm compile with the vanilla llvm
> release?
> I thought that AMD had changed quite a bit in their llvm version and last
> time I checked they didn't provide any "disable special functionality so
> upstream llvm can be use" flag.

Speaking of hip, we don't have to change much; llvm/clang-14 just works out of the box (actually Debian has been shipping rocm this way beginning with clang-13)[1]. The patches are mainly for location issues, because AMD assumes all components are in /opt/rocm. We install them under /usr, which results in the '-isystem /usr/include' flag being passed early to clang, causing a wrong order of include dirs, which makes `#include_next <math.h>` fail.

I do observe failures in the test suites, though, which is common among all distributions packaging ROCm against upstream llvm[2]. Luckily I have not seen blender run into those problems.

You can find all my commits in https://github.com/littlewu2508/gentoo/tree/blender-rocm. First upgrade to clang-14.0.5-r1 (with a ROCm patch fixing include dir searches), and install/upgrade rocm-device-libs, rocm-comgr and hip to 5.1.3. Then emerge blender. I'll do some cleanup and more tests, then land the ROCm changes in ::gentoo in the following days.

Comment 21 Yiyang Wu 2022-06-16 00:37:08 UTC

And I think this may also be the ultimate solution to the earlier discussion. ROCm provides the OpenCL implementation, so with blender-2.x the mixing of llvm-roc and llvm also happens. Building ROCm against the system llvm will do the trick.

Comment 22 perestoronin 2022-06-25 04:13:38 UTC

I get an error when trying to compile sci-libs/rocFFT or sci-libs/rocRAND with dev-util/hip v5.1.3:

-- Configuring done
CMake Error in library/src/CMakeLists.txt:
  Imported target "hip::device" includes non-existent path

    "HIP_CLANG_INCLUDE_PATH-NOTFOUND/.."

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

  * The path was deleted, renamed, or moved to another location.

  * An install or uninstall procedure did not complete successfully.

  * The installation package was faulty and references files it does not
  provide.

How can I fix these errors?

Comment 23 Yiyang Wu 2022-06-25 04:23:28 UTC

(In reply to perestoronin from comment #22)
> I have got error while try compile sci-libs/rocFFT or sci-libs/rocRAND with
> dev-util/hip v5.1.3:
> 
> -- Configuring done
> CMake Error in library/src/CMakeLists.txt:
>   Imported target "hip::device" includes non-existent path
> 
>     "HIP_CLANG_INCLUDE_PATH-NOTFOUND/.."
> 
>   in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:
> 
>   * The path was deleted, renamed, or moved to another location.
> 
>   * An install or uninstall procedure did not complete successfully.
> 
>   * The installation package was faulty and references files it does not
>   provide.
> 
> How to fix this errors ?

Yes, I can reproduce that, too. I'm working on it.

This is caused by the .cmake files from dev-util/hip. Switching to vanilla clang means some directory changes compared to llvm-roc, so although I patched hipcc to work with vanilla clang, the cmake modules are not working properly.

Comment 24 Yiyang Wu 2022-06-26 09:43:50 UTC

(In reply to perestoronin from comment #22)
> I have got error while try compile sci-libs/rocFFT or sci-libs/rocRAND with
> dev-util/hip v5.1.3:
> ....
> How to fix this errors ?

Updates:

I pushed some new commits to https://github.com/littlewu2508/gentoo/tree/blender-rocm, which should fix the problem. Now rocBLAS compiles, and I suppose rocFFT and rocSPARSE do as well.

I also got rid of the patched clang (moved the hack to hip), so we don't have to depend on sys-devel/clang-14.0.5-r1.

As for blender, things work normally on RDNA2 cards. I backported https://developer.blender.org/D15242 to enable pre-RDNA devices, but blender aborted when I tried to render on a Radeon VII:

```
Memory access fault by GPU node-1 (Agent handle: 0x557fab130c90) on address 0x7f6e6ffff000. Reason: Page not present or supervisor privilege.
Nearby memory map:
0x7f6e70000000, 0xa306000, VRAM
0x7f6e8a000000, 0x960000, VRAM
0x7f6e8b000000, 0x960000, VRAM

PtrInfo:
        Address: 0x7f6e70000000-0x7f6e7a306000/0x7f6e70000000-0x7f6e7a306000
        Size: 0xa306000
        Type: 1
        Owner: 0x557fab130c90
        CanAccess: 1
                0x557fab130c90
        In block: 0x7f6e70000000, 0xa400000
PtrInfo:
        Address: 0x7f6e8a000000-0x7f6e8a960000/0x7f6e8a000000-0x7f6e8a960000
        Size: 0x960000
        Type: 1
        Owner: 0x557fab130c90
        CanAccess: 1
                0x557fab130c90
        In block: 0x7f6e8a000000, 0xa00000
PtrInfo:
        Address: 0x7f6e8b000000-0x7f6e8b960000/0x7f6e8b000000-0x7f6e8b960000
        Size: 0x960000
        Type: 1
        Owner: 0x557fab130c90
        CanAccess: 1
                0x557fab130c90
        In block: 0x7f6e8b000000, 0xa00000
blender-3.2: /fast/portage/dev-libs/rocr-runtime-5.1.3/work/ROCR-Runtime-rocm-5.1.3/src/core/runtime/runtime.cpp:1276: static bool rocr::core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false && "GPU memory access fault."' failed.
```

And https://developer.blender.org/D15242 says "This needs a newer HIP SDK", which I guess means a newer version of ROCm. Until then, blender-3.2 HIP Cycles only works on RDNA cards, so I reverted that backport.

Comment 25 perestoronin 2022-06-26 21:21:45 UTC

(In reply to Yiyang Wu from comment #24)
> (In reply to perestoronin from comment #22)
> > I have got error while try compile sci-libs/rocFFT or sci-libs/rocRAND with
> > dev-util/hip v5.1.3:
> > ....
> > How to fix this errors ?
> 
> Updates:
> 
> I pushed some new commits into
> https://github.com/littlewu2508/gentoo/tree/blender-rocm, which should fix
> the problem. Now rocBLAS compiles and I suppose rocFFT and rocSPARSE as well.

rocFFT compiles too, after this patch:

--- a/library/src/include/twiddles.h
+++ b/library/src/include/twiddles.h
@@ -14,6 +14,7 @@
 #include <numeric>
 #include <tuple>
 #include <vector>
+#include <stdexcept>
 
 static const size_t LTWD_BASE_DEFAULT       = 8;
 static const size_t LARGE_TWIDDLE_THRESHOLD = 4096;

> As for blender, things works normally on RDNA2 cards. I backported
> https://developer.blender.org/D15242 to enable pre-RDNA devices, but the
> blender aborted when I try to render on Radeon VII:
> 
> And https://developer.blender.org/D15242 says "This needs a newer HIP SDK",
> I guess maybe a new version of ROCm. So until then blender-3.2 HIP Cycles
> only works on RDNA cards. So I reverted that backport.

No, old cards are not supported by AMD in rocm, and on a Vega Frontier GPU blender also segfaults after attempting to use HIP in the Cycles addon. But I want to find someone who can fix the amdgpu kernel drivers to work fully with blender, rocm-smi and tensorflow: https://gist.github.com/raw/0c06a9a8a38770b2cf18000ec4d18462

Comment 26 perestoronin 2022-06-26 21:23:53 UTC

(In reply to perestoronin from comment #25)
> > As for blender, things works normally on RDNA2 cards. I backported
> > https://developer.blender.org/D15242 to enable pre-RDNA devices, but the
> > blender aborted when I try to render on Radeon VII:
> > 
> > And https://developer.blender.org/D15242 says "This needs a newer HIP SDK",
> > I guess maybe a new version of ROCm. So until then blender-3.2 HIP Cycles
> > only works on RDNA cards. So I reverted that backport.
> 
> No, old cards not supported by AMD in rocm, and on Vega Frontier GPU blender
> also segfault after attempted use HIP in cycles addon, but I want to find
> who can fix amdgpu kernel drivers to work fully with blender, rocm-smi,
> tensorflow https://gist.github.com/raw/0c06a9a8a38770b2cf18000ec4d18462

ERROR: 2 GPU[0]: % memory use: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
ERROR: 9 GPU[0]: od volt: The called function has not been implemented in this system for this device type
ERROR: 2 GPU[0]: ras: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
ERROR: 9 GPU[0]: od volt: The called function has not been implemented in this system for this device type
ERROR: 9 GPU[0]: od volt: The called function has not been implemented in this system for this device type
ERROR: 9 GPU[0]: od volt: The called function has not been implemented in this system for this device type
ERROR: 9 GPU[0]: od volt: The called function has not been implemented in this system for this device type
ERROR: 2 GPU[0]: % Energy Counter: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.

Comment 27 Luke A. Guest 2022-06-26 21:33:18 UTC

> No, old cards not supported by AMD in rocm, and on Vega Frontier GPU blender
> also segfault after attempted use HIP in cycles addon, but I want to find
> who can fix amdgpu kernel drivers to work fully with blender, rocm-smi,
> tensorflow https://gist.github.com/raw/0c06a9a8a38770b2cf18000ec4d18462


I don't think it's a kernel issue, it's a blob issue.

Comment 28 Luke A. Guest 2022-06-27 05:05:13 UTC

(In reply to Luke A. Guest from comment #27)
> > No, old cards not supported by AMD in rocm, and on Vega Frontier GPU blender
> > also segfault after attempted use HIP in cycles addon, but I want to find
> > who can fix amdgpu kernel drivers to work fully with blender, rocm-smi,
> > tensorflow https://gist.github.com/raw/0c06a9a8a38770b2cf18000ec4d18462
> 
> 
> I don't think it's a kernel issue, it's a blob issue.

You can find a reference to a hawaii_mec.bin.1a7 inside one of the many amd rocm issues lists, which I think explains it a bit more.

Comment 29 perestoronin 2022-06-28 04:58:01 UTC

(In reply to Yiyang Wu from comment #24)
> Updates:
> 
> I pushed some new commits into
> https://github.com/littlewu2508/gentoo/tree/blender-rocm, which should fix
> the problem. Now rocBLAS compiles and I suppose rocFFT and rocSPARSE as well.

I got a new error when trying to compile sci-libs/miopen v5.1.3:

CMake Error at CMakeLists.txt:309 (find_library):
  Could not find LIBMLIRMIOPEN using the following names: MLIRMIOpen

Can you fix this error?
 
> And https://developer.blender.org/D15242 says "This needs a newer HIP SDK",
> I guess maybe a new version of ROCm. So until then blender-3.2 HIP Cycles
> only works on RDNA cards. So I reverted that backport.

I asked https://github.com/sayakbiswas by email (sayak90@gmail.com) to share the "new HIP SDK" with me (perestoronin@gmail.com), but got no response.
If you have the "new HIP SDK", please share it with me.

Comment 30 Yiyang Wu 2022-06-28 05:17:50 UTC

(In reply to perestoronin from comment #29)
> I have got new error while try to compile sci-libs/miopen v5.1.3:
> 
> CMake Error at CMakeLists.txt:309 (find_library):
>   Could not find LIBMLIRMIOPEN using the following names: MLIRMIOpen
> 
> Can you fix this error ?

I'll have a look. But my main focus is to refine dev-util/hip-5.1.3 and land it in ::gentoo, and to bump the ROCm packages in sci-libs.

Also, it is not related to blender, so shall we discuss it in https://bugs.gentoo.org/851702?

> 
> I asked share with me (perestoronin@gmail.com) about "new HIP SDK" from
> https://github.com/sayakbiswas via email sayak90@gmail.com but not responded.
> If you have "new HIP SDK" please share it with me.

I don't have a personal relationship with him or the blender developers, either. I think the new HIP SDK means later releases of HIP. So I'll try to make the -9999 ebuilds work, and then we can keep up with the latest progress of HIP.

Comment 31 Yiyang Wu 2022-06-29 08:20:01 UTC

(In reply to Yiyang Wu from comment #24)

> Memory access fault by GPU node-1 (Agent handle: 0x557fab130c90) on address
> 0x7f6e6ffff000. Reason: Page not present or supervisor privilege.
> Nearby memory map:
> 0x7f6e70000000, 0xa306000, VRAM
> 0x7f6e8a000000, 0x960000, VRAM
> 0x7f6e8b000000, 0x960000, VRAM

With ROCm 5.2.0, released recently, I'm still getting this error.

> 
> And https://developer.blender.org/D15242 says "This needs a newer HIP SDK",
> I guess maybe a new version of ROCm. So until then blender-3.2 HIP Cycles
> only works on RDNA cards. So I reverted that backport.

According to https://developer.blender.org/rBabfa09752f5c4d1fa2ae9df5e4ee0c9d77b50f3e, the required hip version is 5.2.21440, while the newest hip release is 5.2.21151 (see https://repo.radeon.com/rocm/apt/5.2/pool/main/h/hip-runtime-amd/), so I suppose we have to wait for the next patch release ROCm 5.2.1
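
The comparison above can be written down as a small check. This is a sketch only: `hip_version_ge` is a hypothetical helper, and it assumes a version made of exactly three numeric, dot-separated fields, as in the two values quoted here.

```shell
# Hypothetical helper: succeed (exit 0) if version $1 >= version $2.
# Assumes both versions look like MAJOR.MINOR.PATCH with numeric fields.
hip_version_ge() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        split(have, h, "."); split(need, r, ".")
        # Compare field by field, most significant first.
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > r[i] + 0) exit 0
            if (h[i] + 0 < r[i] + 0) exit 1
        }
        exit 0   # all three fields are equal
    }'
}
```

With the numbers from this comment, `hip_version_ge 5.2.21151 5.2.21440` fails, which matches the observation that the released hip is older than what the blender commit asks for.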

Comment 32 Yiyang Wu 2022-06-29 09:00:00 UTC

(In reply to Yiyang Wu from comment #31)

> According to
> https://developer.blender.org/rBabfa09752f5c4d1fa2ae9df5e4ee0c9d77b50f3e,
> the required hip version is 5.2.21440, while the newest hip release is
> 5.2.21151 (see
> https://repo.radeon.com/rocm/apt/5.2/pool/main/h/hip-runtime-amd/), so I
> suppose we have to wait for the next patch release ROCm 5.2.1

I did a quick investigation into the hip version. The patch version strings (21151, 21440) are determined in bin/hipvars.pm, in the variable $HIP_BASE_VERSION_PATCH [1]. The version string stays the same within a minor release, so ROCm 5.2.x won't be the release that makes blender work on Vega devices.

There is not a single commit in HIP that introduces the patch version 21440.

We should wait for ROCm 5.3 and see.

[1] https://github.com/ROCm-Developer-Tools/HIP/blob/60b60f78e6b8ed3fb2e64388b5f27771a16673e8/bin/hipvars.pm#L30

Comment 33 perestoronin 2022-07-02 16:13:31 UTC

(In reply to Yiyang Wu from comment #30)
> (In reply to perestoronin from comment #29)
> > I have got new error while try to compile sci-libs/miopen v5.1.3:
> > 
> > CMake Error at CMakeLists.txt:309 (find_library):
> >   Could not find LIBMLIRMIOPEN using the following names: MLIRMIOpen
> > 
> > Can you fix this error ?
> 
> I'll have a look. But my main focus is to refine dev-util/hip-5.1.3 and land
> it to ::gentoo, and bump ROCm packages in sci-libs.
> 
> Also it is not related to blender, so shall we discuss in
> https://bugs.gentoo.org/851702?
> 
> > 
> > I asked share with me (perestoronin@gmail.com) about "new HIP SDK" from
> > https://github.com/sayakbiswas via email sayak90@gmail.com but not responded.
> > If you have "new HIP SDK" please share it with me.
> 
> I don't have personal releationship to him or blender developers, either. I
> think the new HIP SDK means the later releases of HIP. So I'll try making
> those -9999 ebuild work, and then we can keep up the latest progress of HIP.

Thanks! Now test_all.sh from https://github.com/ROCm-Developer-Tools/HIP-Examples.git passes successfully after adding the following lines at the top of test_all.sh:

export HIP_PATH="/usr"
export HIP_PLATFORM="amd"

And rocBLAS now compiles successfully. I tried upgrading the ebuild to rocm v5.2.0 and got:

-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Failed

How can I fix these warnings?

*------------------------------- ROCMChecks WARNING --------------------------*
  Options and properties should be set on a cmake target where possible. The
  variable 'CMAKE_CXX_FLAGS' may be set by the cmake toolchain, either by
  calling 'cmake -DCMAKE_CXX_FLAGS="-O2 -pipe -march=znver2 -Wno-unused-command-line-argument"'
  or set in a toolchain file and added with
  'cmake -DCMAKE_TOOLCHAIN_FILE=<toolchain-file>'

CMake Warning at /usr/share/rocm/cmake/ROCMChecks.cmake:46 (message):
  'CMAKE_CXX_FLAGS' is set at
  /var/tmp/portage/sci-libs/rocRAND-5.2.0/work/rocRAND-rocm-5.2.0/cmake/CMakeLists.txt

And how can I fix this warning?

Comment 34 Yiyang Wu 2022-07-03 02:46:07 UTC

(In reply to perestoronin from comment #33)
> And rocBLAS now compiled successful. I will try upgrade ebuild to rocm
> v5.2.0 and have got:
> 
> -- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Failed
> 
> How to fix warnigns ?
> 

Currently clang does not support parallel jobs, I suppose. Maybe llvm/clang-15 will include that support (see https://reviews.llvm.org/D69582), maybe not.

> *------------------------------- ROCMChecks WARNING --------------------------*
>   Options and properties should be set on a cmake target where possible. The
>   variable 'CMAKE_CXX_FLAGS' may be set by the cmake toolchain, either by
>   calling 'cmake -DCMAKE_CXX_FLAGS="-O2 -pipe -march=znver2 -Wno-unused-command-line-argument"'
>   or set in a toolchain file and added with
>   'cmake -DCMAKE_TOOLCHAIN_FILE=<toolchain-file>'
> 
> CMake Warning at /usr/share/rocm/cmake/ROCMChecks.cmake:46 (message):
>   'CMAKE_CXX_FLAGS' is set at
>   /var/tmp/portage/sci-libs/rocRAND-5.2.0/work/rocRAND-rocm-5.2.0/cmake/CMakeLists.txt
> 
> And how to fix warnign ?

The CMakeLists.txt of rocRAND and hipRAND contains `set(CMAKE_CXX_FLAGS`, which triggers the warning. Simply remove these blocks and let portage handle CXX_FLAGS. We should also report the warning upstream and remind them that CMAKE_CXX_FLAGS should be set in a toolchain file rather than in CMakeLists.txt.
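
As a rough illustration of the removal, the filter below drops such statements from CMake source. It is a sketch under a strong assumption: each `set(CMAKE_CXX_FLAGS ...)` call sits on a single line. Multi-line calls would need a proper patch instead. The function name is hypothetical.

```shell
# Hypothetical filter: drop single-line set(CMAKE_CXX_FLAGS ...) statements
# from CMake source read on stdin. Multi-line set() calls are NOT handled,
# and variables such as CMAKE_CXX_FLAGS_RELEASE are left untouched.
drop_cxx_flags_lines() {
    sed '/^[[:space:]]*set[[:space:]]*([[:space:]]*CMAKE_CXX_FLAGS[[:space:]]/d'
}
```

In an ebuild the same sed expression could be applied in place to the affected CMakeLists.txt during src_prepare, after checking by hand that nothing else on those lines is needed.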

Comment 35 perestoronin 2022-07-03 14:20:22 UTC

(In reply to Yiyang Wu from comment #34)
> (In reply to perestoronin from comment #33)
> > And rocBLAS now compiled successful. I will try upgrade ebuild to rocm
> > v5.2.0 and have got:
> > 
> > -- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Failed
> > 
> > How to fix warnings?
> > 
> 
> Currently clang does not support parallel jobs, I suppose. Maybe
> llvm/clang-15 will include that support, see
> https://reviews.llvm.org/D69582, maybe not.

Thanks. After adapting patch D69582 to the llvm-14.0.6 branch (the adapted patch can be taken from https://gist.github.com/raw/8f79f3435e1a1f600ab5cd07d401b686):

-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Success

> rocRAND and hipRAND's CMakeLists.txt contains `set(CMAKE_CXX_FLAGS`, which
> triggered the warning. Simply remove these blocks and handle CXX_FLAGS by
> portage. We should also report that warning to upstream, and remind them
> that CMAKE_CXX_FLAGS should be set in toolchain file rather than CMakeLists.

Thanks, done.

Comment 36 Rafael Ristovski 2022-09-30 15:25:24 UTC

What is the current progress on resolving the fact that multiple LLVM versions get loaded into Blender when using HIP? This causes even the older Blender 2.x releases to trigger the llvm error when using rocm-opencl-runtime.

From what I could gather, the solution is to get the ROCm stack to build with the system's LLVM version, which will then match the other one that gets loaded?

Comment 37 Yiyang Wu 2022-09-30 16:19:33 UTC

(In reply to Rafael Ristovski from comment #36)
> What is the current progress on resolving the fact that multiple LLVM
> versions get loaded into Blender when using HIP? This causes even the older
> Blender 2.x releases to trigger the llvm error when using
> rocm-opencl-runtime.
> 
> From what I could gather, the solution is to get the ROCm stack to build
> with the systems LLVM version which will then match the other one that gets
> loaded?

ROCm-5.1.3 ebuilds in Gentoo are now built against system llvm/clang-14, so using them (>=5.1.3) would be safe.
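
One rough way to check whether a process still has more than one LLVM mapped is to look at the shared objects listed in its `/proc/<pid>/maps`. The sketch below assumes the LLVM copies are dynamic `libLLVM*.so` libraries (statically linked copies would not show up this way); the helper name `list_llvm_libs` and the sample paths are made up for illustration:

```shell
#!/bin/sh
# Print each distinct libLLVM shared object found in /proc/<pid>/maps-style
# input read from stdin. More than one line of output suggests that several
# LLVM copies are mapped into the same process.
list_llvm_libs() {
    awk '{ print $NF }' | grep -E '/libLLVM[^/]*\.so' | sort -u
}

# Made-up sample resembling /proc/<pid>/maps; the paths are illustrative only.
sample='7f0000000000-7f0000100000 r-xp 00000000 00:00 1 /usr/lib/llvm/14/lib64/libLLVM-14.so
7f0000100000-7f0000200000 r--p 00100000 00:00 1 /usr/lib/llvm/14/lib64/libLLVM-14.so
7f0001000000-7f0001100000 r-xp 00000000 00:00 2 /usr/lib/llvm/roc/lib/libLLVM-15git.so
7f0002000000-7f0002100000 r-xp 00000000 00:00 3 /usr/lib64/libc.so.6'

# Count the distinct libLLVM objects in the sample.
llvm_count=$(printf '%s\n' "$sample" | list_llvm_libs | wc -l)
echo "distinct libLLVM objects: $llvm_count"
```

Against a running Blender this would be `list_llvm_libs < /proc/<pid>/maps`, with `<pid>` replaced by the Blender process ID.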

Comment 38 Sebastian Parborg 2022-10-01 10:12:22 UTC

(In reply to Yiyang Wu from comment #37)
> (In reply to Rafael Ristovski from comment #36)
> > What is the current progress on resolving the fact that multiple LLVM
> > versions get loaded into Blender when using HIP? This causes even the older
> > Blender 2.x releases to trigger the llvm error when using
> > rocm-opencl-runtime.
> > 
> > From what I could gather, the solution is to get the ROCm stack to build
> > with the systems LLVM version which will then match the other one that gets
> > loaded?
> 
> ROCm-5.1.3 ebuilds in Gentoo are now built against system llvm/clang-14, so
> using them (>=5.1.3) would be safe.

Great! I'll try to get around to adding a HIP USE flag to Blender soonish then.

As a follow-up question, do you know if the HIP versions in portage will be bumped to 5.3.0 soon? It seems they fixed some RDNA1 issues:
https://devtalk.blender.org/t/cycles-amd-hip-device-feedback/21400/419

Comment 39 Yiyang Wu 2022-10-01 10:41:40 UTC

> Great! I'll try to get around to adding a HIP useflag to Blender soonish
> then.
> 

Just proposed that in PR https://github.com/gentoo/gentoo/pull/27552

Please test it with RDNA2 cards. I succeeded months ago (as mentioned in previous comments), but I have not tested it on the new Blender version.

I tried blender-2.93.10 with opencl, but that did not work due to an llvm symbol collision (even though I left only one SLOT installed, there are still multiple symbols; I have no idea why).

> As a follow up question, do you know if the HIP versions in portage will be
> bumped to 5.3.0 soon? Seems like they fixed some RDNA1 issues:
> https://devtalk.blender.org/t/cycles-amd-hip-device-feedback/21400/419

ROCm-5.3 is not out yet, and it takes time for me to land it in Gentoo. If other developers are willing to help maintain the ROCm ebuilds, that would be nice and would speed things up.

Comment 40 Sebastian Parborg 2022-10-01 11:23:55 UTC

(In reply to Yiyang Wu from comment #39)
> Just proposed that in PR https://github.com/gentoo/gentoo/pull/27552
> 

OK, let's continue the conversation there.

> Please test it with RDNA2 cards. Months ago I succeeded (also mentioned in
> previous comments), but I don't test it on the new blender version.
>

Hopefully I will have some time to test it next week, but I can't promise anything.
 
> I tried blender-2.93.10 with opencl, but that did not work due to llvm
> symbol collision (although I left only one SLOT, there are still multiple
> symbols; no idea).
>

I think we can just ignore opencl at this point. When I tried it in the past it was very unstable and would frequently lock up the computer.
If it did actually render, it was slower than my CPU. So, at least to me, there isn't really any point in spending time trying to get it to work.

Let's just focus on HIP :)

> ROCm-5.3 is not out. And it takes time for me to land it in Gentoo. If there
> are other developers willing to help maintaining ROCm ebuilds, it would be
> nice and fast.

It was released 18 hours ago:
https://github.com/ROCm-Developer-Tools/hipamd/releases/tag/rocm-5.3.0

So it could be something that we could work towards.

Comment 41 Larry the Git Cow gentoo-dev 2024-04-21 12:51:26 UTC

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=175d65e779e69e5702ca52cb3af973a2fa0b0e62

commit 175d65e779e69e5702ca52cb3af973a2fa0b0e62
Author:     Paul Zander <negril.nx+gentoo@gmail.com>
AuthorDate: 2024-03-28 22:08:25 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2024-04-21 12:50:05 +0000

    media-gfx/blender: add 4.0.2-r1, cleanup
    
    hopefully fixed osl build
    re-added hip flag in 4.0.2-r1
    hide test code in release versions
    
    Bug: https://bugs.gentoo.org/693200
    Closes: https://bugs.gentoo.org/925534
    Closes: https://bugs.gentoo.org/927281
    Closes: https://bugs.gentoo.org/927715
    Closes: https://bugs.gentoo.org/927835
    Closes: https://bugs.gentoo.org/927931
    Signed-off-by: Paul Zander <negril.nx+gentoo@gmail.com>
    Closes: https://github.com/gentoo/gentoo/pull/35973
    Signed-off-by: Sam James <sam@gentoo.org>

 media-gfx/blender/blender-3.3.15.ebuild            |   4 +-
 media-gfx/blender/blender-3.3.8.ebuild             |   4 +-
 media-gfx/blender/blender-3.6.8.ebuild             |   4 +-
 ...lender-4.0.2.ebuild => blender-4.0.2-r1.ebuild} | 128 +++++---
 media-gfx/blender/blender-9999.ebuild              | 119 ++++---
 .../blender/files/blender-4.0.1-openvdb-11.patch   |   2 +
 .../files/blender-4.0.2-CUDA_NVCC_FLAGS.patch      |  14 +
 .../blender/files/blender-4.0.2-FindClang.patch    |  14 +
 .../blender/files/blender-4.0.2-r1-osl-1.13.patch  | 342 +++++++++++++++++++++
 profiles/arch/amd64/package.use.mask               |   4 +
 profiles/arch/base/package.use.mask                |   4 +
 11 files changed, 556 insertions(+), 83 deletions(-)