Created attachment 880241 [details]
AMD_LOG_LEVEL=4 ./vadd_hip
HIP programs do not exit normally with hip-5.7.1-r1. As far as I can tell, any program that uses HIP hangs on exit and becomes unkillable. Even "sudo kill -9 $PID" doesn't end its zombified existence.
Tested on vadd_hip from https://wiki.gentoo.org/wiki/HIP#Testing_your_HIP_installation.
Steps to reproduce:
1. Compile any HIP program
2. Launch it
3. Watch it complete its task and become a zombie task
Expected behavior:
1. Compile any HIP program
2. Launch it
3. Watch it complete its task and successfully terminate
My LLVM version is 17, if that's of any use.
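A side note on the "unkillable" symptom: a true zombie (state Z) has already exited and only waits for its parent to reap it, while a task blocked inside the kernel sits in uninterruptible sleep (state D) and does not act on SIGKILL until it leaves that state. A minimal Python sketch for telling the two apart from /proc/<pid>/stat could look like this; the helper names and the sample stat line are made up for illustration and are not taken from this report.

```python
# Sketch: classify a process by the state field of /proc/<pid>/stat (Linux only).
# Helper names and the sample line below are illustrative assumptions.

STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (SIGKILL is not acted on until it ends)",
    "Z": "zombie (already exited, waiting to be reaped by its parent)",
    "T": "stopped",
}


def state_from_stat(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def describe_pid(pid):
    """Read /proc/<pid>/stat and return a human-readable state."""
    with open(f"/proc/{pid}/stat") as handle:
        state = state_from_stat(handle.read())
    return STATE_NAMES.get(state, f"other ({state})")


if __name__ == "__main__":
    # Hypothetical, shortened stat line; real ones carry many more fields.
    sample = "247754 (vadd_hip) D 1 247754 247754 0 -1"
    print(state_from_stat(sample))
```

Calling `describe_pid` on the PID of a hung vadd_hip would show which of the two cases applies, which may be worth including in an upstream report.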
Can you attach the output of:
1. `dmesg` after starting the HIP program
2. `rocm-smi --showpids` after the program hangs
dmesg shows nothing after the program hangs.
# rocm-smi --showpids
======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
KFD process information:
PID     PROCESS NAME           GPU(s)  VRAM USED  SDMA USED  CU OCCUPANCY
16938   blender-4.1 <defunct>  0       0          0          0
246625  vadd_hip               1       159436800  0          0
284855  vadd_hip               1       159436800  0          0
263009  vadd_hip               1       159436800  0          0
10045   blender-4.1 <defunct>  0       0          0          0
262181  blender-4.1 <defunct>  0       0          0          0
194298  rocminfo               0       0          0          0
46475   rocm-bandwidth-        0       0          0          0
247754  vadd_hip <defunct>     1       159436800  0          0
================================================================================
============================= End of ROCm SMI Log ==============================
The blender instances are from my earlier attempts to enable HIP in the app, and the four vadd_hip entries are my attempts at, well, launching it.
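For anyone who wants to track how these stuck entries pile up between runs, here is a rough Python sketch that parses the process rows reported above and picks out the defunct ones and the ones still holding VRAM. The regular expression and the function name are my own assumptions about the row layout, not part of rocm-smi.

```python
import re

# Rows copied from the `rocm-smi --showpids` output in this comment.
SAMPLE = """\
16938 blender-4.1 <defunct> 0 0 0 0
246625 vadd_hip 1 159436800 0 0
284855 vadd_hip 1 159436800 0 0
263009 vadd_hip 1 159436800 0 0
10045 blender-4.1 <defunct> 0 0 0 0
262181 blender-4.1 <defunct> 0 0 0 0
194298 rocminfo 0 0 0 0
46475 rocm-bandwidth- 0 0 0 0
247754 vadd_hip <defunct> 1 159436800 0 0
"""

# Assumed layout: PID, process name (may contain spaces), then four integers.
ROW = re.compile(
    r"^(?P<pid>\d+)\s+(?P<name>.+?)\s+(?P<gpus>\d+)\s+(?P<vram>\d+)"
    r"\s+(?P<sdma>\d+)\s+(?P<cu>\d+)\s*$"
)


def parse_rows(text):
    """Turn the KFD process rows into dictionaries; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or blank line
        name = match.group("name")
        rows.append({
            "pid": int(match.group("pid")),
            "name": name.replace("<defunct>", "").strip(),
            "defunct": "<defunct>" in name,
            "gpus": int(match.group("gpus")),
            "vram": int(match.group("vram")),
        })
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        if row["defunct"] or row["vram"] > 0:
            print(row["pid"], row["name"], "defunct" if row["defunct"] else "", row["vram"])
```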
Created attachment 880251 [details] Full dmesg -x log
In your dmesg:
kern :err : [ 818.388261] amdgpu 0000:08:00.0: amdgpu: bo 00000000f74c912b va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200
kern :err : [ 818.388267] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
kern :err : [ 818.388269] amdgpu: Failed to map bo to gpuvm
It seems that there are some early issues with the amdgpu driver, and your `rocm-smi --showpids` shows that zombie processes are piling up. I suspect it's the amdgpu driver that causes the hangs.
What happens if you reset your GPU with `sudo rocm-smi --gpureset -d 0`, or even reboot the machine, and then test the HIP vecadd program again?
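To pull lines like these out of a long `dmesg -x` log, a small Python filter along the following lines may help. The regular expression is my assumption about the `facility :level : [timestamp] message` layout, the function names are made up, and the last sample line is a hypothetical non-error line added only to show the filtering. On Linux, the "ret -22" in the second message corresponds to -EINVAL.

```python
import errno
import re

# The three error lines are quoted from the log; the fourth is hypothetical.
SAMPLE = """\
kern :err : [ 818.388261] amdgpu 0000:08:00.0: amdgpu: bo 00000000f74c912b va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200
kern :err : [ 818.388267] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
kern :err : [ 818.388269] amdgpu: Failed to map bo to gpuvm
kern :info : [ 900.000000] example: hypothetical informational line
"""

# Assumed `dmesg -x` layout: facility, level, bracketed timestamp, message.
LINE = re.compile(
    r"^(?P<facility>\w+)\s*:(?P<level>\w+)\s*:\s*"
    r"\[\s*(?P<ts>\d+\.\d+)\]\s*(?P<msg>.*)$"
)

SEVERE = ("emerg", "alert", "crit", "err")


def amdgpu_errors(log_text):
    """Return (timestamp, message) pairs for severe lines mentioning amdgpu."""
    hits = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and match.group("level") in SEVERE and "amdgpu" in match.group("msg"):
            hits.append((float(match.group("ts")), match.group("msg")))
    return hits


def errno_name(ret):
    """Map a negative kernel return code such as -22 to its errno name."""
    return errno.errorcode.get(-ret, "unknown")


if __name__ == "__main__":
    for ts, msg in amdgpu_errors(SAMPLE):
        print(ts, msg)
    print(errno_name(-22))
```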
I tried restarting and found out that everything is absolutely fine until I launch blender and try to select HIP. Then blender hangs, and the broken state somehow persists across GPU resets and forces any other HIP program to hang on exit as described.
When I try to enable HIP in blender, there is a crash somewhere in the driver(?) that's causing the effect maybe? dmesg is in the attachment
Created attachment 880253 [details] blender HIP crash
I have found some similar issues:
https://github.com/ROCm/ROCm/issues/2596
https://gitlab.freedesktop.org/drm/amd/-/issues/2991
Not sure if they are related to this bug, but they happen after Linux 6.6. Can you try Linux 6.5?
On kernel
Linux FruitPlantation 6.5.13-gentoo-dist #1 SMP PREEMPT_DYNAMIC Sun Dec 24 17:04:30 +10 2023 x86_64 AMD Ryzen 5 3600 6-Core Processor AuthenticAMD GNU/Linux
When I try to select HIP, it works but hangs half a second later, at least showing me the menu. From there it's a mess: the broken blender instance can do anything from a "clean" exit on kill to a complete system hang, and rocm-smi reports it as "running" or defunct. Will try 6.5.6 and report.
Created attachment 880279 [details] Blender log on 6.5.13 until hang
Created attachment 880281 [details] dmesg on 6.5.6 until hang when selecting the CPU
I was also using 6700XT on 6.5 kernel with blender, and I can't reproduce this issue. I'm using Gentoo Prefix on Debian 12, with some small customization on vanilla Debian kernel. Can you share your kernel configuration?
On kernel
Linux FruitPlantation 6.5.6-gentoo-dist #1 SMP PREEMPT_DYNAMIC Sun Dec 24 18:17:59 +10 2023 x86_64 AMD Ryzen 5 3600 6-Core Processor AuthenticAMD GNU/Linux
Hanging continues, although the UI stays responsive for another ~second before hanging. When trying to terminate the program, it hangs the GPU. After 'rocm-smi -d 0 --gpureset', gdm is able to launch, although the session doesn't work. Kernel log attached, blender log unchanged. What should I do next?
Created attachment 880282 [details] Kernel config (6.5.6)
Created attachment 880283 [details] Kernel config (6.5.13)
> Kernel config (6.5.6)
The amdgpu related differences between your config and mine are:
1. I use CONFIG_DRM_AMDGPU=y (shouldn't be a problem; most people in the world use CONFIG_DRM_AMDGPU=m)
2. I don't set CONFIG_HSA_AMD_SVM=y
Another thing you can try is finding out which code triggers the kernel issue. Turn on USE=debug and follow instructions from https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces to get backtraces of blender. After finding the trigger we can report to upstream, including amdgpu kernel driver, HIP (https://github.com/ROCm/clr) and blender.
Also, since this is related to the amdgpu kernel driver, maybe reporting the issue to https://gitlab.freedesktop.org/drm/amd/-/issues/ will bring more insight.
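A comparison like this can be repeated for other options with a short Python sketch such as the one below. It assumes the usual kernel .config syntax (`CONFIG_X=value` and `# CONFIG_X is not set`). The two sample fragments only restate the two differences named in this comment and are not copied from the attached configs, and the function names are made up.

```python
# Illustrative fragments only; they restate the two differences named above.
REPORTER = """\
CONFIG_DRM_AMDGPU=m
CONFIG_HSA_AMD=y
CONFIG_HSA_AMD_SVM=y
"""

WORKING = """\
CONFIG_DRM_AMDGPU=y
CONFIG_HSA_AMD=y
# CONFIG_HSA_AMD_SVM is not set
"""

NOT_SET_SUFFIX = " is not set"


def parse_config(text):
    """Map option names to values; unset options are recorded as 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            options[line[2:-len(NOT_SET_SUFFIX)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


def diff_configs(first, second, keywords=("AMDGPU", "HSA_AMD")):
    """Return {option: (first_value, second_value)} for differing options."""
    differences = {}
    for key in sorted(set(first) | set(second)):
        if not any(word in key for word in keywords):
            continue
        # An option missing from a config is treated like an unset one.
        a, b = first.get(key, "n"), second.get(key, "n")
        if a != b:
            differences[key] = (a, b)
    return differences


if __name__ == "__main__":
    for key, (a, b) in diff_configs(parse_config(REPORTER), parse_config(WORKING)).items():
        print(f"{key}: {a} -> {b}")
```

Passing a wider `keywords` tuple (for example adding "DRM") would widen the comparison beyond the amdgpu and HSA options.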
(In reply to Yiyang Wu from comment #16)
> Another thing you can try is finding out which code triggers the kernel
> issue. Turn on USE=debug and follow instructions from
> https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces to get
> backtraces of blender. After finding the trigger we can report to upstream,
> including amdgpu kernel driver, HIP (https://github.com/ROCm/clr) and
> blender.
What exactly should I do? Blender doesn't exactly end its execution, even on SIGKILLs and everything. Even when I'm rebooting, every time I need to hard-reset my PC or wait for systemd to stop caring about it.
(In reply to tigrmango from comment #17)
> (In reply to Yiyang Wu from comment #16)
> > Another thing you can try is finding out which code triggers the kernel
> > issue. Turn on USE=debug and follow instructions from
> > https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces to get
> > backtraces of blender. After finding the trigger we can report to upstream,
> > including amdgpu kernel driver, HIP (https://github.com/ROCm/clr) and
> > blender.
>
> What exactly should I do? Blender doesn't exactly end its execution, even on
> SIGKILLs and everything. Even when I'm rebooting every time I need to
> hard-reset my PC or wait for systemd to stop caring about it.
Well if you just want to get rid of GPU hangs and do not need the blender HIP cycle, you could just turn off USE=hip, or disable using HIP cycles in your configuration.
If you want to figure the root cause and let HIP cycles accelerate rendering on your Gentoo system, you have to report the issue to the correct experts. I can't reproduce your issue and I'm not an expert on graphics card driver, so I don't know what to do next either.
My experience tells me that this issue should be reported at https://gitlab.freedesktop.org/drm/amd/-/issues/, you can open an issue there and paste the blender log, dmesg, kernel config, and give the link of this bug ticket as reference.
Also, read through https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces, build a blender with debug info (CXXFLAGS="-Og -ggdb") and use gdb to get backtrace.
Looking at the blender log, the last output before hang is "Added device ...", so go to the source code of blender (you can keep the source code after emerge using FEATURES="keepwork"), use tools like grep (I use `ag "Added device"`) to locate the code, it's at intern/cycles/device/hip/device.cpp:201. Then you know hang happens after this line, so the breakpoint should be put here:
1. gdb --args <command of launching b>
2. break intern/cycles/device/hip/device.cpp:201
3. run
After hitting the breakpoint, use `n` to run step by step, and monitor the dmesg, until you hit the issue. Therefore you locate the code that causes the hang, and provide that piece of important information to upstream (HIP, blender)
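The three gdb steps can also be given on the command line, since gdb accepts commands through `-ex` before `--args`. A small Python sketch that only assembles such a command line is shown below; the function name is made up, and `blender` stands in as a hypothetical placeholder for the actual launch command.

```python
import shlex


def gdb_command(program_argv, breakpoint):
    """Build a gdb argv that sets one breakpoint and starts the program."""
    # `--args` must come last: everything after it is the program and its
    # arguments, so the `-ex` commands have to precede it.
    return ["gdb", "-ex", f"break {breakpoint}", "-ex", "run", "--args", *program_argv]


if __name__ == "__main__":
    # "blender" is a placeholder; substitute the real launch command.
    argv = gdb_command(["blender"], "intern/cycles/device/hip/device.cpp:201")
    print(shlex.join(argv))
```

gdb stays interactive after the breakpoint is hit, so stepping with `n` while watching dmesg works as described.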
(In reply to Yiyang Wu from comment #18)
> (In reply to tigrmango from comment #17)
> > (In reply to Yiyang Wu from comment #16)
> > > Another thing you can try is finding out which code triggers the kernel
> > > issue. Turn on USE=debug and follow instructions from
> > > https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces to get
> > > backtraces of blender. After finding the trigger we can report to upstream,
> > > including amdgpu kernel driver, HIP (https://github.com/ROCm/clr) and
> > > blender.
> >
> > What exactly should I do? Blender doesn't exactly end its execution, even on
> > SIGKILLs and everything. Even when I'm rebooting every time I need to
> > hard-reset my PC or wait for systemd to stop caring about it.
>
> Well if you just want to get rid of GPU hangs and do not need the blender
> HIP cycle, you could just turn off USE=hip, or disable using HIP cycles in
> your configuration.
>
> If you want to figure the root cause and let HIP cycles accelerate rendering
> on your Gentoo system, you have to report the issue to the correct experts.
> I can't reproduce your issue and I'm not an expert on graphics card driver,
> so I don't know what to do next either.
>
> My experience tells me that this issue should be reported at
> https://gitlab.freedesktop.org/drm/amd/-/issues/, you can open an issue
> there and paste the blender log, dmesg, kernel config, and give the
> link of this bug ticket as reference.
>
> Also, read through
> https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces, build a
> blender with debug info (CXXFLAGS="-Og -ggdb") and use gdb to get backtrace.
>
> Looking at the blender log, the last output before hang is "Added device
> ...", so go to the source code of blender (you can keep the source code
> after emerge using FEATURES="keepwork"), use tools like grep (I use `ag
> "Added device"`) to locate the code, it's at
> intern/cycles/device/hip/device.cpp:201. Then you know hang happens after
> this line, so the breakpoint should be put here:
>
> 1. gdb --args <command of launching b>
> 2. break intern/cycles/device/hip/device.cpp:201
> 3. run
>
> After hitting the breakpoint, use `n` to run step by step, and monitor the
> dmesg, until you hit the issue. Therefore you locate the code that causes
> the hang, and provide that piece of important information to upstream (HIP,
> blender)
I need the HIP acceleration and will try to figure out what causes the hang for me and then report it, I guess. Thank you very much for your help and support!
(In reply to tigrmango from comment #19)
> I need the HIP acceleration and will try to figure out what causes the hang
> for me and then report it, I guess. Thank you very much for your help and
> support!
(In reply to Yiyang Wu from comment #16)
> The amdgpu related differences between your config and mine are:
>
> 1. I use CONFIG_DRM_AMDGPU=y (shouldn't be a problem; most people in the
> world use CONFIG_DRM_AMDGPU=m)
> 2. I don't set CONFIG_HSA_AMD_SVM=y
>
Have you tried to unset CONFIG_HSA_AMD_SVM?
As far as I can tell, the crash is happening somewhere in python(?) because last available call is somewhere in libpython, before a really long sequence of "Cannot find bounds of current function"
(In reply to Yiyang Wu from comment #20)
> (In reply to tigrmango from comment #19)
> > I need the HIP acceleration and will try to figure out what causes the hang
> > for me and then report it, I guess. Thank you very much for your help and
> > support!
>
> (In reply to Yiyang Wu from comment #16)
> > The amdgpu related differences between your config and mine are:
> >
> > 1. I use CONFIG_DRM_AMDGPU=y (shouldn't be a problem; most people in the
> > world use CONFIG_DRM_AMDGPU=m)
> > 2. I don't set CONFIG_HSA_AMD_SVM=y
> >
>
> Have you tried to unset CONFIG_HSA_AMD_SVM?
I did not try that yet. I'll try to find the crash spot in gdb and then recompile the kernel without this option.
Recompiled the kernel with CONFIG_HSA_AMD_SVM=n, still crashing and bringing down my GPU. Are there any other differences between our configs?
(In reply to tigrmango from comment #21)
> As far as I can tell, the crash is happening somewhere in python(?) because
> last available call is somewhere in libpython, before a really long sequence
> of "Cannot find bounds of current function"
Did you try to recompile blender with debug info on and optimization off, as mentioned in https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces#Compiler_flags?
Use CXXFLAGS="-Og -ggdb" and FEATURES="keepwork splitdebug" when emerging (keepwork means you need to manually clean the build directory at ${PORTAGE_TMPDIR} before the next round of emerge).
(In reply to tigrmango from comment #23)
> Recompiled the kernel with CONFIG_HSA_AMD_SVM=n, still crashing and bringing
> down my GPU. Are there any other differences between our configs?
Well there are many others, but I don't think they are relevant. I will upload mine.
Created attachment 880286 [details] Kernel config 6.5.10 (no crash issue observed)
(In reply to Yiyang Wu from comment #24)
> (In reply to tigrmango from comment #21)
> > As far as I can tell, the crash is happening somewhere in python(?) because
> > last available call is somewhere in libpython, before a really long sequence
> > of "Cannot find bounds of current function"
>
> Did you try to recompile blender with debug info on and optimization off,
> mentioned in
> https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces#Compiler_flags?
>
> Use CXXFLAGS="-Og -ggdb" and FEATURES="keepwork splitdebug" when emerging
> (keepwork means you need to manually clean the build directory at
> ${PORTAGE_TMPDIR} before next round of emerge)
Yes, I did all of that except FEATURES="nostrip"
(In reply to Yiyang Wu from comment #26)
> Created attachment 880286 [details]
> Kernel config 6.5.10 (no crash issue observed)
Tried this config; the same crash is still observed.