920553 – dev-util/hip-5.7.1-r1: programs hang/zombify instead of exiting

Status:	UNCONFIRMED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	AMD64 Linux

Importance:	Normal major
Assignee:	Gentoo Science Related Packages

Reported:	2023-12-22 20:53 UTC by tigrmango
Modified:	2023-12-25 03:03 UTC
CC List:	4 users

Attachments
AMD_LOG_LEVEL=4 ./vadd_hip (file_920553.txt,11.92 KB, text/plain) 2023-12-22 20:53 UTC, tigrmango
Full dmesg -x log (dmesg.log,151.85 KB, text/plain) 2023-12-23 06:56 UTC, tigrmango
blender HIP crash (file_920553.txt,13.27 KB, text/plain) 2023-12-23 08:24 UTC, tigrmango
Blender log on 6.5.13 until hang (file_920553.txt,853 bytes, text/plain) 2023-12-24 07:27 UTC, tigrmango
dmesg on 6.5.6 until hang when selecting the CPU (file_920553.txt,7.08 KB, text/plain) 2023-12-24 08:45 UTC, tigrmango
Kernel config (6.5.6) (gentoo-kernel-6.5.6,255.57 KB, application/x-troff-man) 2023-12-24 08:58 UTC, tigrmango
Kernel config (6.5.13) (gentoo-kernel-6.5.13,255.60 KB, text/plain) 2023-12-24 08:59 UTC, tigrmango
Kernel config 6.5.10 (no crash issue observed) (config-6.5.0-0.deb12.4-amd64,259.65 KB, text/plain) 2023-12-24 14:55 UTC, Yiyang Wu

Description tigrmango 2023-12-22 20:53:15 UTC

Created attachment 880241 [details]
AMD_LOG_LEVEL=4 ./vadd_hip

HIP programs do not exit normally with hip-5.7.1-r1. As far as I can tell, any program that uses HIP hangs and becomes unkillable. Even "sudo kill -9 $PID" doesn't end its zombified existence. Tested with vadd_hip from https://wiki.gentoo.org/wiki/HIP#Testing_your_HIP_installation.

Steps to reproduce:
1. Compile any HIP program
2. Launch it
3. Watch it complete its task and become a zombie task

Expected behavior:
1. Compile any HIP program
2. Launch it
3. Watch it complete its task and successfully terminate

My LLVM version is 17 if that's of any use.
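
For reference, a process that survives `kill -9` is normally in one of two states: a zombie (Z), which has already exited and is waiting to be reaped by its parent, or uninterruptible sleep (D), which usually means it is blocked inside a kernel driver. The sketch below, which assumes a Linux /proc filesystem, reads the state letter for a PID. The names `parse_state` and `process_state` are illustrative and not part of any ROCm tool.

```python
import os


def parse_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tD (disk sleep)".
            return line.split()[1]
    return None


def process_state(pid):
    """Return the state letter for pid, or None if /proc has no entry for it."""
    path = f"/proc/{pid}/status"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_state(f.read())


# Demo on the current process; replace os.getpid() with the PID of the hung program.
print(process_state(os.getpid()))
```

A result of `D` for the hung program would point at the kernel side (for example the GPU driver) rather than at the program itself.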

Comment 1 Yiyang Wu 2023-12-23 03:19:21 UTC

Can you attach the output of:

1. `dmesg` after starting the HIP program
2. `rocm-smi --showpids` after the program hangs

Comment 2 tigrmango 2023-12-23 06:55:14 UTC

dmesg shows nothing after the program hangs.

# rocm-smi --showpids


======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
KFD process information:
PID   	PROCESS NAME         	GPU(s)	VRAM USED	SDMA USED	CU OCCUPANCY	
16938 	blender-4.1 <defunct>	0     	0        	0        	0           	
246625	vadd_hip             	1     	159436800	0        	0           	
284855	vadd_hip             	1     	159436800	0        	0           	
263009	vadd_hip             	1     	159436800	0        	0           	
10045 	blender-4.1 <defunct>	0     	0        	0        	0           	
262181	blender-4.1 <defunct>	0     	0        	0        	0           	
194298	rocminfo             	0     	0        	0        	0           	
46475 	rocm-bandwidth-      	0     	0        	0        	0           	
247754	vadd_hip <defunct>   	1     	159436800	0        	0           	
================================================================================
============================= End of ROCm SMI Log ==============================


The blender instances are from my earlier attempts to enable HIP in the app, and the four vadd_hip entries are my attempts at launching the test program.
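
To pick the stuck entries out of a table like the one above automatically, a small parser can help. This is a sketch only: it assumes the tab-separated layout shown in this comment (other rocm-smi versions may print a different format), and the function names are made up for illustration.

```python
def parse_kfd_processes(text):
    """Parse the tab-separated rows of the 'KFD Processes' table into dicts."""
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        # Data rows start with a numeric PID; headers and rulers do not.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        rows.append({
            "pid": int(fields[0]),
            "name": fields[1],
            "gpus": int(fields[2]),
            "vram": int(fields[3]),
        })
    return rows


def stuck_processes(rows):
    """Return rows that are marked <defunct> or that still hold VRAM."""
    return [r for r in rows if "<defunct>" in r["name"] or r["vram"] > 0]


# Two rows copied from the output above.
SAMPLE = (
    "16938 \tblender-4.1 <defunct>\t0     \t0        \t0        \t0           \t\n"
    "246625\tvadd_hip             \t1     \t159436800\t0        \t0           \t\n"
)

for row in stuck_processes(parse_kfd_processes(SAMPLE)):
    print(row["pid"], row["name"], row["vram"])
```

For live data, the text passed to `parse_kfd_processes` would be the captured output of `rocm-smi --showpids`.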

Comment 3 tigrmango 2023-12-23 06:56:31 UTC

Created attachment 880251 [details]
Full dmesg -x log

Comment 4 Yiyang Wu 2023-12-23 07:21:27 UTC

In your dmesg:

kern  :err   : [  818.388261] amdgpu 0000:08:00.0: amdgpu: bo 00000000f74c912b va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200
kern  :err   : [  818.388267] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
kern  :err   : [  818.388269] amdgpu: Failed to map bo to gpuvm

It seems that there are some early issues with the amdgpu driver.

And your `rocm-smi --showpids` shows that zombie processes are piling up. I suspect it's the amdgpu driver that causes the hangs.

What happens if you reset your GPU with `sudo rocm-smi --gpureset -d 0`, or even reboot the machine, and then test the HIP program vadd_hip again?
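
To find lines like these in a long log, the `dmesg -x` output can be filtered by level and driver name. The sketch below assumes the `facility:level: [timestamp] message` layout seen in the excerpt above; `amdgpu_errors` is an illustrative helper, not an existing tool.

```python
import re

# Matches lines such as "kern  :err   : [  818.388267] amdgpu: ..."
LINE = re.compile(r"^(\w+)\s*:(\w+)\s*: \[\s*([\d.]+)\] (.*)$")
SEVERE = {"emerg", "alert", "crit", "err"}


def amdgpu_errors(lines):
    """Return (timestamp, message) for error-level lines that mention amdgpu."""
    hits = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        _facility, level, stamp, message = m.groups()
        if level in SEVERE and "amdgpu" in message:
            hits.append((float(stamp), message))
    return hits


# The three lines quoted above.
SAMPLE = """\
kern  :err   : [  818.388261] amdgpu 0000:08:00.0: amdgpu: bo 00000000f74c912b va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200
kern  :err   : [  818.388267] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
kern  :err   : [  818.388269] amdgpu: Failed to map bo to gpuvm
"""

for stamp, message in amdgpu_errors(SAMPLE.splitlines()):
    print(stamp, message)
```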

Comment 5 tigrmango 2023-12-23 08:20:03 UTC

I tried restarting and found that everything is absolutely fine until I launch blender and try to select HIP. Then blender hangs, the hang somehow persists across GPU resets, and it forces any other HIP program to hang on exit as described.

Comment 6 tigrmango 2023-12-23 08:24:35 UTC

When I try to enable HIP in blender, there is a crash somewhere in the driver(?), which may be what's causing this.

dmesg is in the attachment

Comment 7 tigrmango 2023-12-23 08:24:59 UTC

Created attachment 880253 [details]
blender HIP crash

Comment 8 Yiyang Wu 2023-12-23 09:07:02 UTC

I have found some similar issues:

https://github.com/ROCm/ROCm/issues/2596
https://gitlab.freedesktop.org/drm/amd/-/issues/2991

Not sure if they are related to this bug, but they happen after Linux 6.6. Can you try Linux 6.5?

Comment 9 tigrmango 2023-12-24 07:26:02 UTC

On kernel
Linux FruitPlantation 6.5.13-gentoo-dist #1 SMP PREEMPT_DYNAMIC Sun Dec 24 17:04:30 +10 2023 x86_64 AMD Ryzen 5 3600 6-Core Processor AuthenticAMD GNU/Linux
When I try to select HIP, it works but hangs half a second later, at least showing me the menu.
From there it's a mess: the broken blender instance can do anything from a "clean" exit on kill to a complete system hang, and rocm-smi reports it as either "running" or defunct. Will try 6.5.6 and report.

Comment 10 tigrmango 2023-12-24 07:27:12 UTC

Created attachment 880279 [details]
Blender log on 6.5.13 until hang

Comment 11 tigrmango 2023-12-24 08:45:18 UTC

Created attachment 880281 [details]
dmesg on 6.5.6 until hang when selecting the CPU

Comment 12 Yiyang Wu 2023-12-24 08:55:55 UTC

I was also using 6700XT on 6.5 kernel with blender, and I can't reproduce this issue. I'm using Gentoo Prefix on Debian 12, with some small customization on vanilla Debian kernel. Can you share your kernel configuration?

Comment 13 tigrmango 2023-12-24 08:57:49 UTC

On kernel 
Linux FruitPlantation 6.5.6-gentoo-dist #1 SMP PREEMPT_DYNAMIC Sun Dec 24 18:17:59 +10 2023 x86_64 AMD Ryzen 5 3600 6-Core Processor AuthenticAMD GNU/Linux

The hanging continues, although the UI stays responsive for another ~second before hanging. Trying to terminate the program hangs the GPU. After 'rocm-smi -d 0 --gpureset', gdm is able to launch, although the session doesn't work. Kernel log attached; the blender log is unchanged. What should I do next?

Comment 14 tigrmango 2023-12-24 08:58:42 UTC

Created attachment 880282 [details]
Kernel config (6.5.6)

Comment 15 tigrmango 2023-12-24 08:59:07 UTC

Created attachment 880283 [details]
Kernel config (6.5.13)

Comment 16 Yiyang Wu 2023-12-24 09:24:34 UTC

> Kernel config (6.5.6)

The amdgpu related differences between your config and mine are:

1. I use CONFIG_DRM_AMDGPU=y (shouldn't be a problem; most people in the world use CONFIG_DRM_AMDGPU=m)
2. I don't set CONFIG_HSA_AMD_SVM=y

Another thing you can try is finding out which code triggers the kernel issue. Turn on USE=debug and follow instructions from https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces to get backtraces of blender. After finding the trigger we can report to upstream, including amdgpu kernel driver, HIP (https://github.com/ROCm/clr) and blender.

Also, since this is related to the amdgpu kernel driver, reporting the issue at https://gitlab.freedesktop.org/drm/amd/-/issues/ may yield more insight.
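
A comparison like the one above can be scripted so that only the GPU-related options are listed. The sketch below is an illustration under stated assumptions: it treats an option that is absent from a file as "n", it filters on the substrings `AMDGPU` and `HSA_AMD` (a choice made here, not an official list), and the two sample configs are invented for the demo rather than copied from the attachments.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option names to values; '# CONFIG_X is not set' becomes 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(a, b, keywords=("AMDGPU", "HSA_AMD")):
    """Return {option: (value_in_a, value_in_b)} for differing, matching options."""
    diff = {}
    for name in sorted(set(a) | set(b)):
        if not any(k in name for k in keywords):
            continue
        # Assumption: an option missing from a file counts as disabled.
        va, vb = a.get(name, "n"), b.get(name, "n")
        if va != vb:
            diff[name] = (va, vb)
    return diff


# Invented sample values, for illustration only.
CONFIG_A = "CONFIG_DRM_AMDGPU=m\nCONFIG_HSA_AMD=y\nCONFIG_HSA_AMD_SVM=y\n"
CONFIG_B = "CONFIG_DRM_AMDGPU=y\nCONFIG_HSA_AMD=y\n# CONFIG_HSA_AMD_SVM is not set\n"

for name, (va, vb) in diff_configs(parse_config(CONFIG_A), parse_config(CONFIG_B)).items():
    print(name, va, vb)
```

With real files, the text would come from reading the two attached kernel configs instead of the sample strings.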

Comment 17 tigrmango 2023-12-24 12:14:26 UTC

(In reply to Yiyang Wu from comment #16)
> Another thing you can try is finding out which code triggers the kernel
> issue. Turn on USE=debug and follow instructions from
> https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces to get
> backtraces of blender. After finding the trigger we can report to upstream,
> including amdgpu kernel driver, HIP (https://github.com/ROCm/clr) and
> blender.

What exactly should I do? Blender never actually ends its execution, even on SIGKILL and everything. Even when I'm rebooting, every time I need to hard-reset my PC or wait for systemd to stop caring about it.

Comment 18 Yiyang Wu 2023-12-24 12:34:54 UTC

(In reply to tigrmango from comment #17)
> What exactly should I do? Blender never actually ends its execution, even on
> SIGKILL and everything. Even when I'm rebooting, every time I need to
> hard-reset my PC or wait for systemd to stop caring about it.

Well, if you just want to get rid of the GPU hangs and do not need HIP in Blender's Cycles renderer, you could just turn off USE=hip, or disable HIP for Cycles in your configuration.

If you want to figure out the root cause and let HIP accelerate Cycles rendering on your Gentoo system, you have to report the issue to the right experts. I can't reproduce your issue and I'm not an expert on graphics card drivers, so I don't know what to do next either.

My experience tells me that this issue should be reported at https://gitlab.freedesktop.org/drm/amd/-/issues/. You can open an issue there, paste the blender log, dmesg, and kernel config, and give the link to this bug ticket as reference.

Also, read through https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces, build a blender with debug info (CXXFLAGS="-Og -ggdb") and use gdb to get backtrace.

Looking at the blender log, the last output before the hang is "Added device ...". Go to the blender source code (you can keep it after emerge by using FEATURES="keepwork") and use a tool like grep (I use `ag "Added device"`) to locate the code: it's at intern/cycles/device/hip/device.cpp:201. The hang happens after this line, so the breakpoint should be put here:

1. gdb --args <command for launching blender>
2. break intern/cycles/device/hip/device.cpp:201
3. run

After hitting the breakpoint, use `n` to step line by line and monitor dmesg until you hit the issue. That way you locate the code that causes the hang and can provide that important piece of information to upstream (HIP, blender).
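
The search step above can also be sketched in Python for cases where grep or ag is not at hand. This is a minimal illustration: `find_string` and `gdb_breakpoints` are hypothetical helper names, the file suffixes are an assumption, and the printed lines are ordinary gdb `break file:line` commands with the path given relative to the search root.

```python
import os


def find_string(root, needle, suffixes=(".cpp", ".h")):
    """Yield (path, line_number, line) for every source line containing needle."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, start=1):
                    if needle in line:
                        yield path, lineno, line.rstrip("\n")


def gdb_breakpoints(root, needle):
    """Return gdb 'break file:line' commands for each match, relative to root."""
    return [
        f"break {os.path.relpath(path, root)}:{lineno}"
        for path, lineno, _line in find_string(root, needle)
    ]
```

Calling `gdb_breakpoints(source_dir, "Added device")` on the unpacked blender sources would give candidate breakpoint locations to paste into gdb; the exact line number depends on the blender version.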

Comment 19 tigrmango 2023-12-24 12:44:31 UTC

(In reply to Yiyang Wu from comment #18)

I need the HIP acceleration and will try to figure out what causes the hang for me and then report it, I guess. Thank you very much for your help and support!

Comment 20 Yiyang Wu 2023-12-24 13:01:54 UTC

(In reply to tigrmango from comment #19)
> I need the HIP acceleration and will try to figure out what causes the hang
> for me and then report it, I guess. Thank you very much for your help and
> support!

(In reply to Yiyang Wu from comment #16)
> The amdgpu related differences between your config and mine are:
> 
> 1. I use CONFIG_DRM_AMDGPU=y (shouldn't be a problem; most people in the
> world use CONFIG_DRM_AMDGPU=m)
> 2. I don't set CONFIG_HSA_AMD_SVM=y
> 

Have you tried to unset CONFIG_HSA_AMD_SVM?

Comment 21 tigrmango 2023-12-24 13:31:02 UTC

As far as I can tell, the crash is happening somewhere in python(?), because the last available call is somewhere in libpython, before a really long sequence of "Cannot find bounds of current function".

Comment 22 tigrmango 2023-12-24 13:32:05 UTC

(In reply to Yiyang Wu from comment #20)
> Have you tried to unset CONFIG_HSA_AMD_SVM?

I have not tried that yet. I'll try to find the crash spot in gdb and then recompile the kernel without this option.

Comment 23 tigrmango 2023-12-24 14:31:16 UTC

Recompiled the kernel with CONFIG_HSA_AMD_SVM=n, still crashing and bringing down my GPU. Are there any other differences between our configs?

Comment 24 Yiyang Wu 2023-12-24 14:48:33 UTC

(In reply to tigrmango from comment #21)
> As far as I can tell, the crash is happening somewhere in python(?) because
> last available call is somewhere in libpython, before a really long sequence
> of "Cannot find bounds of current function"

Did you try to recompile blender with debug info on and optimization off, as mentioned in https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces#Compiler_flags?

Use CXXFLAGS="-Og -ggdb" and FEATURES="keepwork splitdebug" when emerging (keepwork means you need to manually clean the build directory at ${PORTAGE_TMPDIR} before next round of emerge)

Comment 25 Yiyang Wu 2023-12-24 14:54:53 UTC

(In reply to tigrmango from comment #23)
> Recompiled the kernel with CONFIG_HSA_AMD_SVM=n, still crashing and bringing
> down my GPU. Are there any other differences between our configs?

Well there are many others, but I don't think they are relevant. I will upload mine.

Comment 26 Yiyang Wu 2023-12-24 14:55:27 UTC

Created attachment 880286 [details]
Kernel config 6.5.10 (no crash issue observed)

Comment 27 tigrmango 2023-12-24 15:06:11 UTC

(In reply to Yiyang Wu from comment #24)
> (In reply to tigrmango from comment #21)
> > As far as I can tell, the crash is happening somewhere in python(?) because
> > last available call is somewhere in libpython, before a really long sequence
> > of "Cannot find bounds of current function"
> 
> Did you try to recompile blender with debug info on and optimization off,
> mentioned in
> https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces#Compiler_flags?
> 
> Use CXXFLAGS="-Og -ggdb" and FEATURES="keepwork splitdebug" when emerging
> (keepwork means you need to manually clean the build directory at
> ${PORTAGE_TMPDIR} before next round of emerge)

Yes, I did all of that except FEATURES="nostrip"

Comment 28 tigrmango 2023-12-25 03:03:41 UTC

(In reply to Yiyang Wu from comment #26)
> Created attachment 880286 [details]
> Kernel config 6.5.10 (no crash issue observed)

Tried this config; the same crash is still observed.