Bug 953844 - amdgpu crashes for seemingly no reason
Summary: amdgpu crashes for seemingly no reason
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal major
Assignee: Gentoo Linux bug wranglers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-04-14 14:39 UTC by Bob
Modified: 2025-04-14 20:21 UTC
CC List: 2 users

See Also:
Package list:
Runtime testing required: ---


Description Bob 2025-04-14 14:39:21 UTC
Please see the following:

[46054.947725] ------------[ cut here ]------------                                                                                           
[46054.947726] WARNING: CPU: 15 PID: 2686 at drivers/gpu/drm/amd/amdgpu/../display/dc/hubbub/dcn31/dcn31_hubbub.c:151 dcn31_program_compbuf_size+0xd1/0x230 [amdgpu]
[46054.947928] Modules linked in: fuse amdgpu 8021q garp mrp vfat fat binfmt_misc mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76 mac80211 amd_atl intel_rapl_msr intel_rapl_common snd_hda_codec_hdmi snd_usb_audio snd_hda_intel kvm_amd btusb vfio_pci vfio_pci_core snd_intel_dspcfg snd_hda_codec vfio_iommu_type1 btrtl btintel kvm vfio libarc4 btbcm btmtk snd_usbmidi_lib snd_hda_core cfg80211 bluetooth snd_ump amdxcp i2c_algo_bit snd_rawmidi drm_ttm_helper asus_nb_wmi eeepc_wmi asus_wmi snd_hwdep snd_pcm ttm sparse_keymap wmi_bmof platform_profile drm_exec gpu_sched snd_timer drm_suballoc_helper igc drm_buddy snd rapl drm_display_helper i2c_piix4 pcspkr video mc k10temp i2c_smbus rfkill soundcore wmi gpio_amdpt gpio_generic dm_crypt nvme ccp ucsi_ccg nvme_core typec_ucsi typec sp5100_tco
[46054.947980] CPU: 15 UID: 1000 PID: 2686 Comm: sway Not tainted 6.12.16-gentoo-gentoo-dist #2                 
[46054.947982] Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI, BIOS 3222 03/05/2025
[46054.947983] RIP: 0010:dcn31_program_compbuf_size+0xd1/0x230 [amdgpu]
[46054.948145] Code: 00 48 8b 43 28 8b 88 b0 01 00 00 48 8b 43 20 0f b6 50 6c 48 8b 43 18 8b b0 14 01 00 00 e8 e7 45 0e 00 85 c0 0f 85 33 01 00 00 <0f> 0b 48 8b 44 24 08 65 48 2b 04 25 28 00 00 00 0f 85 35 01 00 00
[46054.948146] RSP: 0018:ffffbb8480bbf618 EFLAGS: 00010202
[46054.948148] RAX: 0000000000000001 RBX: ffff9842c83ec000 RCX: 000000000000001f
[46054.948149] RDX: 0000000000000000 RSI: 000000000000398b RDI: ffff984331f80000
[46054.948150] RBP: 0000000000000004 R08: ffffbb8480bbf61c R09: 000000000000000d
[46054.948151] R10: ffffffffb5514028 R11: 0000000000000003 R12: ffff98431c9c0000
[46054.948152] R13: ffff984332800000 R14: ffff9842c83ec000 R15: 0000000000000001
[46054.948153] FS:  00007fcf46149a00(0000) GS:ffff9849fe780000(0000) knlGS:0000000000000000
[46054.948155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46054.948156] CR2: 000055a1e816b1d0 CR3: 00000001eb104000 CR4: 0000000000f50ef0
[46054.948157] PKRU: 55555554
[46054.948158] Call Trace:
[46054.948160]  <TASK>
[46054.948161]  ? dcn31_program_compbuf_size+0xd1/0x230 [amdgpu]
[46054.948297]  ? __warn.cold+0x93/0xf0
[46054.948300]  ? dcn31_program_compbuf_size+0xd1/0x230 [amdgpu]
[46054.948420]  ? report_bug+0xff/0x140
[46054.948423]  ? handle_bug+0x58/0x90
[46054.948425]  ? exc_invalid_op+0x17/0x70
[46054.948427]  ? asm_exc_invalid_op+0x1a/0x20
[46054.948431]  ? dcn31_program_compbuf_size+0xd1/0x230 [amdgpu]
[46054.948544]  ? dcn31_program_compbuf_size+0xc9/0x230 [amdgpu]
[46054.948655]  dcn20_optimize_bandwidth+0xe4/0x220 [amdgpu]
[46054.948814]  dc_commit_state_no_check+0xc5b/0xeb0 [amdgpu]
[46054.948960]  dc_commit_streams+0x31f/0x420 [amdgpu]
[46054.949099]  amdgpu_dm_atomic_commit_tail+0x65d/0x3a80 [amdgpu]
[46054.949265]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949268]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949270]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949271]  ? amdgpu_dm_atomic_check+0x15df/0x17c0 [amdgpu]
[46054.949417]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949419]  ? wait_for_completion_timeout+0x13b/0x170
[46054.949421]  ? wait_for_completion_interruptible+0x12d/0x1e0
[46054.949424]  commit_tail+0x91/0x130
[46054.949426]  drm_atomic_helper_commit+0x11a/0x140
[46054.949428]  drm_atomic_commit+0xa6/0xe0
[46054.949431]  ? __pfx___drm_printfn_info+0x10/0x10
[46054.949433]  drm_mode_atomic_ioctl+0xa73/0xcb0
[46054.949437]  ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
[46054.949439]  drm_ioctl_kernel+0xad/0x100
[46054.949442]  drm_ioctl+0x277/0x4d0
[46054.949444]  ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
[46054.949448]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
[46054.949564]  __x64_sys_ioctl+0x91/0xd0
[46054.949567]  do_syscall_64+0x82/0x190
[46054.949570]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949571]  ? __count_memcg_events+0x53/0xf0
[46054.949573]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949574]  ? count_memcg_events.constprop.0+0x1a/0x30
[46054.949576]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949577]  ? handle_mm_fault+0x1bb/0x2c0
[46054.949579]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949581]  ? do_user_addr_fault+0x36c/0x620
[46054.949583]  ? srso_alias_return_thunk+0x5/0xfbef5
[46054.949584]  ? exc_page_fault+0x7e/0x180
[46054.949586]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[46054.949588] RIP: 0033:0x7fcf46c9120f
[46054.949590] Code: 00 48 89 44 24 18 31 c0 c7 04 24 10 00 00 00 48 8d 44 24 60 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[46054.949591] RSP: 002b:00007ffdad034f50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[46054.949593] RAX: ffffffffffffffda RBX: 000055a1e7177050 RCX: 00007fcf46c9120f
[46054.949594] RDX: 00007ffdad035000 RSI: 00000000c03864bc RDI: 000000000000000b
[46054.949595] RBP: 00007ffdad035000 R08: 0000000000000007 R09: 0000000000000002
[46054.949596] R10: 0000000000000003 R11: 0000000000000246 R12: 00000000c03864bc
[46054.949597] R13: 000000000000000b R14: 000055a1e7f5e720 R15: 000055a1e88a3b10
[46054.949600]  </TASK>
[46054.949600] ---[ end trace 0000000000000000 ]---


Reproducible: Always

Steps to Reproduce:
1. Just wait long enough
Actual Results:  
Crash

Expected Results:  
Not crash

I have two AMD GPUs. One, the iGPU built into the Ryzen 7700, is active and in use. The other is a discrete GPU (6900 XT) that is bound to vfio-pci for passthrough to a VM.
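
One way to confirm which kernel driver each card is actually bound to is to read the `driver` symlink that sysfs exposes for every PCI device. The following is only a minimal sketch: the helper name `bound_driver` and the PCI address in the usage note are hypothetical placeholders, not values taken from this system.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    pci_address is a full address such as "0000:03:00.0" (placeholder).
    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real system the default should be used.
    """
    link = os.path.join(sysfs_root, pci_address, "driver")
    # An unbound device has no "driver" symlink at all.
    if not os.path.islink(link):
        return None
    # The symlink points at .../bus/pci/drivers/<driver-name>.
    return os.path.basename(os.readlink(link))
```

For example, `bound_driver("0000:03:00.0")` would be expected to return `"vfio-pci"` for the passthrough card and `"amdgpu"` for the iGPU, assuming the addresses are replaced with the ones reported by `lspci -D`.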

The crash eventually occurs whether or not the VM that uses the passthrough device is running.

Probably related: given more time, the system also reboots on its own.
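
When searching other trackers for matching reports, the useful search keys are in the WARNING line of the trace above: the source file, line number, function, and module. The following is a minimal sketch for pulling those out of a dmesg line. The regular expression and the `parse_warning` helper are illustrative assumptions, written against the format seen in this trace only, and may not match every kernel version's output.

```python
import posixpath
import re

# Matches lines of the form seen in this report:
#   WARNING: CPU: <n> PID: <n> at <file>:<line> <func>+0x<off>/0x<size> [<module>]
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_warning(log_line):
    """Return a dict describing a kernel WARNING line, or None if no match."""
    match = WARN_RE.search(log_line)
    if match is None:
        return None
    info = match.groupdict()
    # Collapse "amdgpu/../display" style segments into a canonical path.
    info["file"] = posixpath.normpath(info["file"])
    info["line"] = int(info["line"])
    return info
```

Feeding it the WARNING line from this report yields the function `dcn31_program_compbuf_size` and the normalized path `drivers/gpu/drm/amd/display/dc/hubbub/dcn31/dcn31_hubbub.c`, both of which are reasonable search terms.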
Comment 1 Ionen Wolkens gentoo-dev 2025-04-14 14:54:10 UTC
I don't think CC'ing me makes sense; I don't handle anything related to amdgpu, nor do I keep up with its issues.
Comment 2 Enne Eziarc 2025-04-14 16:44:37 UTC
You'll probably get a better answer on FDO's bug tracker (there seem to be a few reports like this there already):
https://gitlab.freedesktop.org/drm/amd/-/issues/

From a cursory glance, that's power management code being triggered by what looks like a spurious wakeup/hotplug event. There's a known problem where flaky DP connections are enough to crash host software (though it's usually the compositor, not the kernel).
Comment 3 Mike Gilbert gentoo-dev 2025-04-14 20:21:46 UTC
Please seek help in support channels. We can't help you here.