Bug 927085 - >=sys-kernel/gentoo-sources-6.7.6: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://gitlab.freedesktop.org/drm/am...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-15 19:22 UTC by Michael Mair-Keimberger (iamnr3)
Modified: 2024-03-18 17:55 UTC
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Attachments
dmesg.log (dmesg.log,135.29 KB, text/x-log)
2024-03-15 19:22 UTC, Michael Mair-Keimberger (iamnr3)
Description Michael Mair-Keimberger (iamnr3) 2024-03-15 19:22:03 UTC
Created attachment 887710
dmesg.log

Hi,

I'm having trouble with >=sys-kernel/gentoo-sources-6.7.6. Since this version, there is no GPU output at all when the system starts (not even a single line of kernel boot messages). However, the system boots normally and I can connect via SSH. Furthermore, radeontop reports 100% memory utilization and I can hear the fan spinning up to full speed.

This has been happening since 6.7.6. 6.7.5 is still fine and is what I'm using at the moment. I've also tested 6.7.7 and, today, 6.8.0. All of them have the same problem.
Some change between 6.7.5 and 6.7.6 must be causing these errors.

I've also saved the dmesg output from 6.8.0 and found the following errors:

[    0.859294] [drm] DMUB hardware initialized: version=0x07002600
[    1.367857] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[    1.367866] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000fffff017f000 from client 10
[    1.367870] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B53
[    1.367873] amdgpu 0000:03:00.0: amdgpu: 	Faulty UTCL2 client ID: CPC (0x5)
[    1.367877] amdgpu 0000:03:00.0: amdgpu: 	MORE_FAULTS: 0x1
[    1.367879] amdgpu 0000:03:00.0: amdgpu: 	WALKER_ERROR: 0x1
[    1.367881] amdgpu 0000:03:00.0: amdgpu: 	PERMISSION_FAULTS: 0x5
[    1.367883] amdgpu 0000:03:00.0: amdgpu: 	MAPPING_ERROR: 0x1
[    1.367885] amdgpu 0000:03:00.0: amdgpu: 	RW: 0x1
[    1.367891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[    1.367895] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000fffff017f000 from client 10
[    1.367898] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[    1.367900] amdgpu 0000:03:00.0: amdgpu: 	Faulty UTCL2 client ID: CB/DB (0x0)
[    1.367903] amdgpu 0000:03:00.0: amdgpu: 	MORE_FAULTS: 0x0
[    1.367906] amdgpu 0000:03:00.0: amdgpu: 	WALKER_ERROR: 0x0
[    1.367908] amdgpu 0000:03:00.0: amdgpu: 	PERMISSION_FAULTS: 0x0
[    1.367910] amdgpu 0000:03:00.0: amdgpu: 	MAPPING_ERROR: 0x0
[    1.367912] amdgpu 0000:03:00.0: amdgpu: 	RW: 0x0
[    1.368699] [drm] kiq ring mec 3 pipe 1 q 0
[    1.556561] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper] *ERROR* ring mes_kiq_3.1.0 test failed (-110)
[    1.556574] [drm:amdgpu_gfx_enable_kcq] *ERROR* KCQ enable failed
[    1.556578] [drm:amdgpu_device_init] *ERROR* hw_init of IP block <gfx_v11_0> failed -110
[    1.556580] tsc: Refined TSC clocksource calibration: 4192.123 MHz
[    1.556583] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    1.556587] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[    1.556588] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x3c6d509c481, max_idle_ns: 440795405039 ns
[    1.556590] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[    1.556870] clocksource: Switched to clocksource tsc
[    1.556878] [drm] DSC precompute is not needed.
[    1.556895] ------------[ cut here ]------------
[    1.556895] WARNING: CPU: 17 PID: 1 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:622 amdgpu_irq_put+0x41/0x70
[    1.556900] Modules linked in:
[    1.556902] CPU: 17 PID: 1 Comm: swapper/0 Not tainted 6.8.0-gentoo #1
[    1.556904] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 1654 08/25/2023
[    1.556905] RIP: 0010:amdgpu_irq_put+0x41/0x70
[    1.556907] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 0f eb 92 00 e9 6f fd ff ff <0f> 0b b8 ea ff ff ff e9 fe ea 92 00 b8 ea ff ff ff e9 f4 ea 92 00
[    1.556908] RSP: 0018:ffffbdb540097c28 EFLAGS: 00010246
[    1.556909] RAX: ffff9f9c037d00c0 RBX: ffff9f9c06498778 RCX: 0000000000000000
[    1.556910] RDX: 0000000000000000 RSI: ffff9f9c064a4d88 RDI: ffff9f9c06480000
[    1.556911] RBP: ffff9f9c064901e8 R08: ffff9f9c04d4a200 R09: ffffffffae8c3b00
[    1.556912] R10: ffff9f9c00042a00 R11: ffff9f9c02528200 R12: ffff9f9c064905c8
[    1.556912] R13: ffff9f9c06480010 R14: ffff9f9c06480000 R15: ffff9f9c064a4d88
[    1.556913] FS:  0000000000000000(0000) GS:ffff9fab18640000(0000) knlGS:0000000000000000
[    1.556914] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.556915] CR2: 0000000000000000 CR3: 00000004d0c16000 CR4: 0000000000750ef0
[    1.556916] PKRU: 55555554
[    1.556916] Call Trace:
[    1.556918]  <TASK>
[    1.556919]  ? __warn+0x6f/0xd0
[    1.556922]  ? amdgpu_irq_put+0x41/0x70
[    1.556923]  ? report_bug+0x14b/0x1a0
[    1.556926]  ? handle_bug+0x3a/0x70
[    1.556929]  ? exc_invalid_op+0x17/0x70
[    1.556931]  ? asm_exc_invalid_op+0x1a/0x20
[    1.556934]  ? drm_atomic_state_default_clear+0x60/0x2d0
[    1.556937]  ? amdgpu_irq_put+0x41/0x70
[    1.556938]  ? srso_alias_return_thunk+0x5/0xfbef5
[    1.556939]  amdgpu_fence_driver_hw_fini+0xf9/0x130
[    1.556942]  amdgpu_device_fini_hw+0xa1/0x400
[    1.556943]  ? blocking_notifier_chain_unregister+0x49/0xb0
[    1.556946]  amdgpu_driver_load_kms+0xe1/0x170
[    1.556948]  amdgpu_pci_probe+0x140/0x460
[    1.556951]  local_pci_probe+0x3d/0x90
[    1.556955]  pci_device_probe+0xab/0x180
[    1.556957]  really_probe+0xbb/0x2e0
[    1.556960]  ? __pfx___driver_attach+0x10/0x10
[    1.556961]  __driver_probe_device+0x6e/0x110
[    1.556963]  driver_probe_device+0x1a/0xe0
[    1.556965]  __driver_attach+0x83/0x180
[    1.556966]  bus_for_each_dev+0x80/0xd0
[    1.556968]  bus_add_driver+0xe7/0x1f0
[    1.556970]  driver_register+0x54/0x100
[    1.556972]  ? __pfx_amdgpu_init+0x10/0x10
[    1.556974]  do_one_initcall+0x81/0x1f0
[    1.556977]  kernel_init_freeable+0x194/0x220
[    1.556980]  ? __pfx_kernel_init+0x10/0x10
[    1.556981]  kernel_init+0x15/0x1b0
[    1.556983]  ret_from_fork+0x2c/0x50
[    1.556985]  ? __pfx_kernel_init+0x10/0x10
[    1.556987]  ret_from_fork_asm+0x1b/0x30
[    1.556990]  </TASK>
[    1.556990] ---[ end trace 0000000000000000 ]---
[    1.556997] ------------[ cut here ]------------

These crashes happen a few more times afterwards. I've also attached the full dmesg.log.
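
As a cross-check on the first page fault in the log above, the raw GCVM_L2_PROTECTION_FAULT_STATUS value 0x00040B53 can be split into the fields the driver prints below it. The following is a minimal Python sketch, not the driver's code: the bit positions in FIELDS are an assumption (they are not stated anywhere in this report), chosen because they reproduce the decoded values shown in the log, and the function name is hypothetical.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS as (shift, width).
# This layout is an assumption, not taken from the bug report; it is used
# here because it reproduces the decoded fields printed in the dmesg log.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a raw status value into named bit fields (hypothetical helper)."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value from the first page fault in the 6.8.0 log.
    for name, value in decode_fault_status(0x00040B53).items():
        print(f"{name}: {value:#x}")
```

Under that assumed layout, the decoded fields agree with the log lines for the first fault (client ID 0x5, PERMISSION_FAULTS 0x5, and the remaining fields 0x1), and a status of 0x00000000 decodes to all zeros, as in the second fault.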


Comparing this with the dmesg output from 6.7.5, I see the following:
[    0.859537] [drm] DMUB hardware initialized: version=0x07002400

I don't know if that has something to do with this, but the reported DMUB version changed (0x07002400 on 6.7.5 vs. 0x07002600 on 6.8.0)...
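
The same comparison can be done mechanically across two saved logs. This is a small Python sketch under stated assumptions: the helper name dmub_version is hypothetical, the regular expression only matches the message format seen in the two lines quoted in this report, and the result says nothing about whether the version difference is related to the fault.

```python
import re

# Matches the message format seen in the two dmesg lines quoted in this report.
DMUB_RE = re.compile(r"DMUB hardware initialized: version=(0x[0-9A-Fa-f]+)")

# The two lines quoted in this report (6.7.5 and 6.8.0 respectively).
LINE_6_7_5 = "[    0.859537] [drm] DMUB hardware initialized: version=0x07002400"
LINE_6_8_0 = "[    0.859294] [drm] DMUB hardware initialized: version=0x07002600"


def dmub_version(dmesg_text):
    """Return the first reported DMUB version as an int, or None if absent."""
    match = DMUB_RE.search(dmesg_text)
    return int(match.group(1), 16) if match else None


if __name__ == "__main__":
    old, new = dmub_version(LINE_6_7_5), dmub_version(LINE_6_8_0)
    print(f"6.7.5: {old:#010x}  6.8.0: {new:#010x}  changed: {old != new}")
```

To use it on full logs, pass the contents of each saved dmesg file to dmub_version instead of the single quoted lines.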



My guess now is that the Gentoo patches in gentoo-sources wouldn't have any influence on amdgpu and I should open a bug at bugzilla.kernel.org?
Comment 1 Mike Pagano gentoo-dev 2024-03-15 21:42:51 UTC
(In reply to Michael Mair-Keimberger (iamnr3) from comment #0)

> 
> My guess now is that the Gentoo patches in gentoo-sources wouldn't have
> any influence on amdgpu and I should open a bug at bugzilla.kernel.org?

I think https://gitlab.freedesktop.org/drm/amd/-/issues would be more appropriate.
Comment 2 Michael Mair-Keimberger (iamnr3) 2024-03-16 11:26:31 UTC
(In reply to Mike Pagano from comment #1)
> (In reply to Michael Mair-Keimberger (iamnr3) from comment #0)
> 
> > 
> > My guess now is that the Gentoo patches in gentoo-sources wouldn't have
> > any influence on amdgpu and I should open a bug at bugzilla.kernel.org?
> 
> I think https://gitlab.freedesktop.org/drm/amd/-/issues would be more
> appropriate.

Thanks for pointing that out. I've now filed a bug report there.
Comment 3 Mike Pagano gentoo-dev 2024-03-16 21:52:10 UTC
Thanks, we'll keep an eye on the upstream bug and backport any identified fixes.
Comment 4 Michael Mair-Keimberger (iamnr3) 2024-03-18 17:55:46 UTC
(In reply to Mike Pagano from comment #3)
> Thanks, we'll keep an eye on the upstream bug and backport any identified
> fixes.

@Mike: This issue is resolved. Alex Deucher pointed me to the culprit of the problem (which was a user problem.. ;) ). Sorry for the noise.