790566 – sys-kernel/linux-firmware-20210511: AMDGPU broken

Status:	RESOLVED OBSOLETE

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Chí-Thanh Christopher Nguyễn

URL:	https://lists.freedesktop.org/archive...
Whiteboard:
Keywords:

Duplicates (1):	790683
Depends on:
Blocks:

Reported:	2021-05-16 17:50 UTC by Maciej Barć
Modified:	2022-01-31 01:25 UTC
CC List:	6 users

See Also:
Package list:
Runtime testing required:	---

Attachments
emerge --info (emerge_info.txt,10.88 KB, text/plain) 2021-05-16 17:51 UTC, Maciej Barć
kernel errors 2012-05-15 (errors_sat_may_15_072825_pm_cest_2021.log,14.21 KB, text/x-log) 2021-05-16 17:52 UTC, Maciej Barć
config - 5.12.4-gentoo-magentalane-v0.2.7 (config_5.12.4-gentoo-magentalane-v0.2.7,182.84 KB, text/plain) 2021-05-16 17:55 UTC, Maciej Barć
glxinfo -B (glxinfo_brief.txt,1.76 KB, text/plain) 2021-05-16 20:00 UTC, Maciej Barć
dmesg output when graphics crashes (dmesg.txt,26.22 KB, text/plain) 2021-05-17 19:19 UTC, tomtom69
dmesg sys-kernel/linux-firmware-20210518 (dmesg-20210518.txt,65.16 KB, text/plain) 2021-06-11 07:19 UTC, Stefan de Konink
dmesg of chromium errors (more-errors.txt,65.32 KB, text/plain) 2021-06-13 17:33 UTC, Stefan de Konink
dmesg of chromium errors (5.13.1/20210629) (chromium-5.13.1-dmesg.txt,65.38 KB, text/plain) 2021-07-11 09:51 UTC, Stefan de Konink

Description Maciej Barć gentoo-dev

2021-05-16 17:50:42 UTC

So, my main machine is an AMD laptop, an "81NC Lenovo IdeaPad S340-15API", and recently some breakages started happening for me. About 1 hour after boot, while I was using a KDE desktop, the GUI would freeze. Sometimes it would be possible to move the mouse, but the rest would be frozen. The screen may start blinking or go black.

I'm not sure if this is my kernel, firmware or the hardware.
I don't understand dmesg output, which is why I'm guessing, but I think it is the firmware,
since this behaviour started around 2021-05-15.
From my portage logs I see that I updated my firmware on 2021-05-14 at 18:16:06.

So the breakages started with kernel 5.10.27 and FW 20210511.
After the breakage I jumped to an older kernel, 5.4.97, and compiled 5.12.4.
I didn't notice a breakage on 5.4.97, but the system only ran for ~40 minutes.
So I booted into 5.12.4, where I was for ~1h before it broke.
So I booted into 5.4.97 again and downgraded my FW.
While I'm writing this I'm booted into kernel 5.12.4 with FW 20210315.

My uptime is: 2 hours, 45 minutes

Comment 1 Maciej Barć gentoo-dev

2021-05-16 17:51:52 UTC

Created attachment 709164 [details]
emerge --info

Comment 2 Maciej Barć gentoo-dev

2021-05-16 17:52:54 UTC

Created attachment 709167 [details]
kernel errors 2012-05-15

Comment 3 Maciej Barć gentoo-dev

2021-05-16 17:55:08 UTC

Created attachment 709170 [details]
config - 5.12.4-gentoo-magentalane-v0.2.7

Comment 4 Maciej Barć gentoo-dev

2021-05-16 17:55:50 UTC

I use rsyslog. Any parts of syslog that I should provide?

Comment 5 Maciej Barć gentoo-dev

2021-05-16 19:57:48 UTC

lspci -k -v

04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Picasso
        Flags: bus master, fast devsel, latency 0, IRQ 63, IOMMU group 11
        Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Memory at c0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at 1000 [size=256]
        Memory at c0800000 (32-bit, non-prefetchable) [size=512K]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

Comment 6 Maciej Barć gentoo-dev

2021-05-16 19:58:54 UTC

lshw -numeric -C display

  *-display
       description: VGA compatible controller
       product: Picasso [1002:15D8]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI] [1002]
       physical id: 0
       bus info: pci@0000:04:00.0
       version: c2
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi msix vga_controller bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: irq:63 memory:b0000000-bfffffff memory:c0000000-c01fffff ioport:1000(size=256) memory:c0800000-c087ffff memory:c0000-dffff

Comment 7 Maciej Barć gentoo-dev

2021-05-16 20:00:37 UTC

Created attachment 709185 [details]
glxinfo -B

Comment 8 Thomas Deutschmann (RETIRED) gentoo-dev

2021-05-16 21:13:57 UTC

Thank you for letting us know but there is not much we can do for you: Please report upstream on your own and update this bug report with a link to your bug report/mail to LKML.

Comment 9 tomtom69 2021-05-17 19:18:10 UTC

Same here on a Ryzen 3350G PRO system (GPU family is also "picasso"). Since the last system update (not many packages, but including linux-firmware to 20210511), the system hangs or goes back to the login screen after 1-3 hours of normal usage.
Kernel version is 5.10.27.
I'll try to downgrade linux-firmware to 20210315 and test to be sure whether this is the cause or not.

Comment 10 tomtom69 2021-05-17 19:19:04 UTC

Created attachment 709455 [details]
dmesg output when graphics crashes

Comment 11 Maciej Barć gentoo-dev

2021-05-18 14:16:39 UTC

Another case:
https://lists.freedesktop.org/archives/amd-gfx/2021-May/063852.html

Comment 12 Ionen Wolkens gentoo-dev

2021-05-19 20:40:33 UTC

*** Bug 790683 has been marked as a duplicate of this bug. ***

Comment 13 Stefan de Konink 2021-06-11 07:19:52 UTC

Created attachment 715227 [details]
dmesg sys-kernel/linux-firmware-20210518

Problem remains with sys-kernel/linux-firmware-20210518 and sys-kernel/gentoo-sources-5.12.10.

Comment 14 tomtom69 2021-06-12 08:19:23 UTC

There is an upstream patch available which should fix this
https://patchwork.freedesktop.org/patch/433701/
But I found this patch included in 5.12.10, so there may be another issue.

Comment 15 Stefan de Konink 2021-06-12 08:45:42 UTC

(In reply to tomtom69 from comment #14)
> There is an upstream patch available which should fix this
> https://patchwork.freedesktop.org/patch/433701/

I am curious, does that mean that functionality that was previously available on this hardware is now disabled, as a form of planned obsolescence?


> But I found this patch included in 5.12.10, so there may be another issue.

Likely.

Comment 16 Stefan de Konink 2021-06-13 17:33:02 UTC

Created attachment 715743 [details]
dmesg of chromium errors

[ 4233.080397] amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32774, for process chrome pid 2369 thread chrome:cs0 pid 2393)
[ 4233.080415] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800114000000 from client 27
[ 4233.080427] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
[ 4233.080431] amdgpu 0000:05:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
[ 4233.080435] amdgpu 0000:05:00.0: amdgpu:      MORE_FAULTS: 0x1
[ 4233.080438] amdgpu 0000:05:00.0: amdgpu:      WALKER_ERROR: 0x0
[ 4233.080440] amdgpu 0000:05:00.0: amdgpu:      PERMISSION_FAULTS: 0x5
[ 4233.080442] amdgpu 0000:05:00.0: amdgpu:      MAPPING_ERROR: 0x0
[ 4233.080444] amdgpu 0000:05:00.0: amdgpu:      RW: 0x1

These kinds of errors may be completely unrelated, but they cause a full stop. Thankfully the system does recover.
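
For anyone reading these dumps: the per-field lines under VM_L2_PROTECTION_FAULT_STATUS can be recomputed from the raw register value. The following is a minimal Python sketch, assuming the bit layout used by the gfx9-era amdgpu register headers (the layout is not stated in this bug, so treat the FIELDS table as an assumption). decode_fault_status is a hypothetical helper name.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS (gfx9-era amdgpu headers).
# Each entry maps a field name to (shift, width in bits).
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # the "Faulty UTCL2 client ID" number
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value taken from the dmesg excerpt in this comment.
    for name, value in decode_fault_status(0x00341051).items():
        print(f"{name}: {value:#x}")
```

With this layout, 0x00341051 decodes to the same MORE_FAULTS, PERMISSION_FAULTS, client ID and RW values that the driver printed above, which suggests the assumed layout fits this hardware.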

Comment 17 Stefan de Konink 2021-07-11 09:51:44 UTC

Created attachment 723235 [details]
dmesg of chromium errors (5.13.1/20210629)

Issues with 5.13.1/20210629 and Chromium remain.

Comment 18 Maciej Barć gentoo-dev

2021-07-11 12:28:35 UTC

(In reply to Stefan de Konink from comment #17)
> Created attachment 723235 [details]
> dmesg of chromium errors (5.13.1/20210629)
> 
> Issues with 5.13.1/20210629 and Chromium remain.

Thanks for testing. I was not able to test the latest version because I haven't had time to fight those bugs (especially because they are so annoying).
Still on 20210315.

Comment 19 Stefan de Konink 2021-07-11 12:31:56 UTC

I switched back as well. I still have to confirm that it would be possible to test this change without recompiling the kernel.

Comment 20 Andrew Aladjev 2021-08-03 09:27:31 UTC

For now we have to mask:

# amdgpu
=sys-kernel/linux-firmware-20210315
=sys-kernel/linux-firmware-20210518
=sys-kernel/linux-firmware-20210629

=sys-kernel/linux-firmware-20210208 was the last good firmware for amdgpu.

Comment 21 Stefan de Konink 2021-08-03 09:33:43 UTC

(In reply to Andrew Aladjev from comment #20)
> For now we have to mask:
> 
> # amdgpu
> =sys-kernel/linux-firmware-20210315

What issues have you experienced with 20210315? I "barely" have issues with this one. The issues that I still have on my platform are suspend-resume and a reproducible kernel panic at powerdown when the device woke up from a cold suspend.

I wonder if this is related to the OpenCL issues (I experienced them with tesseract last year) or if it is a general issue with Raven Ridge (which is now unsupported upstream for ROCm).
 https://bugs.gentoo.org/764605

Comment 22 Andrew Aladjev 2021-08-03 11:00:21 UTC

I added a mask file for amdgpu several months ago when I got a random hang (just "ring gfx timeout" without additional info) with the old kernel 5.10 and firmware 20210315. Then I upgraded the firmware to 20210511 and the kernel to 5.12 and got a reproducible hang (VM_L2_PROTECTION_FAULT_STATUS + "ring gfx timeout"), so I added 20210511 to the mask file; same thing for 20210629. So firmware 20210208 is the island of "stability".

If you want a stable GPU, then please do not use amdgpu (at least for now). Please review this issue: https://gitlab.freedesktop.org/drm/amd/-/issues/892. That issue is the volcano of the amdgpu Linux user "experience". You can grep the Linux sources for "TIMEOUT_FOR_FLIP_PENDING", find the "dcn20_hwseq.c" file and review the quality of the "code" around it. You will immediately feel what "dcn20_pipe_control_lock", "dcn20_enable_stream_timing", "dcn20_update_dchubp_dpp", "dcn20_enable_plane" and "dcn20_update_mpcc" smell like. This code is experimental; it was not designed to be stable and it won't become stable. This code should be rewritten completely by AMD core developers. This rewrite may happen in the next 5 years.

If you want a stable GPU for now, then use radeon (the famous r600/r700/etc.) <= GCN 1.0.

Comment 23 Andrew Aladjev 2021-08-03 15:05:05 UTC

See also https://bugzilla.kernel.org/show_bug.cgi?id=205169

Comment 24 tomtom69 2021-08-13 16:43:38 UTC

Some GPU firmware files (*sdma.bin) have now been reverted upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=d843e520a4b0d92b986645548d11ade3b9b239a4
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=99d72504bff7ab40c261b8509c0b9d8abf98b296
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=d7b50e61669dc137924337d03d09b8986eb752a3

I also found out only the picasso_sdma.bin file from newer versions caused the issue here. So I use the current firmware files and only keep picasso_sdma.bin from linux-firmware-20210315:
https://gitlab.freedesktop.org/drm/amd/-/issues/1609

Hopefully these upstream patches fix the problem for now, as soon as they arrive in the portage tree (however it is only an intermediate solution, not a real bugfix).
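
To check which firmware blobs actually differ between two linux-firmware versions before swapping a single file, something like the following Python sketch may help. It only compares SHA-256 checksums of same-named files. The directory arguments and the changed_blobs helper are hypothetical, and nothing here is taken from the upstream commits.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_blobs(old_dir: Path, new_dir: Path, pattern: str = "*sdma.bin") -> List[str]:
    """List file names matching `pattern` whose contents differ between the two trees.

    Files that exist only in one of the directories are ignored.
    """
    changed = []
    for old_file in sorted(old_dir.glob(pattern)):
        new_file = new_dir / old_file.name
        if new_file.is_file() and sha256_of(old_file) != sha256_of(new_file):
            changed.append(old_file.name)
    return changed
```

For example, changed_blobs(Path("/path/to/linux-firmware-20210315/amdgpu"), Path("/lib/firmware/amdgpu")) would report whether picasso_sdma.bin differs, where the first path is wherever an older tarball was unpacked (an assumed location).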

Comment 25 Maciej Barć gentoo-dev

2021-09-24 00:47:36 UTC

I've been running version 20210818 for 4 days now, seems the issue is gone.

Comment 26 Marc Lee Towers 2021-10-01 08:27:00 UTC

I had freezing, and a way to reproduce it was emerging qtwebengine with jumbo-build. My system would even lose a couple of minutes during the freeze, and KDE would crash, sometimes losing me OpenCL. My GPU: 05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2)

I'm finding Wayland a bit messy, but as for the fix, I'm afraid I couldn't determine whether graphics memory was being pushed to my swap file. Could protected video RAM be swapped out by accident? Regardless, the only fix I found was adding another 16 GB of RAM to the existing 8 GB, after which the issues all went away. Not exactly the best fix, but a fix all the same. Something is still not quite right and Wayland is a bit buggy, but it seems to be a lack of RAM and some sort of swap space issue on my laptop.

Comment 27 Stefan de Konink 2021-11-24 16:39:33 UTC

Currently running 5.15.2 and linux-firmware-20211027, kernel panics are back.

Comment 28 Stefan de Konink 2021-11-25 10:08:50 UTC

Using 5.15.4-gentoo, and the latest firmware. Currently not (yet) crashing.

[ 4151.697052] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 4151.697069] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e7ef000 from IH client 0x12 (VMC)
[ 4151.697079] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140450
[ 4151.697112] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: VCN (0x2)
[ 4151.697115] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 4151.697117] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 4151.697119] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
[ 4151.697122] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 4151.697124] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
[ 7513.289273] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289288] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e7ef000 from IH client 0x12 (VMC)
[ 7513.289302] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140451
[ 7513.289306] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: VCN (0x2)
[ 7513.289309] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[ 7513.289314] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.289317] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
[ 7513.289319] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.289353] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
[ 7513.289393] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289424] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289457] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140450
[ 7513.289460] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: VCN (0x2)
[ 7513.289463] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.289465] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.289468] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
[ 7513.289470] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.289473] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
[ 7513.289501] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289508] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289543] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289547] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289550] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.289593] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.289596] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.289598] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.289601] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7513.289634] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289649] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289672] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289674] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289677] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.289679] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.289681] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.289683] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.289685] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7513.289796] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289803] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289814] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289816] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289818] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.289820] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.289823] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.289825] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.289827] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7513.289883] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289888] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289903] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289906] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289909] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.289912] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.289914] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.289915] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.289917] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7513.289928] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289943] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289983] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290022] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290024] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.290026] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.290029] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.290030] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.290032] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7513.290043] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.290058] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.290068] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290071] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290074] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.290076] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.290078] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.290080] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.290081] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7513.290087] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.290092] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.290129] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290132] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290134] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.290136] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.290138] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.290140] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.290142] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7513.290196] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.290202] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.290214] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290216] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290218] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 7513.290220] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 7513.290222] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 7513.290224] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 7513.290226] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[ 7518.747834] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 7518.747846] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 7523.788343] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
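
Long dumps like this are easier to compare between kernel/firmware combinations when condensed. A rough Python sketch, assuming the "retry page fault" line format shown above stays the same; summarize_faults is a hypothetical helper, not part of any existing tool.

```python
import re
from collections import Counter

# Matches e.g. "[mmhub0] retry page fault (... for process chrome pid 1636 ...".
# The kernel timestamp "[ 4151.697052]" is skipped because it contains a space and a dot.
FAULT_RE = re.compile(r"\[(\w+)\] retry page fault \(.*?for process (\S+) pid (\d+)")


def summarize_faults(lines):
    """Count retry page faults per (hub, process name, pid)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            hub, process, pid = match.group(1), match.group(2), int(match.group(3))
            counts[(hub, process, pid)] += 1
    return counts
```

It could be fed a saved log, for instance summarize_faults(open("dmesg.txt")), and the resulting counts compared across firmware versions.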

Comment 29 Maciej Barć gentoo-dev

2022-01-31 01:25:44 UTC

Since my last report I had no problem with my GPU, now running version 20211216.
To people who had similar problems: if any other version causes problems, file reports for that version.
The version 20210511 is no longer available in the tree, closing this.