Summary: | sys-kernel/linux-firmware-20210511: AMDGPU broken | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Maciej Barć <xgqt> |
Component: | Current packages | Assignee: | Chí-Thanh Christopher Nguyễn <chithanh> |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | aladjev.andrew, brezensalzer, jstein, kernel, stefan, zerochaos |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | https://lists.freedesktop.org/archives/amd-gfx/2021-May/063759.html | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: |
emerge --info
kernel errors 2012-05-15
config - 5.12.4-gentoo-magentalane-v0.2.7
glxinfo -B
dmesg output when graphics crashes
dmesg sys-kernel/linux-firmware-20210518
dmesg of chromium errors
dmesg of chromium errors (5.13.1/20210629) |
Description
Maciej Barć
2021-05-16 17:50:42 UTC
Created attachment 709164 [details]
emerge --info
Created attachment 709167 [details]
kernel errors 2012-05-15
Created attachment 709170 [details]
config - 5.12.4-gentoo-magentalane-v0.2.7
I use rsyslog. Any parts of syslog that I should provide?

lspci -k -v
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2) (prog-if 00 [VGA controller])
    Subsystem: Lenovo Picasso
    Flags: bus master, fast devsel, latency 0, IRQ 63, IOMMU group 11
    Memory at b0000000 (64-bit, prefetchable) [size=256M]
    Memory at c0000000 (64-bit, prefetchable) [size=2M]
    I/O ports at 1000 [size=256]
    Memory at c0800000 (32-bit, non-prefetchable) [size=512K]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities: <access denied>
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

lshw -numeric -C display
  *-display
       description: VGA compatible controller
       product: Picasso [1002:15D8]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI] [1002]
       physical id: 0
       bus info: pci@0000:04:00.0
       version: c2
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi msix vga_controller bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: irq:63 memory:b0000000-bfffffff memory:c0000000-c01fffff ioport:1000(size=256) memory:c0800000-c087ffff memory:c0000-dffff

Created attachment 709185 [details]
glxinfo -B
Thank you for letting us know, but there is not much we can do for you. Please report upstream on your own and update this bug report with a link to your bug report/mail to LKML.

Same here on a Ryzen 3350G PRO system (GPU family is also "picasso"). Since the last system update (not many packages, but including linux-firmware to 20210511), the system hangs or goes back to the login screen after 1-3 hours of normal usage. Kernel version is 5.10.27. I'll try to downgrade linux-firmware to 20210315 and test to be sure whether this is the cause or not.

Created attachment 709455 [details]
dmesg output when graphics crashes
*** Bug 790683 has been marked as a duplicate of this bug. ***
Created attachment 715227 [details]
dmesg sys-kernel/linux-firmware-20210518
Problem remains with sys-kernel/linux-firmware-20210518 and sys-kernel/gentoo-sources-5.12.10.
There is an upstream patch available which should fix this:
https://patchwork.freedesktop.org/patch/433701/
But I found this patch included in 5.12.10, so there may be another issue.

(In reply to tomtom69 from comment #14)
> There is an upstream patch available which should fix this
> https://patchwork.freedesktop.org/patch/433701/

I am curious: does that mean that functionality that was previously available on this hardware is now disabled, as planned obsolescence?

> But I found this patch included in 5.12.10, so there may be another issue.

Likely.

Created attachment 715743 [details]
dmesg of chromium errors
[ 4233.080397] amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32774, for process chrome pid 2369 thread chrome:cs0 pid 2393)
[ 4233.080415] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x800114000000 from client 27
[ 4233.080427] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
[ 4233.080431] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 4233.080435] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x1
[ 4233.080438] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4233.080440] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 4233.080442] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4233.080444] amdgpu 0000:05:00.0: amdgpu: RW: 0x1
These kinds of errors may be completely unrelated, but they cause a full stop. Thankfully the system does recover.
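The fields the driver prints after VM_L2_PROTECTION_FAULT_STATUS can be recomputed from the raw register value, which helps when only the raw value is quoted in a report. Below is a minimal Python sketch. The bit positions are an assumption, inferred from the decoded values in the dmesg excerpts in this report (they reproduce those values) rather than taken from AMD documentation, and `FIELDS` and `decode_fault_status` are hypothetical names.

```python
from typing import Dict, Tuple

# name -> (bit offset, width in bits).
# ASSUMPTION: layout inferred from the decoded lines in the dmesg output
# quoted in this report; other GPU generations may use a different layout.
FIELDS: Dict[str, Tuple[int, int]] = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # printed by the driver as "Faulty UTCL2 client ID"
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> Dict[str, int]:
    """Split a raw VM_L2_PROTECTION_FAULT_STATUS value into its fields."""
    return {
        name: (status >> offset) & ((1 << width) - 1)
        for name, (offset, width) in FIELDS.items()
    }


# Value taken from the chromium dmesg excerpt above.
decoded = decode_fault_status(0x00341051)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

For 0x00341051 this gives client ID 0x8 with PERMISSION_FAULTS 0x5 and RW 0x1, matching the lines the kernel printed.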
Created attachment 723235 [details]
dmesg of chromium errors (5.13.1/20210629)
Issues with 5.13.1/20210629 and Chromium remain.
(In reply to Stefan de Konink from comment #17)
> Created attachment 723235 [details]
> dmesg of chromium errors (5.13.1/20210629)
>
> Issues with 5.13.1/20210629 and Chromium remain.

Thanks for testing. I was not able to test the latest version because I haven't had time to fight those bugs (especially because they are so annoying). Still on 20210315.

I switched back as well. I still have to confirm whether it would be possible to test this change without recompiling the kernel.

For now we have to mask:

# amdgpu
=sys-kernel/linux-firmware-20210315
=sys-kernel/linux-firmware-20210518
=sys-kernel/linux-firmware-20210629

=sys-kernel/linux-firmware-20210208 was the last good firmware for amdgpu.

(In reply to Andrew Aladjev from comment #20)
> For now we have to mask:
>
> # amdgpu
> =sys-kernel/linux-firmware-20210315

What issues have you experienced with 20210315? I "barely" have issues with this one. The issues that I still have on my platform are suspend-resume, and a reproducible kernel panic at powerdown when the device woke up from a cold suspend. I wonder if it is the OpenCL issues (I experienced them with tesseract last year) or if it is a general issue with Raven Ridge (being unsupported now upstream for ROC).
https://bugs.gentoo.org/764605

I added a mask file for amdgpu several months ago when I got a random hang (just "ring gfx timeout" without additional info) with the old kernel 5.10 and firmware 20210315. Then I upgraded the firmware to 20210511 and the kernel to 5.12 and got a consistent hang (VM_L2_PROTECTION_FAULT_STATUS + "ring gfx timeout"), so I added 20210511 to the mask file; same thing for 20210629. So firmware 20210208 is the island of "stability". If you want a stable GPU, then please do not use amdgpu (at least for now). Please review this issue: https://gitlab.freedesktop.org/drm/amd/-/issues/892. This issue is the volcano of amdgpu linux user "experience".
You can grep the linux sources for "TIMEOUT_FOR_FLIP_PENDING", find the "dcn20_hwseq.c" file and review the quality of the "code" around it. You will immediately feel what "dcn20_pipe_control_lock", "dcn20_enable_stream_timing", "dcn20_update_dchubp_dpp", "dcn20_enable_plane", "dcn20_update_mpcc" smell like. This code is experimental; it was not designed to be stable and it won't become stable. This code should be rewritten completely by amd core developers. This rewrite may happen in the next 5 years. If you want a stable GPU for now, then use radeon (famous r600/r700/etc) <= gcn 1.0.

Some GPU firmware files (*sdma.bin) were now reverted upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=d843e520a4b0d92b986645548d11ade3b9b239a4
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=99d72504bff7ab40c261b8509c0b9d8abf98b296
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=d7b50e61669dc137924337d03d09b8986eb752a3
I also found out that only the picasso_sdma.bin file from newer versions caused the issue here. So I use the current firmware files and only keep picasso_sdma.bin from linux-firmware-20210315:
https://gitlab.freedesktop.org/drm/amd/-/issues/1609
Hopefully these upstream patches fix the problem for now, as soon as they arrive in the portage tree (however, it is only an intermediate solution, not a real bugfix).

I've been running version 20210818 for 4 days now; it seems the issue is gone.

I had freezing, and a way to reproduce it was to emerge qtwebengine with jumbo-build. My system would even lose a couple of minutes during the freeze, and KDE would crash, sometimes losing me OpenCL...
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2)
I am finding wayland a bit messy, but as for the fix, I'm afraid I couldn't determine whether it has pushed graphics memory to my swap file. Could protected video RAM be swapped by accident?
Regardless, the only fix I found was adding another 16 GB of RAM to the existing 8 GB, and the issues all went away... Not exactly the best fix, but a fix all the same... Things are still not quite right and wayland is a bit buggy, but it seems to be a lack of RAM and some sort of swap space issue on my laptop...

Currently running 5.15.2 and linux-firmware-20211027, kernel panics are back.

Using 5.15.4-gentoo, and the latest firmware. Currently not (yet) crashing.
[ 4151.697052] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 4151.697069] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e7ef000 from IH client 0x12 (VMC)
[ 4151.697079] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140450
[ 4151.697112] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: VCN (0x2)
[ 4151.697115] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4151.697117] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4151.697119] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 4151.697122] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4151.697124] amdgpu 0000:05:00.0: amdgpu: RW: 0x1
[ 7513.289273] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289288] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e7ef000 from IH client 0x12 (VMC)
[ 7513.289302] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140451
[ 7513.289306] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: VCN (0x2)
[ 7513.289309] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x1
[ 7513.289314] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.289317] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 7513.289319] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.289353] amdgpu 0000:05:00.0: amdgpu: RW: 0x1
[ 7513.289393] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289424] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289457] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140450
[ 7513.289460] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: VCN (0x2)
[ 7513.289463] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.289465] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.289468] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 7513.289470] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.289473] amdgpu 0000:05:00.0: amdgpu: RW: 0x1
[ 7513.289501] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289508] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289543] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289547] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289550] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.289593] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.289596] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.289598] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.289601] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7513.289634] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289649] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289672] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289674] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289677] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.289679] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.289681] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.289683] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.289685] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7513.289796] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289803] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289814] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289816] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289818] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.289820] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.289823] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.289825] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.289827] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7513.289883] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289888] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289903] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.289906] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.289909] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.289912] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.289914] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.289915] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.289917] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7513.289928] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.289943] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.289983] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290022] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290024] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.290026] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.290029] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.290030] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.290032] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7513.290043] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.290058] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.290068] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290071] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290074] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.290076] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.290078] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.290080] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.290081] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7513.290087] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.290092] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.290129] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290132] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290134] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.290136] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.290138] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.290140] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.290142] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7513.290196] amdgpu 0000:05:00.0: amdgpu: [mmhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32775, for process chrome pid 1636 thread chrome:cs0 pid 1662)
[ 7513.290202] amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x000080011e5e1000 from IH client 0x12 (VMC)
[ 7513.290214] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 7513.290216] amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
[ 7513.290218] amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
[ 7513.290220] amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7513.290222] amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 7513.290224] amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7513.290226] amdgpu 0000:05:00.0: amdgpu: RW: 0x0
[ 7518.747834] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 7518.747846] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 7523.788343] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Since my last report I have had no problems with my GPU; I am now running version 20211216.

To people who had similar problems: if any other version causes problems, file reports for that version. Version 20210511 is no longer available in the tree, so I am closing this.
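For anyone filing a follow-up report for another firmware version, a short summary of the amdgpu errors in a saved dmesg dump can sit alongside the full attachment. The Python sketch below counts retry page faults per hub and process, ring timeouts, and fence timeouts. The regular expressions are assumptions based only on the message formats quoted in this report, and `summarize_amdgpu_errors` is a hypothetical helper name; other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the dmesg lines quoted in this report (assumption:
# the wording may differ between kernel versions).
FAULT_RE = re.compile(
    r"\[(\w+hub\d+)\] retry page fault \(.*?for process (\S+) pid (\d+)"
)
RING_TIMEOUT_RE = re.compile(r"ring (\w+) timeout")
FENCE_TIMEOUT_RE = re.compile(r"Waiting for fences timed out")


def summarize_amdgpu_errors(dmesg_text: str) -> dict:
    """Count amdgpu page faults and timeouts found in dmesg output."""
    faults = Counter(
        (hub, process) for hub, process, _pid in FAULT_RE.findall(dmesg_text)
    )
    ring_timeouts = Counter(RING_TIMEOUT_RE.findall(dmesg_text))
    return {
        "page_faults": dict(faults),          # (hub, process) -> count
        "ring_timeouts": dict(ring_timeouts),  # ring name -> count
        "fence_timeouts": len(FENCE_TIMEOUT_RE.findall(dmesg_text)),
    }
```

To use it, pass the text of a saved dump, for example the contents of a file written with `dmesg > dmesg.txt`, to `summarize_amdgpu_errors` and print the result.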