I am seeing amdgpu crashes on an AMD RX 560 video card.

$ dmesg | grep drm:amdgpu_job_timedout
> [ 64.482881] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=1115, emitted seq=1117
> [ 64.482883] [drm:amdgpu_job_timedout] *ERROR* Process information: process nextcloud pid 4318 thread nextcloud:cs0 pid 4420

The screen becomes corrupted and the graphics card often crashes: the screen is scrambled (mucked up colours, frozen image), while the mouse pointer is still overlaid and moves a square of shifted colours. After a short period (1-2 sec), once the GPU reset is underway, the keyboard stops responding, so Ctrl-Alt-F1 will only get you to a console if you are quick enough. From a console, "/etc/init.d/xdm restart" will often recover to a login screen. If the amdgpu driver has "recovery" enabled (depends on kernel version, disabled in earlier kernels), the X server sometimes recovers on its own with an X restart and you are returned to the X login screen. One can still SSH in to reboot.

If no "heavy" OpenGL processes are started, it is often possible to get a bit of work done (xfce4-terminal, firefox). Occasionally, corrupted letters will overwrite a section of a terminal window; scrolling to redraw the window fixes the corruption. There is occasional tearing/blipping where black squares/triangles show up on the taskbar and window decorations. Neither setting XFCE's drawing mode (xfwm4 --replace --vblank=off/xpresent/glx) nor disabling the window manager compositor (--compositor=off) changes the crash behaviour. Firefox does not seem to experience these oddities.

glxinfo and glxgears do not trigger the crash. clgpustress (below), when OpenCL is installed, does not trigger the crash, but reports an error due to incorrect results returned from the GPU.
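To make the timeout lines easier to compare between boots, the two dmesg lines quoted above can be condensed with a small filter. This is only a sketch: the function name summarize_timeouts is made up for illustration, and the patterns assume the log format is exactly as in the two lines quoted in this report.

```shell
# Condense amdgpu job timeout messages read from stdin.
# Assumption: lines look like the two dmesg lines quoted in this report.
# summarize_timeouts is a hypothetical helper name, not part of any tool.
summarize_timeouts() {
    sed -n \
        -e 's/.*amdgpu_job_timedout] \*ERROR\* ring \([^ ]*\) timeout, signaled seq=\([0-9]*\), emitted seq=\([0-9]*\).*/ring=\1 signaled=\2 emitted=\3/p' \
        -e 's/.*amdgpu_job_timedout] \*ERROR\* Process information: process \([^ ]*\) pid \([0-9]*\).*/process=\1 pid=\2/p'
}

# Usage (not run here):
#   dmesg | summarize_timeouts
```

Each matching line is reduced to the ring name with its signaled and emitted sequence numbers, or to the process name and pid; all other kernel messages are dropped.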
Kernels: linux-4.14.166-gentoo, linux-4.19.97-gentoo, linux-5.5.11-gentoo
Mesa: media-libs/mesa-19.3.5, media-libs/mesa-20.0.2
DRM: x11-libs/libdrm-2.4.100
X: x11-base/xorg-server-1.20.7

I have also tried dual booting into an Ubuntu 18.04.4 system, updated as of Apr 23, 2020. The open source amdgpu kernel driver (Ubuntu kernel is currently 5.3.0-46-generic) gives the same results. The proprietary amdgpu-pro drivers have two install options: an "open stack" and a "pro" install. The "open stack" option installs .debs that capture versions of the open source stack. This "open stack" also fails the same way.

** Installing the amdgpu-pro drivers with full OpenCL support in Ubuntu results in a working system. **

$ tar -Jxvf amdgpu-pro-20.10-1048554-ubuntu-18.04.tar.xz
$ cd amdgpu-pro-20.10-1048554-ubuntu-18.04
$ ./amdgpu-install --pro -y --opencl=legacy,pal

With the amdgpu-pro drivers in Ubuntu, there are still some graphics corruption artifacts (small regions of ~10x10 pixels are scrambled), but the card no longer crashes, the system is usable, and I can successfully run electron-based software and launch steam. Running clgpustress gives correct results.

AMD's "RX 560" support page currently recommends version 12.10 of their amdgpu-pro drivers, and they provide a .tar.xz package of .deb files and an install script. This is the same package used by Gentoo's amdgpu-pro-opencl ebuild.

Previously, clgpustress would report
> Preparing StressTester for
> #0 Clover:Radeon RX 560 Series (POLARIS11, DRM 3.36.0, 5.5.11-gentoo, LLVM 9.0.1)
> ...
> Exception happened: FAILED COMPUTATIONS!!!! PASS #1, Elapsed time: 0:00:01.415

clgpustress can also be run on the CPU (OpenCL using pocl). pocl is no longer available in Gentoo, but on Ubuntu this gives the correct result.
I have an ebuild of clgpustress at
https://github.com/boyle/boyle-portage-tree/tree/master/app-benchmarks/clgpustress

I have been looking through the amdgpu-pro installer and .debs, from which I can infer some version info to match against my Gentoo ebuild. The amdgpu driver they install is a dkms open source module. It looks like AMD based this release on kernel 5.4.7. The work-in-progress ebuild is at
https://github.com/boyle/amdgpu-pro-rx560

I have not yet diffed their kernel driver code against the released kernel and the most recent (5.5.11) kernel. I have confirmed that the linux-firmware (/lib/firmware/amdgpu/polaris11_*.bin) matches the amdgpu-pro-20.10 firmware.

The long and the short of it so far: the proprietary drivers work on Ubuntu. The open source stack does not work on Gentoo or Ubuntu. Hopefully, I can either get the proprietary drivers installed and working on Gentoo, or find the missing sauce to get the open source amdgpu drivers in the kernel working.

Reproducible: Always

Steps to Reproduce:
1. boot, login at X
2. start any "heavy" OpenGL application: steam, electron-based software such as signal-desktop or nextcloud

Actual Results:
GPU crash, screen corruption, X lock up.

Expected Results:
Happy computing in X.

Kernel config, logs, emerge --info, to follow.
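For anyone wanting to repeat the firmware check mentioned above, a byte-for-byte comparison along these lines should do. This is a rough sketch, not the exact procedure used for this report: compare_firmware is a hypothetical helper, and the second directory is assumed to be wherever the firmware files from the amdgpu-pro .debs have been unpacked by hand.

```shell
# Compare polaris11 firmware blobs between two directories, byte for byte.
# $1: system firmware directory, e.g. /lib/firmware/amdgpu
# $2: directory holding firmware unpacked from the amdgpu-pro .debs
#     (assumption: extracted manually; the location is up to you)
# compare_firmware is a hypothetical helper name.
compare_firmware() {
    sys_dir=$1
    pro_dir=$2
    result=0
    for f in "$sys_dir"/polaris11_*.bin; do
        [ -e "$f" ] || continue          # glob matched nothing
        name=$(basename "$f")
        if [ ! -f "$pro_dir/$name" ]; then
            echo "MISSING $name"
            result=1
        elif cmp -s "$f" "$pro_dir/$name"; then
            echo "SAME $name"
        else
            echo "DIFF $name"
            result=1
        fi
    done
    return "$result"
}

# Usage (not run here; the second path is a placeholder):
#   compare_firmware /lib/firmware/amdgpu ./extracted-firmware
```

The function returns non-zero if any file differs or is missing on the amdgpu-pro side, so it can be used directly in a shell conditional.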
Created attachment 635246: emerge --info
Created attachment 635248: Gentoo logs and debug info
Created attachment 635250: Ubuntu logs and debug info
clgpustress ebuild now available in the same repo: https://github.com/boyle/amdgpu-pro-rx560
There seem to be a few "master" upstream bugs tracking similar "gfx timeouts." These are not yet resolved.
https://gitlab.freedesktop.org/drm/amd/-/issues/892
https://gitlab.freedesktop.org/drm/amd/-/issues/934
https://gitlab.freedesktop.org/drm/amd/-/issues/588

Errors with "ring_sdmaX timeout" appear to be a different issue.
I am sorry to read that you are having problems with this hardware/software. The situation seems to be a bit more complicated and requires some analysis. We can not help you efficiently via bug tracker. The bug tracker is aimed at specific problems in ebuilds rather than at individual systems.

I have had very good experiences on the Gentoo IRC channels [1] with questions like this. There are also the forums and mailing lists [2,3]. I hope you understand that I will therefore close the bug here, and I wish you good luck on one of the mentioned channels [4].

Please reopen the ticket if you can point to a specific error in an ebuild or any Gentoo-related product, and add the name of the affected package to the summary (title) of the bug ticket.

[1] https://www.gentoo.org/get-involved/irc-channels/
[2] https://forums.gentoo.org/
[3] https://www.gentoo.org/get-involved/mailing-lists/all-lists.html
[4] https://www.gentoo.org/support/
Bug filed upstream with AMD, since the cause could be libdrm, mesa, llvm or the open source amdgpu kernel driver. I've marked this bug as "upstream."
https://gitlab.freedesktop.org/drm/amd/-/issues/1141

I've updated the title.

Jonas,
> We can not help you efficiently via bug tracker.
Thanks. I understand.