Bug 720044 - sys-kernel/gentoo-sources-5.4.36 amdgpu driver: "[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout" on AMD RX 560 and graphics card crash
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Gentoo Linux bug wranglers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-29 20:52 UTC by Alistair Boyle
Modified: 2020-05-11 14:43 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
emerge --info (emerge-info.txt,7.70 KB, text/plain)
2020-04-29 21:17 UTC, Alistair Boyle
Gentoo logs and debug info (gentoo-amdgpu-debuginfo-20200429.tar.gz,56.51 KB, application/gzip)
2020-04-29 21:30 UTC, Alistair Boyle
Ubuntu logs and debug info (ubuntu-amdgpu-debuginfo-20200425.tar.gz,68.51 KB, application/gzip)
2020-04-29 21:32 UTC, Alistair Boyle

Description Alistair Boyle 2020-04-29 20:52:59 UTC
I am seeing amdgpu crashes on an AMD RX 560 video card.

$ dmesg | grep drm:amdgpu_job_timedout
> [   64.482881] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=1115, emitted seq=1117
> [   64.482883] [drm:amdgpu_job_timedout] *ERROR* Process information: process nextcloud pid 4318 thread nextcloud:cs0 pid 4420
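
To see at a glance which processes are named when the timeout fires, the "Process information" lines can be tallied. This is only a minimal sketch, assuming the message format shown in the output above; the function name gfx_timeout_procs is hypothetical, and process names containing spaces are not handled.

```shell
# Hypothetical helper: read kernel log text on stdin and count, per process,
# how often it is named in an amdgpu_job_timedout "Process information" line.
# Assumes the message layout shown in the dmesg output above.
gfx_timeout_procs() {
    sed -n 's/.*\[drm:amdgpu_job_timedout].*Process information: process \([^ ]*\) pid .*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg | gfx_timeout_procs
```

Each output line is a count followed by a process name, most frequent first.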

The screen becomes corrupted and the graphics card often crashes: the screen is scrambled (mucked-up colours, frozen image), while the mouse pointer is still overlaid and moves a square of shifted colours. After a short period (1-2 sec) the keyboard no longer responds.

The keyboard stops responding once the GPU reset is underway, so Ctrl-Alt-F1 will only get you to a console if you are quick enough. From a console, "/etc/init.d/xdm restart" will often recover to a login screen.

If the amdgpu driver has "recovery" enabled (depends on kernel version, disabled in earlier kernels), the X server sometimes recovers on its own with an X restart and you are returned to the X login screen. One can still SSH in to reboot.
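
Whether recovery is enabled on the running kernel can be read from the module parameters. The sketch below assumes the amdgpu module is loaded and exposes its gpu_recovery parameter under /sys/module/amdgpu/parameters (1 = enabled, 0 = disabled, -1 = driver default); the function name amdgpu_recovery_state is hypothetical.

```shell
# Hypothetical helper: report how amdgpu GPU recovery is configured.
# $1 (optional) overrides the sysfs path; the default is an assumption that
# the amdgpu module is loaded and exposes a gpu_recovery parameter.
amdgpu_recovery_state() {
    param=${1:-/sys/module/amdgpu/parameters/gpu_recovery}
    if [ ! -r "$param" ]; then
        echo "unknown"
        return 1
    fi
    case $(cat "$param") in
        1)  echo "enabled" ;;
        0)  echo "disabled" ;;
        -1) echo "auto" ;;
        *)  echo "unknown" ;;
    esac
}

# Usage:
#   amdgpu_recovery_state
```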

If no "heavy" OpenGL processes are started, it is often possible to get a bit of work done (xfce4-terminal, firefox). Occasionally, corrupted letters overwrite a section of a terminal window; scrolling to redraw the window fixes the corruption. There is occasional tearing/blipping where black squares/triangles show up on the taskbar and window decorations. Neither setting XFCE's drawing mode (xfwm4 --replace --vblank=off/xpresent/glx) nor disabling the window manager compositor (--compositor=off) changes the crash behaviour. Firefox does not seem to experience these oddities. glxinfo and glxgears do not trigger the crash. clgpustress (below), when OpenCL is installed, does not trigger the crash either, but reports an error due to incorrect results returned from the GPU.

Kernels: linux-4.14.166-gentoo, linux-4.19.97-gentoo, linux-5.5.11-gentoo
Mesa: media-libs/mesa-19.3.5, media-libs/mesa-20.0.2
DRM: x11-libs/libdrm-2.4.100
X: x11-base/xorg-server-1.20.7

I have also tried dual booting into an Ubuntu 18.04.4 system, updated as of Apr 23, 2020. The open source amdgpu kernel (Ubuntu kernel is currently 5.3.0-46-generic) gives the same results.
The proprietary amdgpu-pro drivers have two options: an "open stack" and "pro" install. The open stack installs .debs that capture versions of the opensource stack. This "open stack" also fails the same way.

** Installing the amdgpu-pro drivers with full OpenCL support in Ubuntu results in a working system. **
$ tar -Jxvf amdgpu-pro-20.10-1048554-ubuntu-18.04.tar.xz
$ cd amdgpu-pro-20.10-1048554-ubuntu-18.04
$ ./amdgpu-install --pro -y --opencl=legacy,pal
With the amdgpu-pro drivers in Ubuntu, there are still some graphics corruption artifacts (small regions of ~10x10 pixels are scrambled), but the card no longer crashes, the system is usable, and I can successfully run electron-based software and launch steam. Running clgpustress gives correct results.

AMD's "RX 560" support page currently recommends version 12.10 of their amdgpu-pro drivers, and they provide a .tar.xz package of .deb files and an install script. This is the same package used by Gentoo's amdgpu-pro-opencl ebuild.


Previously, clgpustress would report 
> Preparing StressTester for
>   #0 Clover:Radeon RX 560 Series (POLARIS11, DRM 3.36.0, 5.5.11-gentoo, LLVM 9.0.1)
> ...
>     Exception happened: FAILED COMPUTATIONS!!!! PASS #1, Elapsed time: 0:00:01.415

clgpustress can also be run on the CPU (OpenCL using pocl). pocl is no longer available in Gentoo, but on Ubuntu this gives the correct result.

I have an ebuild of clgpustress at
https://github.com/boyle/boyle-portage-tree/tree/master/app-benchmarks/clgpustress

I have been looking through the amdgpu-pro installer and .debs, from which I can infer some version info to match against my Gentoo ebuild. The amdgpu driver they install is a dkms open source module. It looks like AMD based this release on kernel 5.4.7. The work-in-progress ebuild is at

https://github.com/boyle/amdgpu-pro-rx560

I have not yet diffed their kernel driver code against the released kernel and the most recent (5.5.11) kernel. I have confirmed that the linux-firmware (/lib/firmware/amdgpu/polaris11_*.bin) matches the amdgpu-pro-20.10 firmware.
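
The firmware check mentioned above can be repeated with a byte-for-byte comparison of same-named blobs. This is a minimal sketch, not the exact procedure used for this report: the function name compare_firmware is hypothetical, and the second directory stands for wherever the firmware from the amdgpu-pro .debs was unpacked.

```shell
# Hypothetical helper: compare polaris11 firmware blobs in two directories.
# $1 = installed firmware directory (e.g. /lib/firmware/amdgpu)
# $2 = directory holding firmware extracted from the amdgpu-pro .debs
#      (this location is an assumption; use wherever the files were unpacked)
compare_firmware() {
    status=0
    for f in "$1"/polaris11_*.bin; do
        [ -e "$f" ] || continue          # glob matched nothing
        name=${f##*/}
        if [ ! -e "$2/$name" ]; then
            echo "MISSING $name"; status=1
        elif cmp -s "$f" "$2/$name"; then
            echo "SAME $name"
        else
            echo "DIFF $name"; status=1
        fi
    done
    return "$status"
}

# Usage, with $extracted set to the unpacked firmware directory:
#   compare_firmware /lib/firmware/amdgpu "$extracted"
```

The function returns non-zero if any blob differs or is missing on the second side, so it can be used in a script condition.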

The long and the short of it so far: the proprietary drivers work on Ubuntu. The open source stack does not work on Gentoo or Ubuntu.

Hopefully, I can either get the proprietary drivers installed and working on Gentoo, or find the missing sauce to get the open source amdgpu drivers in the kernel working.

Reproducible: Always

Steps to Reproduce:
1. boot, login at X
2. start any "heavy" OpenGL application: steam, electron-based software such as signal-desktop or nextcloud

Actual Results:  
GPU crash, screen corruption, X lock up.



Expected Results:  
Happy computing in X.

Kernel config, logs, emerge --info, to follow.
Comment 1 Alistair Boyle 2020-04-29 21:17:10 UTC
Created attachment 635246
emerge --info
Comment 2 Alistair Boyle 2020-04-29 21:30:40 UTC
Created attachment 635248
Gentoo logs and debug info
Comment 3 Alistair Boyle 2020-04-29 21:32:02 UTC
Created attachment 635250
Ubuntu logs and debug info
Comment 4 Alistair Boyle 2020-04-29 21:39:20 UTC
clgpustress ebuild now available in the same repo: https://github.com/boyle/amdgpu-pro-rx560
Comment 5 Alistair Boyle 2020-04-29 22:26:20 UTC
There seem to be a few "master" upstream bugs tracking similar "gfx timeouts." These are not yet resolved.

https://gitlab.freedesktop.org/drm/amd/-/issues/892
https://gitlab.freedesktop.org/drm/amd/-/issues/934
https://gitlab.freedesktop.org/drm/amd/-/issues/588

Errors with "ring_sdmaX timeout" appear to be a different issue.
Comment 6 Jonas Stein gentoo-dev 2020-04-29 22:32:01 UTC
It is sad to read that you have problems with the hardware/software. The situation seems to be a bit more complicated and requires some analysis.

We can not help you efficiently via bug tracker. The bug tracker is aimed at specific problems in ebuilds rather than at individual systems.

I have had very good experiences on the Gentoo IRC channels [1] with questions like this. Of course there are also forums and mailing lists [2,3].
I hope you understand that I will therefore close the bug here, and I wish you good luck on one of the channels mentioned [4].
Please reopen the ticket if you can point to a specific error in an ebuild or any Gentoo-related product.

Please add the name of the package with the bug in the summary (title) of the bug ticket.

[1] https://www.gentoo.org/get-involved/irc-channels/
[2] https://forums.gentoo.org/
[3] https://www.gentoo.org/get-involved/mailing-lists/all-lists.html
[4] https://www.gentoo.org/support/
Comment 7 Alistair Boyle 2020-05-11 14:43:24 UTC
Bug filed upstream with AMD, since the cause could be in libdrm, mesa, llvm or the open source amdgpu kernel driver. I've marked this bug as "upstream."

https://gitlab.freedesktop.org/drm/amd/-/issues/1141

I've updated the title.

Jonas,
> We can not help you efficiently via bug tracker.

Thanks. I understand.